Re: Using MapReduce to do table comparing.
I agree with you that this is an acceptable method if the time spent exporting data from the RDBMS, importing the file into HDFS and then importing the data into the RDBMS again is considered as well, but this is a single-process/thread method. BTW, can you tell me how long your method takes to process those 130 million rows, how large the data volume is, and how powerful your physical computers are? Thanks a lot! -- From: Michael Lee [EMAIL PROTECTED] Sent: Thursday, July 24, 2008 11:51 AM To: core-user@hadoop.apache.org Subject: Re: Using MapReduce to do table comparing. Amber wrote: We have a 10 million row table exported from an AS400 mainframe every day; the table is exported as a csv text file, which is about 30GB in size, then the csv file is imported into an RDBMS table which is dropped and recreated every day. Now we want to find how many rows are updated during each export-import interval. The table has a primary key, so deletes and inserts can be found using RDBMS joins quickly, but we must do a column-to-column comparison in order to find the differences between rows (about 90%) with the same primary keys. Our goal is to find a comparing process which takes no more than 10 minutes on a 4-node cluster, each server of which has 4 4-core 3.0 GHz CPUs, 8GB memory and a 300GB local RAID5 array. Below is our current solution: the old data is kept in the RDBMS with an index created on the primary key; the new data is imported into HDFS as the input file of our Map-Reduce job. Every map task connects to the RDBMS database and selects the old data from it for every row; map tasks generate outputs if differences are found, and there are no reduce tasks. As you can see, with the number of concurrent map tasks increasing, the RDBMS database will become the bottleneck, so we want to kick out the RDBMS, but we have no idea how to retrieve the old row with a given key quickly from HDFS files; any suggestion is welcome. 10 million is not bad. 
I do this all the time in UDB 8.1 - multiple key columns and multiple value columns, calculating deltas: insert, delete and update. What others have suggested works (I tried a very crude version of what James Moore suggested in Hadoop with 70+ million records) but you have to remember there are other costs (dumping out files, putting them into HDFS, etc.). It might be better to process straight in the database or do straight file processing. The other key is avoiding transactions. If you are doing it outside the database, you have 'old.csv' and 'new.csv', both sorted by primary key (when you extract, make sure you do ORDER BY). In your application, you open two file handles and read one line at a time. Create the keys. If the keys are the same, you compare whether the two lines are identical. If the keys are not the same, the natural ordering tells you whether it is an insert or a delete. Once you decide, you read another line (for insert/delete you only read one line, from one of the files). Here is the pseudo code:

oldFile = File.new(oldFilename, "r")
newFile = File.new(newFilename, "r")
outFile = File.new(outFilename, "w")
oldLine = oldFile.gets
newLine = newFile.gets
while ( true ) {
  oldKey = convertToKey(oldLine)
  newKey = convertToKey(newLine)
  if ( oldKey < newKey ) {
    ## old key is gone - it is a deletion
    outFile.puts oldLine + "," + "DELETE"
    oldLine = oldFile.gets
  } elsif ( oldKey > newKey ) {
    ## new key is unseen - it is an insert
    outFile.puts newLine + "," + "INSERT"
    newLine = newFile.gets
  } else {
    ## same key - compare the full lines
    outFile.puts newLine + "," + "UPDATE" if ( oldLine != newLine )
    oldLine = oldFile.gets
    newLine = newFile.gets
  }
}

Okay - I skipped the part where EOF is reached in each file, but you get the point. If both old and new are in the database, you can open two database connections and just do the processing without dumping files. I journal about 130 million rows every day for a quant financial database...
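For reference, here is a runnable sketch of that merge-compare, with the skipped EOF handling filled in. It assumes both files are sorted by the key in the first CSV column and compares keys as strings; the method and file names are illustrative, not from the original mail.

```ruby
# Merge-compare two CSV files sorted by primary key (first column),
# emitting DELETE / INSERT / UPDATE rows. A nil line means that file
# is exhausted, which covers the EOF cases the pseudo code skips.
def diff_sorted(old_path, new_path, out_path)
  old_f = File.open(old_path)
  new_f = File.open(new_path)
  File.open(out_path, "w") do |out|
    old_line = old_f.gets&.chomp
    new_line = new_f.gets&.chomp
    while old_line || new_line
      old_key = old_line && old_line.split(",", 2).first
      new_key = new_line && new_line.split(",", 2).first
      if new_line.nil? || (old_line && old_key < new_key)
        out.puts "#{old_line},DELETE"   # key vanished from the new extract
        old_line = old_f.gets&.chomp
      elsif old_line.nil? || old_key > new_key
        out.puts "#{new_line},INSERT"   # key appears only in the new extract
        new_line = new_f.gets&.chomp
      else
        out.puts "#{new_line},UPDATE" if old_line != new_line
        old_line = old_f.gets&.chomp
        new_line = new_f.gets&.chomp
      end
    end
  end
ensure
  old_f&.close
  new_f&.close
end
```

Called as diff_sorted("old.csv", "new.csv", "out.csv"), each output row is the affected line tagged with its delta type, matching the journal format sketched above.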
Re: Hadoop and Ganglia Metrics
Check out https://issues.apache.org/jira/browse/HADOOP-3422

Joe Williams wrote: I have been attempting to get Hadoop metrics into Ganglia and have been unsuccessful thus far. I have seen this thread (http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200712.mbox/raw/[EMAIL PROTECTED]/) but it didn't help much. I have set up my properties file like so:

[EMAIL PROTECTED] current]# cat conf/hadoop-metrics.properties
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=127.0.0.1:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=127.0.0.1:8649

And if I 'telnet 127.0.0.1 8649' I receive the Ganglia XML metrics output without any Hadoop-specific metrics:

[EMAIL PROTECTED] current]# telnet 127.0.0.1 8649
Trying 127.0.0.1...
Connected to localhost (127.0.0.1).
Escape character is '^]'.
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
<!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>
<!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>
<!ATTLIST GANGLIA_XML SOURCE CDATA #REQUIRED>
--SNIP--

Is there more I need to do to get the metrics to show up in this output, or am I doing something incorrectly? Do I need to have a gmetric script run in a cron job to update the stats? If so, does anyone have a Hadoop-specific example of this? Any info would be helpful. Thanks. -Joe
Re: Using MapReduce to do table comparing.
Yes, I think this is the simplest method, but there are problems too: 1. The reduce stage doesn't begin until the map stage ends, by which point we have already done two table scans, and the comparing will take almost the same time again, because about 90% of the intermediate key/value pairs will have two values. If I could specify a number n so that the reduce task for a key starts as soon as there are n intermediate pairs with that key, that would be better; in my case I would set the magic number to 2. 2. I am not sure how Hadoop stores intermediate key/value pairs; we could not afford it, as the data volume increases, if they are kept in memory. -- From: James Moore [EMAIL PROTECTED] Sent: Thursday, July 24, 2008 1:12 AM To: core-user@hadoop.apache.org Subject: Re: Using MapReduce to do table comparing. On Wed, Jul 23, 2008 at 7:33 AM, Amber [EMAIL PROTECTED] wrote: We have a 10 million row table exported from an AS400 mainframe every day; the table is exported as a csv text file, which is about 30GB in size, then the csv file is imported into an RDBMS table which is dropped and recreated every day. Now we want to find how many rows are updated during each export-import interval. The table has a primary key, so deletes and inserts can be found using RDBMS joins quickly, but we must do a column-to-column comparison in order to find the differences between rows (about 90%) with the same primary keys. Our goal is to find a comparing process which takes no more than 10 minutes on a 4-node cluster, each server of which has 4 4-core 3.0 GHz CPUs, 8GB memory and a 300GB local RAID5 array. Below is our current solution: the old data is kept in the RDBMS with an index created on the primary key; the new data is imported into HDFS as the input file of our Map-Reduce job. Every map task connects to the RDBMS database and selects the old data from it for every row; map tasks generate outputs if differences are found, and there are no reduce tasks. 
As you can see, with the number of concurrent map tasks increasing, the RDBMS database will become the bottleneck, so we want to kick out the RDBMS, but we have no idea how to retrieve the old row with a given key quickly from HDFS files; any suggestion is welcome.

Think of map/reduce as giving you a kind of key/value lookup for free - it just falls out of how the system works. You don't care about the RDBMS. It's a distraction - you're given a set of csv files with unique keys and dates, and you need to find the differences between them. Say the data looks like this:

File for jul 10:
0x1,stuff
0x2,more stuff

File for jul 11:
0x1,stuff
0x2,apples
0x3,parrot

Preprocess the csv files to add dates to the values:

File for jul 10:
0x1,20080710,stuff
0x2,20080710,more stuff

File for jul 11:
0x1,20080711,stuff
0x2,20080711,apples
0x3,20080711,parrot

Feed two days' worth of these files into a hadoop job. The mapper splits these into k=0x1, v=20080710,stuff etc. The reducer gets one or two v's per key, and each v has the date embedded in it - that's essentially your lookup step. You'll end up with a system that can do compares for any two dates, and could easily be expanded to do all sorts of deltas across these files. The preprocess-the-files-to-add-a-date step can probably be included as part of your mapper and isn't really a separate step - it just depends on how easy it is to use one of the off-the-shelf mappers with your data. If it turns out to be its own step, it can become a very simple hadoop job. -- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com
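A local, non-Hadoop sketch of the reduce step James describes may make it concrete. The delta labels and the helper name are my own assumptions; the date-tagged values follow his example:

```ruby
# For one key, the reducer sees one or two "date,row" values.
# One value  -> the row exists on only one date (INSERT or DELETE);
# two values -> the key exists on both dates, so compare the rows.
def classify(dated_values, old_date, new_date)
  by_date = dated_values.map { |v| v.split(",", 2) }.to_h
  if by_date.size == 1
    by_date.key?(new_date) ? "INSERT" : "DELETE"
  elsif by_date[old_date] != by_date[new_date]
    "UPDATE"
  end  # nil => unchanged, emit nothing
end

# The 0x2 key from the example: same key on both dates, different rows.
classify(["20080710,more stuff", "20080711,apples"], "20080710", "20080711")
```

Because the framework groups all values for a key before the reducer runs, this per-key check is the whole "lookup" James refers to; no RDBMS round trip is involved.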
Re: hadoop 0.17.1 reducer not fetching map output problem
Could you try to kill the tasktracker hosting the task the next time it happens? I just want to isolate the problem - whether it is a problem in the TT-JT communication or in the Task-TT communication. From your description it looks like the problem is in the JT-TT communication. But pls run the experiment when it happens again and let us know what happens. Thanks, Devaraj

On 7/24/08 1:42 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote: Hi! I'm experiencing hung reducers, with the following symptoms:

Task Logs: 'task_200807230647_0008_r_09_1' stdout logs stderr logs syslog logs

red.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
2008-07-24 07:56:11,064 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
2008-07-24 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
--SNIP-- (the same four messages repeat every five seconds)
2008-07-24 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)

Notice how it needs 6 map outputs, all map tasks have finished, and it still just hangs there. The second speculative copy of that reducer task needs 14 map outputs with the same messages :(
Re: Hadoop and Ganglia Metrics
Thanks Jason. Until this is implemented, how are you pulling stats from Hadoop? -joe

Jason Venner wrote: Check out https://issues.apache.org/jira/browse/HADOOP-3422

Joe Williams wrote: I have been attempting to get Hadoop metrics into Ganglia and have been unsuccessful thus far. --SNIP-- -- Name: Joseph A. Williams Email: [EMAIL PROTECTED]
Re: Hadoop and Ganglia Metrics
I applied the patch in the jira to my distro.

Joe Williams wrote: Thanks Jason, until this is implemented how are you pulling stats from Hadoop? -joe

Jason Venner wrote: Check out https://issues.apache.org/jira/browse/HADOOP-3422

Joe Williams wrote: I have been attempting to get Hadoop metrics into Ganglia and have been unsuccessful thus far. --SNIP--
Any way to order all the output folders?
Hi All, There are 30 output folders using Hadoop. Each folder is in ascending order, but the order is not ascending among folders; for example, the values are 1, 5, 10 in folder A and 6, 8, 9 in folder B. My question is how to enforce the order among all the folders as well, so that the output values are 1, 5, 6 in folder A and 8, 9, 10 in folder B. I have just started to learn Hadoop and hope you can help me. :) Thanks Shane
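One standard technique for a global order (this is my own suggestion, not something from this thread, and the split point below is invented for the example) is range partitioning: route each key to the partition that owns its range, so every key in part 0 sorts before every key in part 1, while each part is still sorted internally by the framework. A toy sketch:

```ruby
# Range partitioning: each key goes to the partition whose key range
# contains it, so folder i holds only keys smaller than every key in
# folder i+1. SPLITS is a made-up example split point.
SPLITS = [7].freeze  # keys < 7 -> partition 0, keys >= 7 -> partition 1

def partition_for(key)
  # Number of split points the key has passed = its partition index.
  SPLITS.count { |s| key >= s }
end

# Shane's example values land in globally ordered folders:
[1, 5, 10, 6, 8, 9].group_by { |k| partition_for(k) }
# folder 0 receives 1, 5, 6 and folder 1 receives 10, 8, 9;
# sorting within each folder then yields a total order across folders.
```

Choosing good split points usually means sampling the key distribution first, so that each folder gets a similar share of the data.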
Re: Name node heap space problem
Update on this one... I put some more memory into the machine running the name node. Now fsck is running. Unfortunately ls fails with a time-out. I identified one directory that causes the trouble: I can run fsck on it but not ls. What could be the problem? Gert

Gert Pfeifer wrote: Hi, I am running a Hadoop DFS on a cluster of 5 data nodes, with a name node and one secondary name node. I have 1788874 files and directories, 1465394 blocks = 3254268 total. Max heap size is 3.47 GB. My problem is that I produce many small files, so I have a cron job which runs daily across the new files, copies them into bigger files and deletes the small files. Apart from this program, even an fsck kills the cluster. The problem is that, as soon as I start this program, the heap space of the name node reaches 100%. What could be the problem? There are not many small files right now and still it doesn't work. I guess we have had this problem since the upgrade to 0.17. Here is some additional data about the DFS:

Capacity: 2 TB
DFS Remaining: 1.19 TB
DFS Used: 719.35 GB
DFS Used%: 35.16 %

Thanks for hints, Gert
Anybody used AppNexus for hosting Hadoop app?
I discovered AppNexus yesterday. They offer hosting similar to Amazon EC2, with apparently more dedicated hardware and a better notion of where things are in the datacenter. Their web site says they are optimized for Hadoop applications. Has anybody tried it and could give some feedback? J.
Re: can hadoop read files backwards
I need some help with the implementation. To have the mapper produce key=id, value=type,timestamp, which is essentially string, string: what do I give output.collect for the value? I want to store type, timestamp but it only takes Text, IntWritable, and I want to store Text, Text - or what can I store in there? Here is my mapper, which doesn't work because output.collect doesn't want Text, Text:

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private Text Key = new Text();
  private Text Value = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    // line is parsed and now i have 2 strings
    // String S1; // contains the key
    // String S2; // contains the value
    Key.set(S1);
    Value.set(S2);
    output.collect(Key, Value);
  }
}

Miles Osborne wrote: unless you have a gigantic number of items with the same id, this is straightforward. have a mapper emit items of the form: key=id, value=type,timestamp and your reducer will then see all ids that have the same value together. it is then a simple matter to process all items with the same id. for example, you could simply read them into a list and work on them in any manner you see fit. (note that hadoop is perfectly fine at dealing with multi-line items. all you need do is make sure that the items you want to process together all share the same key) Miles

2008/7/18 Elia Mazzawi [EMAIL PROTECTED]: well here is the problem I'm trying to solve. I have a data set that looks like this:

ID  type  Timestamp
A1  X     1215647404
A2  X     1215647405
A3  X     1215647406
A1  Y     1215647409

I want to count how many A1 Y show up within 5 seconds of an A1 X. I was planning to have the data sorted by ID then timestamp, then read it backwards (or have it sorted by reverse timestamp) and go through it caching all Y's for the same ID for 5 seconds to either find a matching X or not. 
The results don't need to be 100% accurate, so if hadoop gives the same file with the same lines in order then this will work. It seems hadoop is really good at solving problems that depend on one line at a time, but not on multiple lines? Hadoop has to get data in order and be able to work on multiple lines, otherwise how could it be setting records in data sorting? I'd appreciate other suggestions on how to go about doing this.

Jim R. Wilson wrote: does wordcount get the lines in order? or are they random? can i have hadoop return them in reverse order? You can't really depend on the order that the lines are given - it's best to think of them as random. The purpose of MapReduce/Hadoop is to distribute a problem among a number of cooperating nodes. The idea is that any given line can be interpreted separately, completely independently of any other line. So in wordcount, this makes sense. For example, say you and I are nodes. Each of us gets half the lines in a file and we can count the words we see and report on them - it doesn't matter what order we're given the lines, or which lines we're given, or even whether we get the same number of lines (if you're faster at it, or maybe you get shorter lines, you may get more lines to process in the interest of saving time). So if the project you're working on requires getting the lines in a particular order, then you probably need to rethink your approach. It may be that hadoop isn't right for your problem, or maybe that the problem just needs to be attacked in a different way. Without knowing more about what you're trying to achieve, I can't offer any specifics. Good luck! -- Jim

On Thu, Jul 17, 2008 at 4:41 PM, Elia Mazzawi [EMAIL PROTECTED] wrote: I have a program based on wordcount.java and I have files that are smaller than 64mb (so I believe each file is one task). Does wordcount get the lines in order, or are they random? Can I have hadoop return them in reverse order? Jim R. 
Wilson wrote: It sounds to me like you're talking about hadoop streaming (correct me if I'm wrong there). In that case, there's really no order to the lines being doled out as I understand it. Any given line could be handed to any given mapper task running on any given node. I may be wrong, of course, someone closer to the project could give you the right answer in that case. -- Jim R. Wilson (jimbojw) On Thu, Jul 17, 2008 at 4:06 PM, Elia Mazzawi [EMAIL PROTECTED] wrote: is there a way to have hadoop hand over the lines of a file backwards to my mapper ? as in give the last line first.
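Following Miles's suggestion of grouping every (type, timestamp) pair under its ID, the per-ID work inside the reducer could look like this local sketch. The method name and the five-second window parameter are my own; the field layout follows the thread's example:

```ruby
# Given every (type, timestamp) pair the reducer receives for one ID,
# count the Y events that occur within `window` seconds after some X
# event for that same ID. Because the framework delivers all of an ID's
# events together, no backwards reading or global file order is needed.
def count_y_near_x(events, window = 5)
  x_times = events.select { |type, _| type == "X" }.map { |_, ts| ts }
  events.count do |type, ts|
    type == "Y" && x_times.any? { |x| ts >= x && ts - x <= window }
  end
end

# The A1 events from the example: the Y at 1215647409 falls within
# 5 seconds of the X at 1215647404.
count_y_near_x([["X", 1215647404], ["Y", 1215647409]])
```

Summing this count over all IDs in the reduce phase gives the total the thread is after, and since the logic never looks across IDs, the approximate results Elia is willing to accept come out exact.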
Re: hadoop 0.17.1 reducer not fetching map output problem
On Thursday 24 July 2008 15:19:22 Devaraj Das wrote: Could you try to kill the tasktracker hosting the task the next time it happens? I just want to isolate the problem - whether it is a problem in the TT-JT communication or in the Task-TT communication. From your description it looks like the problem is in the JT-TT communication. But pls run the experiment when it happens again and let us know what happens.

Well, I did restart the tasktracker where the reduce job was running, but that led only to a situation where the jobtracker did not restart the job, showed it as still running, and I was not able to kill the reduce task via hadoop job -kill-task nor -fail-task. I hope to avoid a repeat; I'll be rolling our cluster back to 0.15 today. A peer at another startup confirmed the whole batch of problems I've been experiencing, and for him 0.15 works for production.

<rant-mode> No question, 0.17 is way better than 0.16; on the other hand I wonder how 0.16 could get released? (I'm using streaming.jar, and with 0.16.x I introduced reducing to our workloads, and 0.16 failed 80% of the jobs with reducers not being able to get their output. 0.17.0 improved that to a point where one can, with some pain, e.g. restarting the cluster daily and storing nothing important on HDFS, only temporary data, use it somehow for production, at least for small jobs.) So one wonders how 0.16 got released? Or was it meant only as a developer-only bug-fixing series? </rant-mode>

Sorry, this has been driving me up the walls and into an asylum till I compared notes with a colleague and decided that I'm not crazy ;) Andreas

Thanks, Devaraj

On 7/24/08 1:42 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote: Hi! I'm experiencing hung reducers, with the following symptoms: Task Logs: 'task_200807230647_0008_r_09_1' --SNIP--
Bean Scripting Framework?
Hello all. Has anybody ever tried/considered using the Bean Scripting Framework within Hadoop? BSF seems nice since it allows two-way communication between ruby and java. I'd love to hear your thoughts as I've been trying to make this work to allow using ruby in the m/r pipeline. For now, I don't need a fully general solution, I'd just like to call some ruby in my map or reduce tasks. Thanks! -lincoln -- lincolnritter.com
Re: hadoop 0.17.1 reducer not fetching map output problem
On 7/25/08 12:09 AM, Andreas Kostyrka [EMAIL PROTECTED] wrote: On Thursday 24 July 2008 15:19:22 Devaraj Das wrote: Could you try to kill the tasktracker hosting the task the next time it happens? --SNIP-- Well, I did restart the tasktracker where the reduce job was running, but that led only to a situation where the jobtracker did not restart the job, showed it as still running, and I was not able to kill the reduce task via hadoop job -kill-task nor -fail-task.

The reduce task would eventually be reexecuted (after some timeout, defaulting to 10 minutes, the tasktracker would be assumed lost and all reducers that were running on that node would be reexecuted).

I hope to avoid a repeat; I'll be rolling our cluster back to 0.15 today. A peer at another startup confirmed the whole batch of problems I've been experiencing, and for him 0.15 works for production. --SNIP--

Pls raise jiras for the specific problems. --SNIP--
Re: Hadoop and Ganglia Metrics
Once the patch is applied you should start seeing the ganglia metrics. We do. Joe Williams wrote: Once I have the patch applied and have it running, should I see the metrics? Or do I need to do additional work? Thanks. -Joe Jason Venner wrote: I applied the patch in the jira to my distro Joe Williams wrote: Thanks Jason, until this is implemented, how are you pulling stats from Hadoop? -joe Jason Venner wrote: Check out https://issues.apache.org/jira/browse/HADOOP-3422 Joe Williams wrote: I have been attempting to get Hadoop metrics into Ganglia and have been unsuccessful thus far. I have seen this thread (http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200712.mbox/raw/[EMAIL PROTECTED]/) but it didn't help much. I have set up my properties file like so:

[EMAIL PROTECTED] current]# cat conf/hadoop-metrics.properties
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=127.0.0.1:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=127.0.0.1:8649

And if I 'telnet 127.0.0.1 8649' I receive the Ganglia XML metrics output without any Hadoop-specific metrics:

[EMAIL PROTECTED] current]# telnet 127.0.0.1 8649
Trying 127.0.0.1...
Connected to localhost (127.0.0.1).
Escape character is '^]'.
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
<!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>
<!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>
<!ATTLIST GANGLIA_XML SOURCE CDATA #REQUIRED>
--SNIP--

Is there more I need to do to get the metrics to show up in this output, or am I doing something incorrectly? Do I need to have a gmetric script run in a cron job to update the stats? If so, does anyone have a Hadoop-specific example of this? Any info would be helpful. Thanks. -Joe
Re: hadoop 0.17.1 reducer not fetching map output problem
On Thursday 24 July 2008 21:40:22 Devaraj Das wrote: [...] Pls raise jiras for the specific problems. I know, that's why I bracketed it as rantmode. 
OTOH, many of these issues either had this creepy feeling where you wondered if you did something wrong, or were issues where I had to react relatively quickly, which usually destroys the faulty state. (I know, as a developer, a reproduced bug is golden. As an admin asked about processing lag, it's rather the opposite.) Plus, fixing the issue in the next release or even via a patch means that I have a non-working cluster till then. Now that means I would need to start debugging the cluster utility software instead of our apps. ;( Andreas
Re: Hadoop and Ganglia Metrics
Sweet, thanks. Jason Venner wrote: Once the patch is applied you should start seeing the ganglia metrics. We do. -- Name: Joseph A. Williams Email: [EMAIL PROTECTED]
about the overhead
Hi all, Does Hadoop provide a way to let users know the time spent on computation (the map/reduce functions) and the time spent on different types of overhead (such as startup, sorting, disk I/O, etc.) respectively? Thanks~~ Best regards, -- --- Wei
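As far as I know, 0.17-era Hadoop exposes per-task start/finish times and some I/O counters in the job web UI, but no per-phase overhead breakdown; one pragmatic option is to instrument your own map/reduce code. A minimal, framework-independent sketch (`PhaseTimer` is a made-up helper, not a Hadoop API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hand-rolled phase timing: wrap each unit of work you care about
// (e.g. "map", "parse", "io") and accumulate wall-clock nanoseconds
// per label. PhaseTimer is hypothetical, not part of Hadoop.
class PhaseTimer {
    private final Map<String, Long> totals = new LinkedHashMap<String, Long>();

    // Run a unit of work and charge its elapsed time to the label.
    void time(String label, Runnable work) {
        long start = System.nanoTime();
        work.run();
        long elapsed = System.nanoTime() - start;
        Long old = totals.get(label);
        totals.put(label, (old == null ? 0L : old) + elapsed);
    }

    // Total nanoseconds charged to a label (0 if never timed).
    long nanosFor(String label) {
        Long v = totals.get(label);
        return v == null ? 0L : v;
    }
}
```

In a real job you would wrap the body of your map method with `time("map", ...)` and report the totals at the end of the task, for example by writing them to the task log or incrementing a custom counter.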
Hadoop DFS
Hi, I am new to Hadoop. Right now, I am only interested in working with Hadoop DFS. Can someone guide me where to start? Does anyone have information about an application that has already integrated Hadoop DFS? Any material about Hadoop DFS (case studies, articles, books, etc.) will be very nice. Thanks, Wasim
Re: Hadoop and Ganglia Metrics
Ah, yeah, I found that one. :) Patching 'java/org/apache/hadoop/mapred/JobInProgress.java' on 0.17.1. -joe Jason Venner wrote: I have only applied this patch as far forward as 0.16.0 Joe Williams wrote: Sweet, thanks. -- Name: Joseph A. Williams Email: [EMAIL PROTECTED]
Re: Hadoop and Fedora Core 6 Adventure, Need Help ASAP
Hello folks, if somebody has successfully installed Hadoop on FC 6, please help!!! Just bootstrapping into the Hadoop madness, I was attempting to install Hadoop on Fedora Core 6. Tried all sorts of things but couldn't get past this error, which keeps the reduce tasks from starting:

2008-07-24 13:04:06,642 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807241301_0001_r_00_0: java.lang.NullPointerException
at java.util.Hashtable.get(Hashtable.java:334)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.fetchOutputs(ReduceTask.java:1103)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:328)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

Before you ask, here are the details:
1. Running Hadoop as a single-node cluster
2. Disabled IPv6
3. Using Hadoop version hadoop-0.17.1
4. Enabled ssh access to the local machine
5. Master and slaves are set to localhost
6. Created a simple sample file and loaded it into DFS
7. Encountered the error when running the wordcount example provided with the package
8. Here is my hadoop-site.xml:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>1</value>
    <description>define mapred.map tasks to be number of slave hosts</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
    <description>define mapred.reduce tasks to be number of slave hosts</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1800m</value>
    <description>Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@[EMAIL PROTECTED]</description>
  </property>
</configuration>
Re: can hadoop read files backwards
never mind i got it. Elia Mazzawi wrote: I need some help with the implementation, to have the mapper produce key=id, value = type,timestamp which is essentially string, string. What do I give output.collect for the value? I want to store type,timestamp but it only takes Text, IntWritable and I want to store Text, Text ? or what can I store in there. Here is my mapper, which doesn't work because output.collect doesn't want Text, Text:

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private Text Key = new Text();
    private Text Value = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        // line is parsed and now i have 2 strings
        // String S1; // contains the key
        // String S2; // contains the value
        Key.set(S1);
        Value.set(S2);
        output.collect(Key, Value);
    }
}

Miles Osborne wrote: unless you have a gigantic number of items with the same id, this is straightforward. Have a mapper emit items of the form: key=id, value = type,timestamp and your reducer will then see all items that have the same id together. It is then a simple matter to process all items with the same id. For example, you could simply read them into a list and work on them in any manner you see fit. (Note that Hadoop is perfectly fine at dealing with multi-line items. 
all you need do is make sure that the items you want to process together all share the same key) Miles 2008/7/18 Elia Mazzawi [EMAIL PROTECTED]: well here is the problem I'm trying to solve, I have a data set that looks like this:

ID  type  Timestamp
A1  X     1215647404
A2  X     1215647405
A3  X     1215647406
A1  Y     1215647409

I want to count how many A1 Y rows show up within 5 seconds of an A1 X. I was planning to have the data sorted by ID then timestamp, then read it backwards (or have it sorted by reverse timestamp) and go through it caching all Y's for the same ID for 5 seconds, to either find a matching X or not. The results don't need to be 100% accurate. So if Hadoop gives the same file with the same lines in order, then this will work. It seems Hadoop is really good at solving problems that depend on one line at a time, but not multiple lines? Hadoop has to get data in order, and be able to work on multiple lines, otherwise how can it be setting records in data sorts. I'd appreciate other suggestions to go about doing this. Jim R. Wilson wrote: does wordcount get the lines in order? or are they random? can i have hadoop return them in reverse order? You can't really depend on the order that the lines are given - it's best to think of them as random. The purpose of MapReduce/Hadoop is to distribute a problem among a number of cooperating nodes. The idea is that any given line can be interpreted separately, completely independent of any other line. So in wordcount, this makes sense. For example, say you and I are nodes. Each of us gets half the lines in a file and we can count the words we see and report on them - it doesn't matter what order we're given the lines, or which lines we're given, or even whether we get the same number of lines (if you're faster at it, or maybe you get shorter lines, you may get more lines to process in the interest of saving time). So if the project you're working on requires getting the lines in a particular order, then you probably need to rethink your approach. 
It may be that Hadoop isn't right for your problem, or maybe that the problem just needs to be attacked in a different way. Without knowing more about what you're trying to achieve, I can't offer any specifics. Good luck! -- Jim On Thu, Jul 17, 2008 at 4:41 PM, Elia Mazzawi [EMAIL PROTECTED] wrote: I have a program based on wordcount.java and I have files that are smaller than 64 MB (so I believe each file is one task), so does wordcount get the lines in order? or are they random? can I have Hadoop return them in reverse order? Jim R. Wilson wrote: It sounds to me like you're talking about Hadoop streaming (correct me if I'm wrong there). In that case, there's really no order to the lines being doled out as I understand it. Any given line could be handed to any given mapper task running on any given node. I may be wrong, of course; someone closer to the project could give you the right answer in that case. -- Jim R. Wilson (jimbojw) On Thu, Jul 17, 2008 at 4:06 PM, Elia Mazzawi [EMAIL PROTECTED] wrote: is there a way to have hadoop hand over the lines of a file backwards to my mapper ? as in give the last line first.
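Once the grouping Miles describes is in place (key=id, value=type,timestamp, so the reducer sees one ID's events together; the collector-type question upthread is solved by declaring all four of the Mapper's type parameters, and the job's map output classes, as Text), the 5-second window check reduces to a single scan over the ID's events sorted by timestamp. A pure-Java sketch of that per-ID logic, kept independent of the Hadoop API (the class and method names are illustrative):

```java
import java.util.List;

// Per-ID scan: given one ID's events as "type,timestamp" strings
// sorted by ascending timestamp, count the Y events that occur
// within `windowSeconds` of the most recent X event. This is the
// kind of logic a reducer could run after reading its values into
// a list; WindowCounter is a made-up name, not a Hadoop class.
class WindowCounter {
    static int countYNearX(List<String> events, long windowSeconds) {
        int count = 0;
        long lastX = -1;  // timestamp of the most recent X; -1 = none seen yet
        for (String event : events) {
            String[] parts = event.split(",");
            String type = parts[0];
            long ts = Long.parseLong(parts[1]);
            if (type.equals("X")) {
                lastX = ts;
            } else if (type.equals("Y") && lastX >= 0
                       && ts - lastX <= windowSeconds) {
                count++;
            }
        }
        return count;
    }
}
```

With the thread's example data, the single A1 Y at 1215647409 falls exactly 5 seconds after the A1 X at 1215647404, so it would be counted.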
Re: Bean Scripting Framework?
On Thursday 24 July 2008 21:40:20 Lincoln Ritter wrote: Hello all. Has anybody ever tried/considered using the Bean Scripting Framework within Hadoop? BSF seems nice since it allows two-way communication between Ruby and Java. I'd love to hear your thoughts, as I've been trying to make this work to allow using Ruby in the m/r pipeline. For now, I don't need a fully general solution; I'd just like to call some Ruby in my map or reduce tasks. Why not use JRuby? AFAIK, there is a complete Ruby implementation on top of Java, and although I have not used it, I'd presume that it allows full usage of Java classes, as Jython does. Andreas
Re: Bean Scripting Framework?
On Thursday 24 July 2008 23:24:19 Lincoln Ritter wrote: Why not use JRuby? Indeed! I'm basically working from the JRuby wiki page on Java integration (http://wiki.jruby.org/wiki/Java_Integration). I'm taking this one step at a time and, while I would love tighter integration, the recommended way is through the scripting frameworks. Right now, I'm most interested in taking some baby steps before going more general. I welcome any and all feedback/suggestions, especially if you have tried this. I will post any results if there is interest, but mostly I am trying to accomplish a pretty small task and am not yet thinking about a more general solution. Guess I won't be a big resource for you then; the only thing that I did was implementing a tar program with Jython that creates/extracts from/to HDFS. It was painful, but not too painful, and it's not Jython's fault; it's just that using these clunky interfaces/classes is painful to a Python developer. Guess the same feeling will come to Ruby developers. (And that's not a problem of Hadoop; I think that most Java APIs feel clunky to people used to more powerful languages. :-P) Andreas
Trying to write to HDFS from mapreduce.
Hi! I'm writing a mapreduce job where I want the output from the mapper to go straight to HDFS without passing through the reduce method. I have been told that I can do: c.setOutputFormat(TextOutputFormat.class); and I also added: Path path = new Path(user); FileOutputFormat.setOutputPath(c, path); But I still ended up with the result on the local filesystem instead. Regards Erik
Re: Trying to write to HDFS from mapreduce.
I think your conf is incorrectly set and your job was run locally. Also, have you done jobconf.setNumReduceTasks(0)? Try running some example jobs to test your setting. Nicholas Sze - Original Message From: Erik Holstad [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Thursday, July 24, 2008 3:17:40 PM Subject: Trying to write to HDFS from mapreduce. [...]
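The "run locally" diagnosis above usually means the client JVM never picked up hadoop-site.xml, so fs.default.name stays at its file:/// default (local filesystem) and mapred.job.tracker at "local". A sketch of the two properties to verify on the submitting machine's classpath (the host names and ports below are placeholders, not values from this thread):

```xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host:9001</value>
</property>
```

If both are set and the output path is given without a scheme, it will then be resolved against HDFS rather than the local disk.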
Re: Bean Scripting Framework?
Andreas, If you wouldn't mind posting some snippets that would be great! There seems to be a general lack of examples out there, so pretty much anything would help. -lincoln -- lincolnritter.com On Thu, Jul 24, 2008 at 3:06 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote: [...] Guess I won't be a big resource for you then, the only thing that I did was implementing a tar program with Jython that creates/extracts from/to HDFS. [...]
Re: Bean Scripting Framework?
Funny you should mention it - I'm working on a framework to do JRuby Hadoop this week. Something like:

class MyHadoopJob < Radoop
  input_format :text_input_format
  output_format :text_output_format
  map_output_key_class :text
  map_output_value_class :text

  def mapper(k, v, output, reporter)
    # ...
  end

  def reducer(k, vs, output, reporter)
  end
end

Plus a java glue file to call the Ruby stuff. And then it jars up the ruby files, the gem directory, and goes from there. -- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com
Re: Bean Scripting Framework?
Well, that sounds awesome! It would be simply splendid to see what you've got, if you're willing to share. Are you going the 'direct' embedding route or using a scripting framework (BSF or javax.script)? -lincoln -- lincolnritter.com On Thu, Jul 24, 2008 at 3:42 PM, James Moore [EMAIL PROTECTED] wrote: Funny you should mention it - I'm working on a framework to do JRuby Hadoop this week. [...] -- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com
Re: Name node heap space problem
Check how much memory is allocated for the JVM running the namenode. In the file HADOOP_INSTALL/conf/hadoop-env.sh you should change the line that starts with export HADOOP_HEAPSIZE=1000 It's set to 1 GB by default. On Fri, Jul 25, 2008 at 2:51 AM, Gert Pfeifer [EMAIL PROTECTED] wrote: Update on this one... I put some more memory in the machine running the name node. Now fsck is running. Unfortunately ls fails with a time-out. I identified one directory that causes the trouble. I can run fsck on it but not ls. What could be the problem? Gert Gert Pfeifer schrieb: Hi, I am running a Hadoop DFS on a cluster of 5 data nodes with a name node and one secondary name node. I have 1788874 files and directories, 1465394 blocks = 3254268 total. Heap Size max is 3.47 GB. My problem is that I produce many small files. Therefore I have a cron job which runs daily across the new files, copies them into bigger files, and deletes the small files. Apart from this program, even an fsck kills the cluster. The problem is that, as soon as I start this program, the heap space of the name node reaches 100%. What could be the problem? There are not many small files right now and still it doesn't work. I guess we have had this problem since the upgrade to 0.17. Here is some additional data about the DFS: Capacity: 2 TB, DFS Remaining: 1.19 TB, DFS Used: 719.35 GB, DFS Used%: 35.16%. Thanks for hints, Gert
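Concretely, that is a one-line edit in hadoop-env.sh, followed by a namenode restart. The 3000 below is an illustrative value in MB, not a recommendation; size it to the namenode's actual file and block counts:

```shell
# HADOOP_INSTALL/conf/hadoop-env.sh
# Maximum heap for the Hadoop daemons, in MB. The default of 1000
# is often too small for a namenode tracking millions of files
# and blocks; 3000 here is just an example value.
export HADOOP_HEAPSIZE=3000
```

The daemon must be restarted for the new heap size to take effect, since the value is read at JVM startup.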
Re: Bean Scripting Framework?
On Thu, Jul 24, 2008 at 3:51 PM, Lincoln Ritter [EMAIL PROTECTED] wrote: Well that sounds awesome! It would be simply splendid to see what you've got if you're willing to share. I'll be happy to share, but it's pretty much in pieces, not ready for release. I'll put it out with whatever license Hadoop itself uses (presumably Apache). Are you going the 'direct' embedding route or using a scripting framework (BSF or javax.script)? JSR 223 is the way to go according to the JRuby guys at RailsConf last month. It's pretty straightforward - see http://wiki.jruby.org/wiki/Java_Integration#Java_6_.28using_JSR_223:_Scripting.29 -- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com
Re: Bean Scripting Framework?
Why don't you use Hadoop streaming? - Original Message From: Lincoln Ritter [EMAIL PROTECTED] To: core-user core-user@hadoop.apache.org Sent: Friday, July 25, 2008 1:10:20 AM Subject: Bean Scripting Framework? [...]