Re: Using MapReduce to do table comparing.

2008-07-24 Thread Amber
I agree with you that this is an acceptable method if the time spent on exporting data 
from the RDBMS, importing the file into HDFS and then importing the data into the 
RDBMS again is considered as well, but this is a single-process/thread method. BTW, 
can you tell me how long it takes your method to process those 130 million rows, 
how large the data volume is, and how powerful your physical computers are?
Thanks a lot!
--
From: Michael Lee [EMAIL PROTECTED]
Sent: Thursday, July 24, 2008 11:51 AM
To: core-user@hadoop.apache.org
Subject: Re: Using MapReduce to do table comparing.

 Amber wrote:
 We have a 10 million row table exported from AS400 mainframe every day, the 
 table is exported as a csv text file, which is about 30GB in size, then the 
 csv file is imported into a RDBMS table which is dropped and recreated every 
 day. Now we want to find how many rows are updated during each export-import 
 interval, the table has a primary key, so deletes and inserts can be found 
 using RDBMS joins quickly, but we must do a column to column comparing in 
 order to find the difference between rows ( about 90%) with the same primary 
 keys. Our goal is to find a comparing process which takes no more than 10 
 minutes on a 4-node cluster, each server of which has 4 4-core 3.0 GHz 
 CPUs, 8GB memory and a 300GB local RAID5 array.

 Below is our current solution:
 The old data is kept in the RDBMS with index created on the primary key, 
 the new data is imported into HDFS as the input file of our Map-Reduce job. 
 Every map task connects to the RDBMS database, and selects old data from it 
 for every row, map tasks will generate outputs if differences are found, and 
 there are no reduce tasks.

 As you can see, with the number of concurrent map tasks increasing, the 
 RDBMS database will become the bottleneck, so we want to kick out the RDBMS, 
 but we have no idea about how to retrieve the old row with a given key 
 quickly from HDFS files, any suggestion is welcome.
 10 million is not bad.  I do this all the time in UDB 8.1 - multiple key 
 columns and multiple value columns - and calculate deltas: insert, 
 delete and update.
 
 What others have suggested works ( I tried a very crude version of what 
 James Moore suggested in Hadoop with 70+ million records ), but you have 
 to remember there are other costs ( dumping out files, putting them into 
 HDFS, etc. ).  It might be better to process straight in the database or 
 to do straight file processing. Also, the key is avoiding transactions.
 
 If you are doing outside of database...
 
 You have 'old.csv' and 'new.csv', both sorted by primary key ( when you 
 extract, make sure you do ORDER BY ).  In your application, you open two 
 file handles and read one line at a time.  Build the keys.  If the keys 
 are equal, you compare the two lines to see whether they are identical.  If 
 the keys differ, their natural order tells you whether it is an insert or a 
 delete.  Once you decide, you read the next line ( for an insert/delete you 
 only advance one of the two files ).
 
 Here is the pseudo code
 
 oldFile = File.new(oldFilename, "r")
 newFile = File.new(newFilename, "r")
 outFile = File.new(outFilename, "w")
 
 oldLine = oldFile.gets
 newLine = newFile.gets
 
 while true
    oldKey = convertToKey(oldLine)
    newKey = convertToKey(newLine)
 
    if oldKey < newKey
       ## key only in the old file - it is a deletion
       outFile.puts oldLine.chomp + ",DELETE"
       oldLine = oldFile.gets
    elsif oldKey > newKey
       ## key only in the new file - it is an insert
       outFile.puts newLine.chomp + ",INSERT"
       newLine = newFile.gets
    else
       ## same key - compare the full lines
       outFile.puts newLine.chomp + ",UPDATE" if oldLine != newLine
 
       oldLine = oldFile.gets
       newLine = newFile.gets
    end
 end
 
 Okay - I skipped the part where EOF is reached in each file, but you get 
 the point.
 
 If both the old and new data are in the database, you can open two database 
 connections and do the same process without dumping files.
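
 For illustration, here is a rough sketch of that two-connection variant in plain 
 JDBC ( the connection URL, the old_t/new_t table names and the pk/payload columns 
 are made up for the example, not taken from the original setup ):
 
 import java.sql.Connection;
 import java.sql.DriverManager;
 import java.sql.ResultSet;
 
 public class TwoCursorDiff {
     public static void main(String[] args) throws Exception {
         // one connection per cursor; both queries are ordered by the primary
         // key, so the deltas fall out of a single merge pass, no transactions
         Connection c1 = DriverManager.getConnection(args[0]);
         Connection c2 = DriverManager.getConnection(args[0]);
         ResultSet oldRs = c1.createStatement().executeQuery(
             "SELECT pk, payload FROM old_t ORDER BY pk");
         ResultSet newRs = c2.createStatement().executeQuery(
             "SELECT pk, payload FROM new_t ORDER BY pk");
 
         boolean hasOld = oldRs.next();
         boolean hasNew = newRs.next();
         while (hasOld && hasNew) {
             // string comparison assumes the keys sort the same way as in the
             // ORDER BY; adjust for numeric keys
             int cmp = oldRs.getString("pk").compareTo(newRs.getString("pk"));
             if (cmp < 0) {                     // key only in the old data -> DELETE
                 System.out.println(oldRs.getString("pk") + ",DELETE");
                 hasOld = oldRs.next();
             } else if (cmp > 0) {              // key only in the new data -> INSERT
                 System.out.println(newRs.getString("pk") + ",INSERT");
                 hasNew = newRs.next();
             } else {                           // same key -> compare the payloads
                 if (!oldRs.getString("payload").equals(newRs.getString("payload"))) {
                     System.out.println(newRs.getString("pk") + ",UPDATE");
                 }
                 hasOld = oldRs.next();
                 hasNew = newRs.next();
             }
         }
         // whatever remains on either side is a pure DELETE or INSERT run
         // (omitted here, just like the EOF handling in the pseudo code above)
     }
 }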
 
 I journal about 130 million rows every day for a quant financial database...
 
 
 
 
 
 

Re: Hadoop and Ganglia Meterics

2008-07-24 Thread Jason Venner

Check out

https://issues.apache.org/jira/browse/HADOOP-3422


Joe Williams wrote:
I have been attempting to get Hadoop metrics into Ganglia and have been 
unsuccessful thus far. I have seen this thread 
(http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200712.mbox/raw/[EMAIL PROTECTED]/) 
but it didn't help much.


I have setup my properties file like so:


[EMAIL PROTECTED] current]# cat conf/hadoop-metrics.properties
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=127.0.0.1:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=127.0.0.1:8649


And if I 'telnet 127.0.0.1  8649' I receive the Ganglia XML metrics 
output without any hadoop specific metrics:



[EMAIL PROTECTED] current]# telnet 127.0.0.1  8649
Trying 127.0.0.1...
Connected to localhost (127.0.0.1).
Escape character is '^]'.
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
  <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>
  <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>
  <!ATTLIST GANGLIA_XML SOURCE CDATA #REQUIRED>
--SNIP--


Is there more I need to do to get the metrics to show up in this 
output, am I doing something incorrectly? Do I need to have a gmetric 
script run in a cron to update the stats? If so, does anyone have a 
hadoop specific example of this?


Any info would be helpful.

Thanks.
-Joe






Re: Using MapReduce to do table comparing.

2008-07-24 Thread Amber
Yes, I think this is the simplest method, but there are problems too:

1. The reduce stage wouldn't begin until the map stage ends, by which time we have 
already done a full scan of both tables, and the comparing will take almost the same 
time again, because about 90% of the intermediate keys will have two values. If I 
could specify a number n, so that the reduce task for a key starts as soon as there 
are n intermediate pairs with that key, that would be better. In my case I would set 
the magic number to 2.

2. I am not sure how Hadoop stores intermediate <key, value> pairs; we could not 
afford to keep them in memory as the data volume increases.

--
From: James Moore [EMAIL PROTECTED]
Sent: Thursday, July 24, 2008 1:12 AM
To: core-user@hadoop.apache.org
Subject: Re: Using MapReduce to do table comparing.

 On Wed, Jul 23, 2008 at 7:33 AM, Amber [EMAIL PROTECTED] wrote:
 We have a 10 million row table exported from AS400 mainframe every day, the 
 table is exported as a csv text file, which is about 30GB in size, then the 
 csv file is imported into a RDBMS table which is dropped and recreated every 
 day. Now we want to find how many rows are updated during each export-import 
 interval, the table has a primary key, so deletes and inserts can be found 
 using RDBMS joins quickly, but we must do a column to column comparing in 
 order to find the difference between rows ( about 90%) with the same primary 
 keys. Our goal is to find a comparing process which takes no more than 10 
 minutes on a 4-node cluster, each server of which has 4 4-core 3.0 GHz 
 CPUs, 8GB memory and a 300GB local RAID5 array.

 Below is our current solution:
The old data is kept in the RDBMS with index created on the primary key, 
 the new data is imported into HDFS as the input file of our Map-Reduce job. 
 Every map task connects to the RDBMS database, and selects old data from it 
 for every row, map tasks will generate outputs if differences are found, and 
 there are no reduce tasks.

 As you can see, with the number of concurrent map tasks increasing, the 
 RDBMS database will become the bottleneck, so we want to kick out the RDBMS, 
 but we have no idea about how to retrieve the old row with a given key 
 quickly from HDFS files, any suggestion is welcome.
 
 Think of map/reduce as giving you a kind of key/value lookup for free
 - it just falls out of how the system works.
 
 You don't care about the RDBMS.  It's a distraction - you're given a
 set of csv files with unique keys and dates, and you need to find the
 differences between them.
 
 Say the data looks like this:
 
 File for jul 10:
 0x1,stuff
 0x2,more stuff
 
 File for jul 11:
 0x1,stuff
 0x2,apples
 0x3,parrot
 
 Preprocess the csv files to add dates to the values:
 
 File for jul 10:
 0x1,20080710,stuff
 0x2,20080710,more stuff
 
 File for jul 11:
 0x1,20080711,stuff
 0x2,20080711,apples
 0x3,20080711,parrot
 
 Feed two days worth of these files into a hadoop job.
 
 The mapper splits these into k=0x1, v=20080710,stuff etc.
 
 The reducer gets one or two v's per key, and each v has the date
 embedded in it - that's essentially your lookup step.
 
 You'll end up with a system that can do compares for any two dates,
 and could easily be expanded to do all sorts of deltas across these
 files.
 
 The preprocess-the-files-to-add-a-date can probably be included as
 part of your mapper and isn't really a separate step - just depends on
 how easy it is to use one of the off-the-shelf mappers with your data.
 If it turns out to be its own step, it can become a very simple
 hadoop job.
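
 A minimal sketch of such a job against the mapred API of that era ( class names, 
 the 8-character date prefix and the driver settings are illustrative assumptions, 
 not code from this thread; with the default TextInputFormat the mapper sees one 
 CSV line per call ):
 
 import java.io.IOException;
 import java.util.Iterator;
 
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapred.FileInputFormat;
 import org.apache.hadoop.mapred.FileOutputFormat;
 import org.apache.hadoop.mapred.JobClient;
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.MapReduceBase;
 import org.apache.hadoop.mapred.Mapper;
 import org.apache.hadoop.mapred.OutputCollector;
 import org.apache.hadoop.mapred.Reducer;
 import org.apache.hadoop.mapred.Reporter;
 
 public class CsvDiff {
 
   // key = primary key, value = "yyyymmdd,rest of the row"
   public static class DiffMapper extends MapReduceBase
       implements Mapper<LongWritable, Text, Text, Text> {
     public void map(LongWritable offset, Text line,
                     OutputCollector<Text, Text> out, Reporter r) throws IOException {
       String s = line.toString();
       int comma = s.indexOf(',');          // assumes "key,date,columns..." lines
       out.collect(new Text(s.substring(0, comma)), new Text(s.substring(comma + 1)));
     }
   }
 
   // one value  -> the row exists on only one of the two days (the embedded
   //               date says whether it is an insert or a delete);
   // two values -> same key on both days, so compare the row bodies
   public static class DiffReducer extends MapReduceBase
       implements Reducer<Text, Text, Text, Text> {
     public void reduce(Text key, Iterator<Text> values,
                        OutputCollector<Text, Text> out, Reporter r) throws IOException {
       String first = values.next().toString();
       if (!values.hasNext()) {
         out.collect(key, new Text("ONLY:" + first));
       } else {
         String second = values.next().toString();
         // skip the 8-digit date and the comma before comparing
         if (!first.substring(9).equals(second.substring(9))) {
           out.collect(key, new Text("CHANGED"));
         }
       }
     }
   }
 
   public static void main(String[] args) throws Exception {
     JobConf conf = new JobConf(CsvDiff.class);
     conf.setJobName("csv-diff");
     conf.setMapperClass(DiffMapper.class);
     conf.setReducerClass(DiffReducer.class);
     conf.setOutputKeyClass(Text.class);
     conf.setOutputValueClass(Text.class);
     FileInputFormat.setInputPaths(conf, new Path(args[0]));   // both days' files
     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
     JobClient.runJob(conf);
   }
 }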
 
 -- 
 James Moore | [EMAIL PROTECTED]
 Ruby and Ruby on Rails consulting
 blog.restphone.com
 

Re: hadoop 0.17.1 reducer not fetching map output problem

2008-07-24 Thread Devaraj Das
Could you try to kill the tasktracker hosting the task the next time when it
happens? I just want to isolate the problem - whether it is a problem in the
TT-JT communication or in the Task-TT communication. From your description
it looks like the problem is between the JT-TT communication. But pls run
the experiment when it happens again and let us know what happens.

Thanks,
Devaraj


On 7/24/08 1:42 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote:

 Hi!
 
 I'm experiencing hung reducers, with the following symptoms:
 
 Task Logs: 'task_200807230647_0008_r_09_1'
 
 
 stdout logs
 
 
 
 stderr logs
 
 
 
 syslog logs
 
 red.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output
 location(s); scheduling... 2008-07-24 07:56:11,064 INFO
 org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Need 6 map output(s) 2008-07-24
 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete
 map-outputs from tasktracker and 0 map-outputs from previous failures
 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Got 0 known map output location(s);
 scheduling... 2008-07-24 07:56:16,074 INFO
 org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Need 6 map output(s) 2008-07-24
 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete
 map-outputs from tasktracker and 0 map-outputs from previous failures
 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Got 0 known map output location(s);
 scheduling... 2008-07-24 07:56:21,084 INFO
 org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Need 6 map output(s) 2008-07-24
 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete
 map-outputs from tasktracker and 0 map-outputs from previous failures
 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Got 0 known map output location(s);
 scheduling... 2008-07-24 07:56:26,094 INFO
 org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Need 6 map output(s) 2008-07-24
 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete
 map-outputs from tasktracker and 0 map-outputs from previous failures
 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Got 0 known map output location(s);
 scheduling... 2008-07-24 07:56:31,104 INFO
 org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Need 6 map output(s) 2008-07-24
 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete
 map-outputs from tasktracker and 0 map-outputs from previous failures
 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Got 0 known map output location(s);
 scheduling... 2008-07-24 07:56:36,114 INFO
 org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Need 6 map output(s) 2008-07-24
 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete
 map-outputs from tasktracker and 0 map-outputs from previous failures
 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Got 0 known map output location(s);
 scheduling... 2008-07-24 07:56:41,126 INFO
 org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
 
 
 Notice how it needs 6 map outputs, all map tasks have finished, and it still
 just hangs there.
 
 The second speculative copy of that reducer task needs 14 map outputs with the
 same messages :(
 
 Other 

Re: Hadoop and Ganglia Meterics

2008-07-24 Thread Joe Williams
Thanks Jason. Until this is implemented, how are you pulling stats 
from Hadoop?


-joe


Jason Venner wrote:

Check out

https://issues.apache.org/jira/browse/HADOOP-3422


Joe Williams wrote:
I have been attempting to get Hadoop metrics into Ganglia and have been 
unsuccessful thus far. I have seen this thread 
(http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200712.mbox/raw/[EMAIL PROTECTED]/) 
but it didn't help much.


I have setup my properties file like so:


[EMAIL PROTECTED] current]# cat conf/hadoop-metrics.properties
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=127.0.0.1:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=127.0.0.1:8649


And if I 'telnet 127.0.0.1  8649' I receive the Ganglia XML metrics 
output without any hadoop specific metrics:



[EMAIL PROTECTED] current]# telnet 127.0.0.1  8649
Trying 127.0.0.1...
Connected to localhost (127.0.0.1).
Escape character is '^]'.
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
  <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>
  <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>
  <!ATTLIST GANGLIA_XML SOURCE CDATA #REQUIRED>
--SNIP--


Is there more I need to do to get the metrics to show up in this 
output, am I doing something incorrectly? Do I need to have a gmetric 
script run in a cron to update the stats? If so, does anyone have a 
hadoop specific example of this?


Any info would be helpful.

Thanks.
-Joe






--
Name: Joseph A. Williams
Email: [EMAIL PROTECTED]



Re: Hadoop and Ganglia Meterics

2008-07-24 Thread Jason Venner

I applied the patch in the jira to my distro

Joe Williams wrote:
Thanks Jason, until this is implemented are how are you pulling stats 
from Hadoop?


-joe


Jason Venner wrote:

Check out

https://issues.apache.org/jira/browse/HADOOP-3422


Joe Williams wrote:
I have been attempting to get Hadoop metrics into Ganglia and have 
been unsuccessful thus far. I have seen this thread 
(http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200712.mbox/raw/[EMAIL PROTECTED]/) 
but it didn't help much.


I have setup my properties file like so:


[EMAIL PROTECTED] current]# cat conf/hadoop-metrics.properties
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=127.0.0.1:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=127.0.0.1:8649


And if I 'telnet 127.0.0.1  8649' I receive the Ganglia XML metrics 
output without any hadoop specific metrics:



[EMAIL PROTECTED] current]# telnet 127.0.0.1  8649
Trying 127.0.0.1...
Connected to localhost (127.0.0.1).
Escape character is '^]'.
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
  <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>
  <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>
  <!ATTLIST GANGLIA_XML SOURCE CDATA #REQUIRED>
--SNIP--


Is there more I need to do to get the metrics to show up in this 
output, am I doing something incorrectly? Do I need to have a 
gmetric script run in a cron to update the stats? If so, does anyone 
have a hadoop specific example of this?


Any info would be helpful.

Thanks.
-Joe








Anyway to order all the output folder?

2008-07-24 Thread Xing

Hi All,

There are 30 output folders produced by Hadoop. Each folder is in ascending 
order, but the order is not ascending across folders; for example, the values are 
1, 5, 10 in folder A and 6, 8, 9 in folder B.
My question is how to enforce the order across all the folders as well, so that 
the output is 1, 5, 6 in folder A and 8, 9, 10 in folder B.

I am just starting to learn Hadoop and hope you can help me. :)
Thanks

Shane


Re: Name node heap space problem

2008-07-24 Thread Gert Pfeifer

Update on this one...

I put some more memory in the machine running the name node. Now fsck is 
running. Unfortunately ls fails with a time-out.


I identified one directory that causes the trouble. I can run fsck on it 
but not ls.


What could be the problem?

Gert

Gert Pfeifer schrieb:

Hi,
I am running a Hadoop DFS on a cluster of 5 data nodes with a name node
and one secondary name node.

I have 1788874 files and directories, 1465394 blocks = 3254268 total.
Heap Size max is 3.47 GB.

My problem is that I produce many small files. Therefore I have a cron
job which just runs daily across the new files and copies them into
bigger files and deletes the small files.
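
For reference, such a consolidation pass might look roughly like the sketch below 
( the paths, the plain byte-level concatenation and the lack of error handling are 
my own illustration, not Gert's actual cron job; method names may differ slightly 
on older FileSystem APIs ):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Concatenate every file under srcDir into one big dstFile, then delete
// the small originals so the name node has fewer objects to track.
public class SmallFileConsolidator {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path srcDir = new Path(args[0]);
    Path dstFile = new Path(args[1]);

    FSDataOutputStream out = fs.create(dstFile);
    for (FileStatus stat : fs.listStatus(srcDir)) {
      if (stat.isDir()) continue;                 // skip subdirectories
      FSDataInputStream in = fs.open(stat.getPath());
      IOUtils.copyBytes(in, out, conf, false);    // false = leave the streams open
      in.close();
    }
    out.close();

    fs.delete(srcDir, true);                      // drop the small files
  }
}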

Apart from this program, even a fsck kills the cluster.

The problem is that, as soon as I start this program, the heap space of
the name node reaches 100 %.

What could be the problem? There are not many small files right now and
still it doesn't work. I guess we have this problem since the upgrade to
0.17.

Here is some additional data about the DFS:
Capacity :   2 TB
DFS Remaining   :   1.19 TB
DFS Used:   719.35 GB
DFS Used%   :   35.16 %

Thanks for hints,
Gert




Anybody used AppNexus for hosting Hadoop app?

2008-07-24 Thread jeremy.huylebroeck

I discovered AppNexus yesterday.
They offer hosting similar to Amazon EC2, with apparently more dedicated
hardware and a better notion of where things are in the datacenter.

Their web site says they are optimized for Hadoop applications.

Anybody tried and could give some feedback?
 
J.


Re: can hadoop read files backwards

2008-07-24 Thread Elia Mazzawi

I need some help with the implementation, to have the mapper produce
key=id, value = type,timestamp
which is essentially string, string.

What do I give output.collect for the value? I want to store "type,timestamp", 
but it only takes (Text, IntWritable) and I want (Text, Text) - or what else 
can I store in there?


Here is my mapper, which doesn't work because output.collect doesn't 
want (Text, Text):


   public static class Map extends MapReduceBase implements 
Mapper<LongWritable, Text, Text, IntWritable> {

   private Text Key = new Text();
   private Text Value = new Text();

   public void map(LongWritable key, Text value, 
OutputCollector<Text, IntWritable> output, Reporter reporter) throws 
IOException {

   String line = value.toString();

//   line is parsed and now i have 2 strings
//   String S1;   // contains the key
//   String S2;  // contains  the value

   Key.set(S1);
   Value.set(S2);
   output.collect(Key, Value);
   }
   }
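
For what it's worth, here is a minimal sketch of the same mapper with the value 
type declared as Text, so that output.collect(Text, Text) type-checks ( the class 
name and the whitespace parsing are my assumptions ). The job configuration also 
needs the matching declaration, e.g. conf.setMapOutputValueClass(Text.class):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Mapper declared with Text output values, so (Text, Text) can be collected.
public class IdEventMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Text outKey = new Text();
  private final Text outValue = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // input lines look like "A1   X   1215647404" (id, type, timestamp)
    String[] fields = value.toString().trim().split("\\s+");
    outKey.set(fields[0]);                      // key = id
    outValue.set(fields[1] + "," + fields[2]);  // value = "type,timestamp"
    output.collect(outKey, outValue);
  }
}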


Miles Osborne wrote:

unless you have a gigantic number of items with the same id, this is
straightforward.  have a mapper emit items of the form:

key=id, value = type,timestamp

and your reducer will then see all ids that have the same value together.
it is then a simple matter to process all items with the same id.  for
example, you could simply read them into a list and work on them in any
manner you see fit.

(note that hadoop is perfectly fine at dealing with multi-line items.  all
you need do is make sure that the items you want to process together all
share the same key)
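
As an illustration of the read-them-into-a-list idea, a reducer for the 5-second 
matching problem described further down in this thread might look roughly like 
this ( the class name and the exact window check are my assumptions ):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// All "type,timestamp" values for one id arrive together; buffer them,
// sort by timestamp, and count the Y events within 5 seconds of an X.
public class WithinFiveSecondsReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, LongWritable> {

  public void reduce(Text id, Iterator<Text> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    List<long[]> events = new ArrayList<long[]>();   // {timestamp, isY}
    while (values.hasNext()) {
      String[] parts = values.next().toString().split(",");
      events.add(new long[] { Long.parseLong(parts[1]),
                              "Y".equals(parts[0]) ? 1L : 0L });
    }
    Collections.sort(events, new Comparator<long[]>() {
      public int compare(long[] a, long[] b) {
        return a[0] < b[0] ? -1 : (a[0] > b[0] ? 1 : 0);
      }
    });

    long count = 0;
    long lastX = -1;
    for (long[] e : events) {
      if (e[1] == 0L) {
        lastX = e[0];                                // remember the latest X
      } else if (lastX >= 0 && e[0] - lastX <= 5) {
        count++;                                     // a Y within 5 seconds of an X
      }
    }
    output.collect(id, new LongWritable(count));
  }
}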

Miles

2008/7/18 Elia Mazzawi [EMAIL PROTECTED]:

  

well here is the problem I'm trying to solve,

I have a data set that looks like this:

ID   type   Timestamp

A1   X      1215647404
A2   X      1215647405
A3   X      1215647406
A1   Y      1215647409

I want to count how many "A1 Y" rows show up within 5 seconds of an "A1 X".

I was planning to have the data sorted by ID then timestamp,
then read it backwards (or have it sorted by reverse timestamp),

and go through it caching all Y's for the same ID for 5 seconds, to either find
a matching X or not.

the results don't need to be 100% accurate.

so if hadoop gives the same file with the same lines in order then this
will work.

seems hadoop is really good at solving problems that depend on 1 line at a
time? but not multi lines?

hadoop has to get data in order, and be able to work on multi lines,
otherwise how can it be setting records in data sorts.

I'd appreciate other suggestions to go about doing this.

Jim R. Wilson wrote:



does wordcount get the lines in order? or are they random? can i have
  

hadoop return them in reverse order?




You can't really depend on the order that the lines are given - it's
best to think of them as random.  The purpose of MapReduce/Hadoop is
to distribute a problem among a number of cooperating nodes.

The idea is that any given line can be interpreted separately,
completely independent of any other line.  So in wordcount, this makes
sense.  For example, say you and I are nodes. Each of us gets half the
lines in a file and we can count the words we see and report on them -
it doesn't matter what order we're given the lines, or which lines
we're given, or even whether we get the same number of lines (if
you're faster at it, or maybe you get shorter lines, you may get more
lines to process in the interest of saving time).

So if the project you're working on requires getting the lines in a
particular order, then you probably need to rethink your approach. It
may be that hadoop isn't right for your problem, or maybe that the
problem just needs to be attacked in a different way.  Without knowing
more about what you're trying to achieve, I can't offer any specifics.

Good luck!

-- Jim

On Thu, Jul 17, 2008 at 4:41 PM, Elia Mazzawi
[EMAIL PROTECTED] wrote:


  

I have a program based on wordcount.java
and I have files that are smaller than 64mb files (so i believe each file
is
one task )

do does wordcount get the lines in order? or are they random? can i have
hadoop return them in reverse order?

Jim R. Wilson wrote:




It sounds to me like you're talking about hadoop streaming (correct me
if I'm wrong there).  In that case, there's really no order to the
lines being doled out as I understand it.  Any given line could be
handed to any given mapper task running on any given node.

I may be wrong, of course, someone closer to the project could give
you the right answer in that case.

-- Jim R. Wilson (jimbojw)

On Thu, Jul 17, 2008 at 4:06 PM, Elia Mazzawi
[EMAIL PROTECTED] wrote:



  

is there a way to have hadoop hand over the lines of a file backwards
to
my
mapper ?

as in give the last line first.








  




Re: hadoop 0.17.1 reducer not fetching map output problem

2008-07-24 Thread Andreas Kostyrka
On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
 Could you try to kill the tasktracker hosting the task the next time when
 it happens? I just want to isolate the problem - whether it is a problem in
 the TT-JT communication or in the Task-TT communication. From your
 description it looks like the problem is between the JT-TT communication.
 But pls run the experiment when it happens again and let us know what
 happens.

Well, I did restart the tasktracker where the reduce job was running, but that 
led only to a situation where the jobtracker did not restart the job, showed 
it as still running, and I was not able to kill the reduce task via hadoop 
job -kill-task nor -fail-task.

I hope to avoid a repeat; I'll be relapsing our cluster to 0.15 today. A peer 
at another startup confirmed the whole batch of problems I've been 
experiencing, and for him 0.15 works for production.

<rant-mode>
No question, 0.17 is way better than 0.16; on the other hand I wonder how 0.16 
could get released? (I'm using streaming.jar, and with 0.16.x I've introduced 
reducing to our workloads, and before, 0.16 failed 80% of the jobs with 
reducers not being able to get their output. 0.17.0 improved that to a point 
where one can, with some pain - e.g. restarting the cluster daily, not storing 
anything important on HDFS, only temporary data, ... - use it somehow for 
production, at least for small jobs.) So one wonders how 0.16 got released? 
Or was it meant only as a developer-only bug-fixing series?
</rant-mode>

Sorry, this has been driving me up the walls into an asylum till I compared 
notes with a colleague, and decided that I'm not crazy ;)

Andreas


 Thanks,
 Devaraj

 On 7/24/08 1:42 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
  Hi!
 
  I'm experiencing hung reducers, with the following symptoms:
  Task Logs: 'task_200807230647_0008_r_09_1'
 
 
  stdout logs
 
 
 
  stderr logs
 
 
 
  syslog logs
 
  red.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output
  location(s); scheduling... 2008-07-24 07:56:11,064 INFO
  org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
  Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
  07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
  task_200807230647_0008_r_09_1 Need 6 map output(s) 2008-07-24
  07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
  task_200807230647_0008_r_09_1: Got 0 new map-outputs  0 obsolete
  map-outputs from tasktracker and 0 map-outputs from previous failures
  2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
  task_200807230647_0008_r_09_1 Got 0 known map output location(s);
  scheduling... 2008-07-24 07:56:16,074 INFO
  org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
  Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
  07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
  task_200807230647_0008_r_09_1 Need 6 map output(s) 2008-07-24
  07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
  task_200807230647_0008_r_09_1: Got 0 new map-outputs  0 obsolete
  map-outputs from tasktracker and 0 map-outputs from previous failures
  2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
  task_200807230647_0008_r_09_1 Got 0 known map output location(s);
  scheduling... 2008-07-24 07:56:21,084 INFO
  org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
  Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
  07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
  task_200807230647_0008_r_09_1 Need 6 map output(s) 2008-07-24
  07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
  task_200807230647_0008_r_09_1: Got 0 new map-outputs  0 obsolete
  map-outputs from tasktracker and 0 map-outputs from previous failures
  2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
  task_200807230647_0008_r_09_1 Got 0 known map output location(s);
  scheduling... 2008-07-24 07:56:26,094 INFO
  org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
  Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
  07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
  task_200807230647_0008_r_09_1 Need 6 map output(s) 2008-07-24
  07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
  task_200807230647_0008_r_09_1: Got 0 new map-outputs  0 obsolete
  map-outputs from tasktracker and 0 map-outputs from previous failures
  2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
  task_200807230647_0008_r_09_1 Got 0 known map output location(s);
  scheduling... 2008-07-24 07:56:31,104 INFO
  org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
  Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
  07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask:
  task_200807230647_0008_r_09_1 Need 6 map output(s) 2008-07-24
  07:56:36,114 INFO 

Bean Scripting Framework?

2008-07-24 Thread Lincoln Ritter
Hello all.

Has anybody ever tried/considered using the Bean Scripting Framework
within Hadoop?  BSF seems nice since it allows two-way communication
between ruby and java.  I'd love to hear your thoughts as I've been
trying to make this work to allow using ruby in the m/r pipeline.  For
now, I don't need a fully general solution, I'd just like to call some
ruby in my map or reduce tasks.

Thanks!

-lincoln

--
lincolnritter.com


Re: hadoop 0.17.1 reducer not fetching map output problem

2008-07-24 Thread Devaraj Das



On 7/25/08 12:09 AM, Andreas Kostyrka [EMAIL PROTECTED] wrote:

 On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
 Could you try to kill the tasktracker hosting the task the next time when
 it happens? I just want to isolate the problem - whether it is a problem in
 the TT-JT communication or in the Task-TT communication. From your
 description it looks like the problem is between the JT-TT communication.
 But pls run the experiment when it happens again and let us know what
 happens.
 
 Well, I did restart the tasktracker where the reduce job was running, but that
 lead only to a situation where the jobtracker did not restart the job, showed
 it as still running, and was not able to kill the reduce task via hadoop
 job -kill-task nor -fail-task.

The reduce task would eventually be reexecuted (after some timeout,
defaulting to 10 minutes, the tasktracker would be assumed as lost and all
reducers that were running on that node would be reexecuted).

 
 I hope to avoid a repeat, I'll be relapsing out cluster to 0.15 today. A peer
 at another startup confirmed the whole batch of problems I've been
 experiencing, and for him 0.15 works for production.
 
 rant-mode
 No question, 0.17 is way better than 0.16, on the other hand I wonder how 0.16
 could get released? (I'm using streaming.jar, and with 0.16.x I've introduced
 reducing to our workloads, and before 0.16 failed 80% of the jobs with
 reducers not being able to get their output. 0.17.0 improved that to a point
 where one can, with some pain, e.g. restarting the cluster daily, not storing
 anything important on HDFS, only temporary data, ..., use it somehow for
 production, at least for small jobs.) So one wonders how 0.16 got released?
 Or was it meant only as developer-only bug fixing series?
 /rant-mode

Pls raise jiras for the specific problems.
 
 Sorry, this has been driving me up the walls into an asylum till I compared
 notes with a collegue, and decided that I'm not crazy ;)
 
 Andreas
 
 
 Thanks,
 Devaraj
 
 On 7/24/08 1:42 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
 Hi!
 
 I'm experiencing hung reducers, with the following symptoms:
 Task Logs: 'task_200807230647_0008_r_09_1'
 
 
 stdout logs
 
 
 
 stderr logs
 
 
 
 syslog logs
 
 red.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output
 location(s); scheduling... 2008-07-24 07:56:11,064 INFO
 org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Need 6 map output(s) 2008-07-24
 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1: Got 0 new map-outputs  0 obsolete
 map-outputs from tasktracker and 0 map-outputs from previous failures
 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Got 0 known map output location(s);
 scheduling... 2008-07-24 07:56:16,074 INFO
 org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Need 6 map output(s) 2008-07-24
 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1: Got 0 new map-outputs  0 obsolete
 map-outputs from tasktracker and 0 map-outputs from previous failures
 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Got 0 known map output location(s);
 scheduling... 2008-07-24 07:56:21,084 INFO
 org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Need 6 map output(s) 2008-07-24
 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1: Got 0 new map-outputs  0 obsolete
 map-outputs from tasktracker and 0 map-outputs from previous failures
 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Got 0 known map output location(s);
 scheduling... 2008-07-24 07:56:26,094 INFO
 org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1
 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Need 6 map output(s) 2008-07-24
 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1: Got 0 new map-outputs  0 obsolete
 map-outputs from tasktracker and 0 map-outputs from previous failures
 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200807230647_0008_r_09_1 Got 0 known map output location(s);
 scheduling... 2008-07-24 07:56:31,104 INFO
 

Re: Hadoop and Ganglia Meterics

2008-07-24 Thread Jason Venner

Once the patch is applied you should start seeing the ganglia metrics

We do.


Joe Williams wrote:
Once I have the patch applied and have it running, should I see the 
metrics? Or do I need to do additional work?


Thanks.
-Joe


Jason Venner wrote:

I applied the patch in the jira to my distro

Joe Williams wrote:
Thanks Jason, until this is implemented are how are you pulling 
stats from Hadoop?


-joe


Jason Venner wrote:

Check out

https://issues.apache.org/jira/browse/HADOOP-3422


Joe Williams wrote:
I have been attempting to get Hadoop metrics into Ganglia and have 
been unsuccessful thus far. I have seen this thread 
(http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200712.mbox/raw/[EMAIL PROTECTED]/) 
but it didn't help much.


I have setup my properties file like so:

[EMAIL PROTECTED] current]# cat 
conf/hadoop-metrics.properties

dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=127.0.0.1:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=127.0.0.1:8649


And if I 'telnet 127.0.0.1  8649' I receive the Ganglia XML 
metrics output without any hadoop specific metrics:



[EMAIL PROTECTED] current]# telnet 127.0.0.1  8649
Trying 127.0.0.1...
Connected to localhost (127.0.0.1).
Escape character is '^]'.
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
  <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>
  <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>
  <!ATTLIST GANGLIA_XML SOURCE CDATA #REQUIRED>
--SNIP--


Is there more I need to do to get the metrics to show up in this 
output, am I doing something incorrectly? Do I need to have a 
gmetric script run in a cron to update the stats? If so, does 
anyone have a hadoop specific example of this?


Any info would be helpful.

Thanks.
-Joe










Re: hadoop 0.17.1 reducer not fetching map output problem

2008-07-24 Thread Andreas Kostyrka
On Thursday 24 July 2008 21:40:22 Devaraj Das wrote:
 On 7/25/08 12:09 AM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
  On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
  Could you try to kill the tasktracker hosting the task the next time
  when it happens? I just want to isolate the problem - whether it is a
  problem in the TT-JT communication or in the Task-TT communication. From
  your description it looks like the problem is between the JT-TT
  communication. But pls run the experiment when it happens again and let
  us know what happens.
 
  Well, I did restart the tasktracker where the reduce job was running, but
  that lead only to a situation where the jobtracker did not restart the
  job, showed it as still running, and was not able to kill the reduce task
  via hadoop job -kill-task nor -fail-task.

 The reduce task would eventually be reexecuted (after some timeout,
 defaulting to 10 minutes, the tasktracker would be assumed as lost and all
 reducers that were running on that node would be reexecuted).

  I hope to avoid a repeat, I'll be relapsing out cluster to 0.15 today. A
  peer at another startup confirmed the whole batch of problems I've been
  experiencing, and for him 0.15 works for production.
 
  rant-mode
  No question, 0.17 is way better than 0.16, on the other hand I wonder how
  0.16 could get released? (I'm using streaming.jar, and with 0.16.x I've
  introduced reducing to our workloads, and before 0.16 failed 80% of the
  jobs with reducers not being able to get their output. 0.17.0 improved
  that to a point where one can, with some pain, e.g. restarting the
  cluster daily, not storing anything important on HDFS, only temporary
  data, ..., use it somehow for production, at least for small jobs.) So
  one wonders how 0.16 got released? Or was it meant only as developer-only
  bug fixing series?
  /rant-mode

 Pls raise jiras for the specific problems.

I know, that's why I bracketed it as rant-mode. OTOH, many of these issues 
either had this creepy feeling where you wondered if you did something wrong, or 
were issues where I had to react relatively quickly, which usually destroys 
the faulty state. (I know, as a developer having a reproduced bug is golden. 
As an admin asked about processing lag, it's rather the opposite.)

Plus, fixing the issue in the next release or even via a patch means that I 
have a non-working cluster till then. Now that means I would need to start 
debugging the cluster utility software instead of our apps. ;(

Andreas


signature.asc
Description: This is a digitally signed message part.


Re: Hadoop and Ganglia Meterics

2008-07-24 Thread Joe Williams

Sweet, thanks.


Jason Venner wrote:

Once the patch is applied you should start seeing the ganglia metrics

We do.


Joe Williams wrote:
Once I have the patch applied and have it running should I see the 
metrics? Or do I need to additional work?


Thanks.
-Joe


Jason Venner wrote:

I applied the patch in the jira to my distro

Joe Williams wrote:
Thanks Jason, until this is implemented are how are you pulling 
stats from Hadoop?


-joe


Jason Venner wrote:

Check out

https://issues.apache.org/jira/browse/HADOOP-3422


Joe Williams wrote:
I have been attempting to get Hadoop metrics into Ganglia and have 
been unsuccessful thus far. I have seen this thread 
(http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200712.mbox/raw/[EMAIL PROTECTED]/) 
but it didn't help much.


I have setup my properties file like so:

[EMAIL PROTECTED] current]# cat 
conf/hadoop-metrics.properties

dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=127.0.0.1:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=127.0.0.1:8649


And if I 'telnet 127.0.0.1  8649' I receive the Ganglia XML 
metrics output without any hadoop specific metrics:



[EMAIL PROTECTED] current]# telnet 127.0.0.1  8649
Trying 127.0.0.1...
Connected to localhost (127.0.0.1).
Escape character is '^]'.
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
  <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>
  <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>
  <!ATTLIST GANGLIA_XML SOURCE CDATA #REQUIRED>
--SNIP--


Is there more I need to do to get the metrics to show up in this 
output, am I doing something incorrectly? Do I need to have a 
gmetric script run in a cron to update the stats? If so, does 
anyone have a hadoop specific example of this?


Any info would be helpful.

Thanks.
-Joe










--
Name: Joseph A. Williams
Email: [EMAIL PROTECTED]



about the overhead

2008-07-24 Thread Wei Jiang
Hi all,

Does Hadoop provide a way to let users know the time spent on
computation (the map/reduce functions) and the time spent on different types of
overhead (such as startup, sorting, disk I/O, etc.), respectively?

Thanks~~

Best regards,

-- 
---
Wei


Hadoop DFS

2008-07-24 Thread Wasim Bari
Hi,
I am new to Hadoop. Right now, I am only interested in working with Hadoop 
DFS. Can someone guide me on where to start? Does anyone have information about an 
application that has already integrated Hadoop DFS?

Any information regarding material about Hadoop DFS - case studies, articles, 
books, etc. - would be very nice.

Thanks,

Wasim

Re: Hadoop and Ganglia Meterics

2008-07-24 Thread Joe Williams
Ah, yeah, I found that one. :) Patching 
'java/org/apache/hadoop/mapred/JobInProgress.java' on 0.17.1.


-joe


Jason Venner wrote:

I have only applied this patch as far forward as 0.16.0

Joe Williams wrote:

Sweet, thanks.


Jason Venner wrote:

Once the patch is applied you should start seeing the ganglia metrics

We do.


Joe Williams wrote:
Once I have the patch applied and have it running should I see the 
metrics? Or do I need to additional work?


Thanks.
-Joe


Jason Venner wrote:

I applied the patch in the jira to my distro

Joe Williams wrote:
Thanks Jason, until this is implemented are how are you pulling 
stats from Hadoop?


-joe


Jason Venner wrote:

Check out

https://issues.apache.org/jira/browse/HADOOP-3422


Joe Williams wrote:
I have been attempting to get Hadoop metrics into Ganglia and 
have been unsuccessful thus far. I have seen this thread 
(http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200712.mbox/raw/[EMAIL PROTECTED]/) 
but it didn't help much.


I have setup my properties file like so:

[EMAIL PROTECTED] current]# cat 
conf/hadoop-metrics.properties

dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=127.0.0.1:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=127.0.0.1:8649


And if I 'telnet 127.0.0.1  8649' I receive the Ganglia XML 
metrics output without any hadoop specific metrics:



[EMAIL PROTECTED] current]# telnet 127.0.0.1  8649
Trying 127.0.0.1...
Connected to localhost (127.0.0.1).
Escape character is '^]'.
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
  <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>
  <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>
  <!ATTLIST GANGLIA_XML SOURCE CDATA #REQUIRED>
--SNIP--


Is there more I need to do to get the metrics to show up in 
this output, am I doing something incorrectly? Do I need to 
have a gmetric script run in a cron to update the stats? If so, 
does anyone have a hadoop specific example of this?


Any info would be helpful.

Thanks.
-Joe












--
Name: Joseph A. Williams
Email: [EMAIL PROTECTED]



Re: Hadoop and Fedora Core 6 Adventure, Need Help ASAP

2008-07-24 Thread hadoop hadoop-chetan
Hello Folks

   If somebody has successfully installed Hadoop on FC 6, please help
!!!

   I am just bootstrapping into the Hadoop madness and was attempting to install
Hadoop on Fedora Core 6.
   I tried all sorts of things but couldn't get past this error, which keeps the
reduce tasks from starting:

2008-07-24 13:04:06,642 INFO org.apache.hadoop.mapred.TaskInProgress: Error
from task_200807241301_0001_r_00_0: java.lang.NullPointerException
at java.util.Hashtable.get(Hashtable.java:334)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier.fetchOutputs(ReduceTask.java:1103)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:328)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)


Before you ask, here are the details:

 1. Running hadoop as a single node cluster
 2. Disabled IPV6
 3. Using Hadoop version */hadoop-0.17.1/*
 4. enabled ssh to access local machine
 5. Master and Slaves are set to localhost
 6. Created simple sample file and loaded into DFS
 7. Encountered error when I was running the sample with the wordcount
example provided with the package
 8. Here is my hadoop-site.xml

<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>1</value>
  <description>
    define mapred.map tasks to be number of slave hosts
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>
    define mapred.reduce tasks to be number of slave hosts
  </description>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1800m</value>
  <description>Java opts for the task tracker child processes.
  The following symbol, if present, will be interpolated: @taskid@ is
  replaced by current TaskID. Any other occurrences of '@' will go unchanged.
  For example, to enable verbose gc logging to a file named for the taskid in
  /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of:
  -Xmx1024m -verbose:gc -Xloggc:/tmp/@[EMAIL PROTECTED]
  </description>
</property>

</configuration>


Re: can hadoop read files backwards

2008-07-24 Thread Elia Mazzawi

never mind i got it.

Elia Mazzawi wrote:

I need some help with the implementation,  to have the mapper produce
key=id, value = type,timestamp
which is essentially string, string

what do i give output.collect for the Value,  i want to store type, 
timestamp it only takes Text, IntWritable but i want to store Text, 
Text ? or what can i store in there.


here is my reducer which doesn't work because output.collect doesn't 
want Text, Text


   public static class Map extends MapReduceBase implements 
Mapper<LongWritable, Text, Text, IntWritable> {

   private Text Key = new Text();
   private Text Value = new Text();

   public void map(LongWritable key, Text value, 
OutputCollector<Text, IntWritable> output, Reporter reporter) throws 
IOException {

   String line = value.toString();

//   line is parsed and now i have 2 strings
//   String S1;   // contains the key
//   String S2;  // contains  the value

   Key.set(S1);
   Value.set(S2);
   output.collect(Key, Value);
   }
   }


Miles Osborne wrote:

unless you have a gigantic number of items with the same id, this is
straightforward.  have a mapper emit items of the form:

key=id, value = type,timestamp

and your reducer will then see all ids that have the same value 
together.

it is then a simple matter to process all items with the same id.  for
example, you could simply read them into a list and work on them in any
manner you see fit.

(note that hadoop is perfectly fine at dealing with multi-line 
items.  all

you need do is make sure that the items you want to process together all
share the same key)

Miles

2008/7/18 Elia Mazzawi [EMAIL PROTECTED]:

 

well here is the problem I'm trying to solve,

I have a data set that looks like this:

ID   type   Timestamp

A1   X      1215647404
A2   X      1215647405
A3   X      1215647406
A1   Y      1215647409

I want to count how many A1 Y, show up within 5 seconds of an A1 X

I was planning to have the data sorted by ID then timestamp,
then read it backwards,  (or have it sorted by reverse timestamp)

go through it cashing all Y's for the same ID for 5 seconds to 
either find

a matching X or not.

the results don't need to be 100% accurate.

so if hadoop gives the same file with the same lines in order then this
will work.

seems hadoop is really good at solving problems that depend on 1 
line at a

time? but not multi lines?

hadoop has to get data in order, and be able to work on multi lines,
otherwise how can it be setting records in data sorts.

I'd appreciate other suggestions to go about doing this.

Jim R. Wilson wrote:

   

does wordcount get the lines in order? or are they random? can i have
 

hadoop return them in reverse order?




You can't really depend on the order that the lines are given - it's
best to think of them as random.  The purpose of MapReduce/Hadoop is
to distribute a problem among a number of cooperating nodes.

The idea is that any given line can be interpreted separately,
completely independent of any other line.  So in wordcount, this makes
sense.  For example, say you and I are nodes. Each of us gets half the
lines in a file and we can count the words we see and report on them -
it doesn't matter what order we're given the lines, or which lines
we're given, or even whether we get the same number of lines (if
you're faster at it, or maybe you get shorter lines, you may get more
lines to process in the interest of saving time).

So if the project you're working on requires getting the lines in a
particular order, then you probably need to rethink your approach. It
may be that hadoop isn't right for your problem, or maybe that the
problem just needs to be attacked in a different way.  Without knowing
more about what you're trying to achieve, I can't offer any specifics.

Good luck!

-- Jim

On Thu, Jul 17, 2008 at 4:41 PM, Elia Mazzawi
[EMAIL PROTECTED] wrote:


 

I have a program based on wordcount.java
and I have files that are smaller than 64mb files (so i believe 
each file

is
one task )

do does wordcount get the lines in order? or are they random? can 
i have

hadoop return them in reverse order?

Jim R. Wilson wrote:


   
It sounds to me like you're talking about hadoop streaming 
(correct me

if I'm wrong there).  In that case, there's really no order to the
lines being doled out as I understand it.  Any given line could be
handed to any given mapper task running on any given node.

I may be wrong, of course, someone closer to the project could give
you the right answer in that case.

-- Jim R. Wilson (jimbojw)

On Thu, Jul 17, 2008 at 4:06 PM, Elia Mazzawi
[EMAIL PROTECTED] wrote:



 
is there a way to have hadoop hand over the lines of a file 
backwards

to
my
mapper ?

as in give the last line first.








  






Help Need to get Hadoop on Fedora Core 6

2008-07-24 Thread hadoop hadoop-chetan
Hello Folks

   If somebody has successfully installed Hadoop on FC 6, please help
!!!

   I am just bootstrapping into the Hadoop madness and was attempting to install
Hadoop on Fedora Core 6.
   I tried all sorts of things but couldn't get past this error, which keeps the
reduce tasks from starting:

2008-07-24 13:04:06,642 INFO org.apache.hadoop.mapred.TaskInProgress: Error
from task_200807241301_0001_r_00_0: java.lang.NullPointerException
at java.util.Hashtable.get(Hashtable.java:334)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier.fetchOutputs(ReduceTask.java:1103)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:328)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)


Before you ask, here are the details:

 1. Running hadoop as a single node cluster
 2. Disabled IPV6
 3. Using Hadoop version */hadoop-0.17.1/*
 4. enabled ssh to access local machine
 5. Master and Slaves are set to localhost
 6. Created simple sample file and loaded into DFS
 7. Encountered error when I was running the sample with the wordcount
example provided with the package
 8. Here is my hadoop-site.xml

<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>1</value>
  <description>
    define mapred.map tasks to be number of slave hosts
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>
    define mapred.reduce tasks to be number of slave hosts
  </description>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1800m</value>
  <description>Java opts for the task tracker child processes.
  The following symbol, if present, will be interpolated: @taskid@ is
  replaced by current TaskID. Any other occurrences of '@' will go unchanged.
  For example, to enable verbose gc logging to a file named for the taskid in
  /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of:
  -Xmx1024m -verbose:gc -Xloggc:/tmp/@[EMAIL PROTECTED]
  </description>
</property>

</configuration>


Re: Bean Scripting Framework?

2008-07-24 Thread Andreas Kostyrka
On Thursday 24 July 2008 21:40:20 Lincoln Ritter wrote:
 Hello all.

 Has anybody ever tried/considered using the Bean Scripting Framework
 within Hadoop?  BSF seems nice since it allows two-way communication
 between ruby and java.  I'd love to hear your thoughts as I've been
 trying to make this work to allow using ruby in the m/r pipeline.  For
 now, I don't need a fully general solution, I'd just like to call some
 ruby in my map or reduce tasks.

Why not use jruby? AFAIK, there is a complete ruby implementation on top of 
Java, and although I have not used it, I'd presume that it allows full usage 
of Java classes, as Jython does.

Andreas


signature.asc
Description: This is a digitally signed message part.


Re: Bean Scripting Framework?

2008-07-24 Thread Andreas Kostyrka
On Thursday 24 July 2008 23:24:19 Lincoln Ritter wrote:
  Why not use jruby?

 Indeed!  I'm basically working from the JRuby wiki page on Java
 integration (http://wiki.jruby.org/wiki/Java_Integration).  I'm taking
 this one step at a time and, while I would love tighter integration,
 the recommended way is through the scripting frameworks.

 Right now, I most interested in taking some baby steps before going
 more general.  I welcome any and all feedback/suggestions.  Especially
 if you have tried this.  I will post any results if there is interest,
 but mostly I am trying to accomplish a pretty small task and am not
 yet thinking about a more general solution.

Guess I won't be a big resource for you then; the only thing I did was 
implement a tar program with Jython that creates/extracts archives from/to HDFS.

It was painful, but not too painful, and it's not Jython's fault; it's just that 
using these clunky interfaces/classes is painful for a Python developer. I guess 
the same feeling will hit Ruby developers.

(And that's not a problem with Hadoop specifically; I think most Java APIs feel 
clunky to people used to more powerful languages. :-P)

Andreas




Trying to write to HDFS from mapreduce.

2008-07-24 Thread Erik Holstad
Hi!
I'm writing a MapReduce job where I want the output from the mapper to go
straight to HDFS without passing through the reduce step. I have been told
that I can do c.setOutputFormat(TextOutputFormat.class); I also added:
Path path = new Path("user");
FileOutputFormat.setOutputPath(c, path);

But I still ended up with the result in the local filesystem instead.

Regards Erik


Re: Trying to write to HDFS from mapreduce.

2008-07-24 Thread s29752-hadoopuser
I think your conf is incorrectly set and your job was run locally.  Also, have 
you done jobconf.setNumReduceTasks(0)?  Try running some example jobs to test 
your setting.

Nicholas Sze




- Original Message 
 From: Erik Holstad [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org
 Sent: Thursday, July 24, 2008 3:17:40 PM
 Subject: Trying to write to HDFS from mapreduce.
 
 Hi!
 I'm writing a MapReduce job where I want the output from the mapper to go
 straight to HDFS without passing through the reduce step. I have been told
 that I can do c.setOutputFormat(TextOutputFormat.class); I also added:
 Path path = new Path("user");
 FileOutputFormat.setOutputPath(c, path);
 
 But I still ended up with the result in the local filesystem instead.
 
 Regards Erik
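
For anyone who lands on this thread later, here is a minimal sketch of a
map-only job against the old org.apache.hadoop.mapred API of that era. The
class names and paths are illustrative, not Erik's actual code; the essential
pieces are setNumReduceTasks(0) and output paths that resolve against an
fs.default.name pointing at HDFS.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class MapOnlyToHdfs {

  // Pass-through mapper: whatever it collects is written directly to the output files.
  public static class PassThroughMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, LongWritable, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, Text> output,
                    Reporter reporter) throws IOException {
      output.collect(key, value);
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf c = new JobConf(MapOnlyToHdfs.class);
    c.setJobName("map-only-to-hdfs");

    c.setNumReduceTasks(0);                  // no reduce phase: map output is the job output
    c.setMapperClass(PassThroughMapper.class);

    c.setInputFormat(TextInputFormat.class);
    c.setOutputFormat(TextOutputFormat.class);
    c.setOutputKeyClass(LongWritable.class);
    c.setOutputValueClass(Text.class);

    // Relative paths resolve against fs.default.name, so with
    // fs.default.name=hdfs://... the output lands in HDFS.
    FileInputFormat.setInputPaths(c, new Path(args[0]));
    FileOutputFormat.setOutputPath(c, new Path(args[1]));

    JobClient.runJob(c);
  }
}

If the output still lands on the local disk, the usual culprit is the client
not picking up a hadoop-site.xml with fs.default.name and mapred.job.tracker
set, i.e. the job ran with the local runner, as Nicholas suggests.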



Re: Bean Scripting Framework?

2008-07-24 Thread Lincoln Ritter
Andreas,

If you wouldn't mind posting some snippets, that would be great!  There
seems to be a general lack of examples out there, so pretty much
anything would help.

-lincoln

--
lincolnritter.com



On Thu, Jul 24, 2008 at 3:06 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
 On Thursday 24 July 2008 23:24:19 Lincoln Ritter wrote:
  Why not use jruby?

 Indeed!  I'm basically working from the JRuby wiki page on Java
 integration (http://wiki.jruby.org/wiki/Java_Integration).  I'm taking
 this one step at a time and, while I would love tighter integration,
 the recommended way is through the scripting frameworks.

 Right now, I'm most interested in taking some baby steps before going
 more general.  I welcome any and all feedback/suggestions.  Especially
 if you have tried this.  I will post any results if there is interest,
 but mostly I am trying to accomplish a pretty small task and am not
 yet thinking about a more general solution.

 Guess I won't be a big resource for you then; the only thing I did was
 implement a tar program with Jython that creates/extracts archives from/to HDFS.

 It was painful, but not too painful, and it's not Jython's fault; it's just that
 using these clunky interfaces/classes is painful for a Python developer. I guess
 the same feeling will hit Ruby developers.

 (And that's not a problem with Hadoop specifically; I think most Java APIs feel
 clunky to people used to more powerful languages. :-P)

 Andreas



Re: Bean Scripting Framework?

2008-07-24 Thread James Moore
Funny you should mention it - I'm working on a framework to do JRuby
Hadoop this week.  Something like:

class MyHadoopJob < Radoop
  input_format :text_input_format
  output_format :text_output_format
  map_output_key_class :text
  map_output_value_class :text

  def mapper(k, v, output, reporter)
# ...
  end

  def reducer(k, vs, output, reporter)
  end
end

Plus a java glue file to call the Ruby stuff.

And then it jars up the ruby files, the gem directory, and goes from there.

-- 
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com


Re: Bean Scripting Framework?

2008-07-24 Thread Lincoln Ritter
Well that sounds awesome!  It would be simply splendid to see what
you've got if you're willing to share.

Are you going the 'direct' embedding route or using a scripting
framework (BSF or javax.script)?

-lincoln

--
lincolnritter.com



On Thu, Jul 24, 2008 at 3:42 PM, James Moore [EMAIL PROTECTED] wrote:
 Funny you should mention it - I'm working on a framework to do JRuby
 Hadoop this week.  Something like:

 class MyHadoopJob < Radoop
  input_format :text_input_format
  output_format :text_output_format
  map_output_key_class :text
  map_output_value_class :text

  def mapper(k, v, output, reporter)
# ...
  end

  def reducer(k, vs, output, reporter)
  end
 end

 Plus a java glue file to call the Ruby stuff.

 And then it jars up the ruby files, the gem directory, and goes from there.

 --
 James Moore | [EMAIL PROTECTED]
 Ruby and Ruby on Rails consulting
 blog.restphone.com



Need help to setup Hadoop on Fedora Core 6

2008-07-24 Thread hadoop hadoop-chetan
Hello Folks

   If somebody has successfully installed Hadoop on FC 6, please help!!!

   I'm just bootstrapping into the Hadoop madness and was attempting to install
Hadoop on Fedora Core 6.
   I tried all sorts of things but couldn't get past this error, which keeps
the reduce tasks from starting:

2008-07-24 13:04:06,642 INFO org.apache.hadoop.mapred.TaskInProgress: Error
from task_200807241301_0001_r_00_0: java.lang.NullPointerException
at java.util.Hashtable.get(Hashtable.java:334)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier.fetchOutputs(ReduceTask.java:1103)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:328)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)


Before you ask, here are the details:

 1. Running hadoop as a single node cluster
 2. Disabled IPV6
 3. Using Hadoop version hadoop-0.17.1
 4. enabled ssh to access local machine
 5. Master and Slaves are set to localhost
 6. Created simple sample file and loaded into DFS
 7. Encountered error when I was running the sample with the wordcount
example provided with the package
 8. Here is my hadoop-site.xml

<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>1</value>
  <description>
    define mapred.map tasks to be number of slave hosts
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>
    define mapred.reduce tasks to be number of slave hosts
  </description>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1800m</value>
  <description>Java opts for the task tracker child processes.
  The following symbol, if present, will be interpolated: @taskid@ is
  replaced by current TaskID. Any other occurrences of '@' will go unchanged.
  For example, to enable verbose gc logging to a file named for the taskid in
  /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of:
  -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
  </description>
</property>

</configuration>


Re: Name node heap space problem

2008-07-24 Thread Taeho Kang
Check how much memory is allocated to the JVM running the namenode.

In the file HADOOP_INSTALL/conf/hadoop-env.sh,
you should change the line that starts with export HADOOP_HEAPSIZE=1000

It's set to 1 GB by default.
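
For example, to give the daemons a 4 GB heap (the number is only illustrative;
size it to your file and block counts and the machine's physical RAM), the line
would become:

export HADOOP_HEAPSIZE=4000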


On Fri, Jul 25, 2008 at 2:51 AM, Gert Pfeifer [EMAIL PROTECTED]
wrote:

 Update on this one...

 I put some more memory in the machine running the name node. Now fsck is
 running. Unfortunately ls fails with a time-out.

 I identified one directory that causes the trouble. I can run fsck on it
 but not ls.

 What could be the problem?

 Gert

 Gert Pfeifer schrieb:

 Hi,
 I am running a Hadoop DFS on a cluster of 5 data nodes with a name node
 and one secondary name node.

 I have 1788874 files and directories, 1465394 blocks = 3254268 total.
 Heap Size max is 3.47 GB.

 My problem is that I produce many small files. Therefore I have a cron
 job which just runs daily across the new files and copies them into
 bigger files and deletes the small files.

 Apart from this program, even an fsck kills the cluster.

 The problem is that, as soon as I start this program, the heap space of
 the name node reaches 100%.

 What could be the problem? There are not many small files right now and
 still it doesn't work. I guess we have had this problem since the upgrade to
 0.17.

 Here is some additional data about the DFS:
 Capacity :   2 TB
 DFS Remaining   :   1.19 TB
 DFS Used:   719.35 GB
 DFS Used%   :   35.16 %

 Thanks for hints,
 Gert





Re: Bean Scripting Framework?

2008-07-24 Thread James Moore
On Thu, Jul 24, 2008 at 3:51 PM, Lincoln Ritter
[EMAIL PROTECTED] wrote:
 Well that sounds awesome!  It would be simply splendid to see what
 you've got if you're willing to share.

I'll be happy to share, but it's pretty much in pieces, not ready for
release.  I'll put it out with whatever license Hadoop itself uses
(presumably Apache).


 Are you going the 'direct' embedding route or using a scripting frame
 work (BSF or javax.script)?

JSR 223 is the way to go according to the JRuby guys at RailsConf last
month.  It's pretty straightforward - see
http://wiki.jruby.org/wiki/Java_Integration#Java_6_.28using_JSR_223:_Scripting.29

-- 
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
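
For what it's worth, the JSR 223 route from the Java side looks roughly like
the sketch below. It assumes Java 6's javax.script plus jruby.jar and the
JRuby JSR 223 engine jar on the classpath; the variable name and script body
are only illustrative.

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class JRubyFromJava {
  public static void main(String[] args) throws ScriptException {
    ScriptEngineManager manager = new ScriptEngineManager();
    ScriptEngine ruby = manager.getEngineByName("jruby");
    if (ruby == null) {
      throw new IllegalStateException("JRuby JSR 223 engine not found on the classpath");
    }

    // Values put into the engine show up on the Ruby side as globals ($greeting here),
    // and eval() hands the result back as a Java Object.
    ruby.put("greeting", "hello from the map task");
    Object result = ruby.eval("$greeting.upcase");
    System.out.println(result);
  }
}

The BSF route works too, but JSR 223 ships with Java 6, so no extra framework
is needed beyond the engine jar.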


Re: Bean Scripting Framework?

2008-07-24 Thread Venkat Seeth
Why don't you use Hadoop Streaming?


- Original Message 
From: Lincoln Ritter [EMAIL PROTECTED]
To: core-user core-user@hadoop.apache.org
Sent: Friday, July 25, 2008 1:10:20 AM
Subject: Bean Scripting Framework?

Hello all.

Has anybody ever tried/considered using the Bean Scripting Framework
within Hadoop?  BSF seems nice since it allows two-way communication
between ruby and java.  I'd love to hear your thoughts as I've been
trying to make this work to allow using ruby in the m/r pipeline.  For
now, I don't need a fully general solution, I'd just like to call some
ruby in my map or reduce tasks.

Thanks!

-lincoln

--
lincolnritter.com