Re: Bean Scripting Framework?
Why don't you use Hadoop Streaming? - Original Message From: Lincoln Ritter <[EMAIL PROTECTED]> To: core-user Sent: Friday, July 25, 2008 1:10:20 AM Subject: Bean Scripting Framework? Hello all. Has anybody ever tried/considered using the Bean Scripting Framework within Hadoop? BSF seems nice since it allows "two-way" communication between ruby and java. I'd love to hear your thoughts as I've been trying to make this work to allow using ruby in the m/r pipeline. For now, I don't need a fully general solution, I'd just like to call some ruby in my map or reduce tasks. Thanks! -lincoln -- lincolnritter.com
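For reference, a minimal Hadoop Streaming invocation with Ruby scripts might look like the sketch below; the script names and HDFS paths are hypothetical, and the streaming jar's exact filename varies by release:

  bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
      -input /user/lincoln/input \
      -output /user/lincoln/output \
      -mapper map.rb \
      -reducer reduce.rb \
      -file map.rb \
      -file reduce.rb

Streaming feeds input records to the scripts on stdin and reads tab-separated key/value lines back on stdout, so the Ruby side needs no Java integration at all, at the cost of the "two-way" object access that BSF or embedded JRuby would give.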
Re: Bean Scripting Framework?
On Thu, Jul 24, 2008 at 3:51 PM, Lincoln Ritter <[EMAIL PROTECTED]> wrote: > Well that sounds awesome! It would be simply splendid to see what > you've got if you're willing to share. I'll be happy to share, but it's pretty much in pieces, not ready for release. I'll put it out with whatever license Hadoop itself uses (presumably Apache). > > Are you going the 'direct' embedding route or using a scripting framework > (BSF or javax.script)? JSR 223 is the way to go according to the JRuby guys at RailsConf last month. It's pretty straightforward - see http://wiki.jruby.org/wiki/Java_Integration#Java_6_.28using_JSR_223:_Scripting.29 -- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com
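For anyone trying the JSR 223 route, a minimal sketch of calling Ruby from Java looks like this; it assumes Java 6 with the JRuby engine jar on the classpath, and the variable names are illustrative:

  import javax.script.ScriptEngine;
  import javax.script.ScriptEngineManager;
  import javax.script.ScriptException;

  public class JRubyFromJava {
      public static void main(String[] args) throws ScriptException {
          ScriptEngine ruby = new ScriptEngineManager().getEngineByName("jruby");
          // Objects put into the engine show up in Ruby as global variables.
          ruby.put("record", "some,map,input");
          Object result = ruby.eval("$record.split(',').map { |f| f.upcase }.join('|')");
          System.out.println(result); // SOME|MAP|INPUT
      }
  }

Reusing one engine instance across map() calls matters in practice, since engine startup is expensive relative to a single record.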
Re: Name node heap space problem
Check how much memory is allocated for the JVM running the namenode. In the file HADOOP_INSTALL/conf/hadoop-env.sh you should change the line that starts with "export HADOOP_HEAPSIZE=1000". The value is in megabytes, so it's set to 1 GB by default. On Fri, Jul 25, 2008 at 2:51 AM, Gert Pfeifer <[EMAIL PROTECTED]> wrote: > Update on this one... > > I put some more memory in the machine running the name node. Now fsck is > running. Unfortunately ls fails with a time-out. > > I identified one directory that causes the trouble. I can run fsck on it > but not ls. > > What could be the problem? > > Gert > > Gert Pfeifer wrote: > > Hi, >> I am running a Hadoop DFS on a cluster of 5 data nodes with a name node >> and one secondary name node. >> >> I have 1788874 files and directories, 1465394 blocks = 3254268 total. >> Heap Size max is 3.47 GB. >> >> My problem is that I produce many small files. Therefore I have a cron >> job which just runs daily across the new files and copies them into >> bigger files and deletes the small files. >> >> Apart from this program, even a fsck kills the cluster. >> >> The problem is that, as soon as I start this program, the heap space of >> the name node reaches 100 %. >> >> What could be the problem? There are not many small files right now and >> still it doesn't work. I guess we have this problem since the upgrade to >> 0.17. >> >> Here is some additional data about the DFS: >> Capacity : 2 TB >> DFS Remaining : 1.19 TB >> DFS Used: 719.35 GB >> DFS Used% : 35.16 % >> >> Thanks for hints, >> Gert >> > >
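For example, giving the namenode JVM a 3 GB heap means editing that line as below; the value is illustrative and should be sized to the number of files and blocks in the filesystem:

  # conf/hadoop-env.sh
  # The maximum amount of heap to use, in MB. Default is 1000.
  export HADOOP_HEAPSIZE=3000

Note the setting applies to every Hadoop daemon started on that machine, not just the namenode.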
Need help to setup Hadoop on Fedora Core 6
Hello Folks. If somebody has successfully installed Hadoop on FC 6, please help!!! Just bootstrapping into the Hadoop madness and was attempting to install hadoop on Fedora Core 6. Tried all sorts of things but couldn't get past this error, which is not starting the reduce tasks:

2008-07-24 13:04:06,642 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807241301_0001_r_00_0: java.lang.NullPointerException
  at java.util.Hashtable.get(Hashtable.java:334)
  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.fetchOutputs(ReduceTask.java:1103)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:328)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

Before you ask, here are the details:
1. Running hadoop as a single node cluster
2. Disabled IPv6
3. Using Hadoop version hadoop-0.17.1
4. Enabled ssh to access the local machine
5. Master and Slaves are set to localhost
6. Created a simple sample file and loaded it into DFS
7. Encountered the error when running the sample with the wordcount example provided with the package
8. Here is my hadoop-site.xml:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>1</value>
    <description>define mapred.map tasks to be number of slave hosts</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
    <description>define mapred.reduce tasks to be number of slave hosts</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1800m</value>
    <description>Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@[EMAIL PROTECTED]</description>
  </property>
</configuration>
Re: Bean Scripting Framework?
Well that sounds awesome! It would be simply splendid to see what you've got if you're willing to share. Are you going the 'direct' embedding route or using a scripting framework (BSF or javax.script)? -lincoln -- lincolnritter.com On Thu, Jul 24, 2008 at 3:42 PM, James Moore <[EMAIL PROTECTED]> wrote: > Funny you should mention it - I'm working on a framework to do JRuby > Hadoop this week. Something like: > > class MyHadoopJob < Radoop > input_format :text_input_format > output_format :text_output_format > map_output_key_class :text > map_output_value_class :text > > def mapper(k, v, output, reporter) ># ... > end > > def reducer(k, vs, output, reporter) > end > end > > Plus a java glue file to call the Ruby stuff. > > And then it jars up the ruby files, the gem directory, and goes from there. > > -- > James Moore | [EMAIL PROTECTED] > Ruby and Ruby on Rails consulting > blog.restphone.com >
Re: Bean Scripting Framework?
Funny you should mention it - I'm working on a framework to do JRuby Hadoop this week. Something like: class MyHadoopJob < Radoop input_format :text_input_format output_format :text_output_format map_output_key_class :text map_output_value_class :text def mapper(k, v, output, reporter) # ... end def reducer(k, vs, output, reporter) end end Plus a java glue file to call the Ruby stuff. And then it jars up the ruby files, the gem directory, and goes from there. -- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com
Re: Bean Scripting Framework?
Andreas, If you wouldn't mind posting some snippets that would be great! There seems to be a general lack of examples out there so pretty much anything would help. -lincoln -- lincolnritter.com On Thu, Jul 24, 2008 at 3:06 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote: > On Thursday 24 July 2008 23:24:19 Lincoln Ritter wrote: >> > Why not use jruby? >> >> Indeed! I'm basically working from the JRuby wiki page on Java >> integration (http://wiki.jruby.org/wiki/Java_Integration). I'm taking >> this one step at a time and, while I would love tighter integration, >> the recommended way is through the scripting frameworks. >> >> Right now, I'm most interested in taking some baby steps before going >> more general. I welcome any and all feedback/suggestions. Especially >> if you have tried this. I will post any results if there is interest, >> but mostly I am trying to accomplish a pretty small task and am not >> yet thinking about a more general solution. > > Guess I won't be a big resource for you then, the only thing that I did was > implementing a tar program with Jython that creates/extracts from/to HDFS. > > It was painful, but not too painful, and it's not Jython's fault, it's just that > using these clunky interfaces/classes is painful to a Python developer. Guess > the same feeling will come from Ruby developers. > > (and that's not a problem of Hadoop, I think that most Java APIs feel clunky > to people used to more powerful languages. :-P) > > Andreas >
Re: Trying to write to HDFS from mapreduce.
I think your conf is incorrectly set and your job was run locally. Also, have you done jobconf.setNumReduceTasks(0)? Try running some example jobs to test your setting. Nicholas Sze - Original Message > From: Erik Holstad <[EMAIL PROTECTED]> > To: core-user@hadoop.apache.org > Sent: Thursday, July 24, 2008 3:17:40 PM > Subject: Trying to write to HDFS from mapreduce. > > Hi! > I'm writing a mapreduce job where I want the output from the mapper to go > straight to the HDFS without passing the reduce method. Have been told that I can do: > c.setOutputFormat(TextOutputFormat.class); also added > Path path = new Path("user"); > FileOutputFormat.setOutputPath(c, path); > > But I still ended up with the result in the local filesystem instead. > > Regards Erik
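To make that concrete, a map-only job that writes straight to HDFS needs roughly the following configuration; this is a sketch, and the class name and output URI are hypothetical:

  JobConf c = new JobConf(MyJob.class);
  c.setNumReduceTasks(0);                       // map-only: mapper output is final, no sort/shuffle
  c.setOutputFormat(TextOutputFormat.class);
  // A relative path like "user" resolves against fs.default.name; with the
  // default local configuration that means the local filesystem.
  FileOutputFormat.setOutputPath(c, new Path("hdfs://localhost:54310/user/erik/out"));

Using an explicit hdfs:// URI (or making sure fs.default.name and mapred.job.tracker point at the cluster) is what keeps the output from silently landing on local disk.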
Re: can hadoop read files backwards
On Fri, Jul 18, 2008 at 2:06 PM, Miles Osborne <[EMAIL PROTECTED]> wrote: > unless you have a gigantic number of items with the same id, this is > straightforward. have a mapper emit items of the form: > > key=id, value = type,timestamp Or if you do have a large (by hadoop standards) number of items with the same id, use the timestamp + id for the key, emit one row for timestamp through timestamp + 5, and put a unique identifier in the row. I think you can get a guaranteed-unique id from mapred.task.id (but check me on that), and just add a counter to that:

ID   type   Timestamp
A1   X      1215647404
A1   Y      1215647408

becomes

1215647404/a1, x, uniqueidX
1215647405/a1, x, uniqueidX
1215647406/a1, x, uniqueidX
1215647407/a1, x, uniqueidX
1215647408/a1, x, uniqueidX
1215647408/a1, y, uniqueidY
1215647409/a1, y, uniqueidY
1215647410/a1, y, uniqueidY
etc.

If a key has a uniqueX, then write all the uniqueYs. Then the problem just becomes WordCount as a second pass. (Someone more clever than myself can probably do this in one pass...) Your mapper ends up spitting out 5x more rows, but your reducer has many fewer rows to keep in memory. At Hadoop scales, that might matter. -- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com
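A sketch of that window-expansion mapper, under the assumptions that input lines are tab-separated id/type/timestamp fields and that X events should stay visible for the following 5 seconds (the field layout and class name are illustrative):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class WindowMap extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private String taskId;
    private long counter = 0;

    public void configure(JobConf conf) {
      taskId = conf.get("mapred.task.id"); // unique per task attempt
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] f = line.toString().split("\t"); // id, type, timestamp
      long ts = Long.parseLong(f[2]);
      String uid = taskId + "-" + (counter++);  // task id + counter => unique per record
      // Replicate X rows across the 5-second window; emit Y rows once.
      long span = "X".equals(f[1]) ? 5 : 0;
      for (long t = ts; t <= ts + span; t++) {
        out.collect(new Text(t + "/" + f[0]), new Text(f[1] + "," + uid));
      }
    }
  }

Each reducer key then groups one second of one id, holding at most a handful of values, which is the point of the scheme.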
Trying to write to HDFS from mapreduce.
Hi! I'm writing a mapreduce job where I want the output from the mapper to go straight to the HDFS without passing the reduce method. Have been told that I can do: c.setOutputFormat(TextOutputFormat.class); also added Path path = new Path("user"); FileOutputFormat.setOutputPath(c, path); But I still ended up with the result in the local filesystem instead. Regards Erik
Re: Bean Scripting Framework?
On Thursday 24 July 2008 23:24:19 Lincoln Ritter wrote: > > Why not use jruby? > > Indeed! I'm basically working from the JRuby wiki page on Java > integration (http://wiki.jruby.org/wiki/Java_Integration). I'm taking > this one step at a time and, while I would love tighter integration, > the recommended way is through the scripting frameworks. > > Right now, I'm most interested in taking some baby steps before going > more general. I welcome any and all feedback/suggestions. Especially > if you have tried this. I will post any results if there is interest, > but mostly I am trying to accomplish a pretty small task and am not > yet thinking about a more general solution. Guess I won't be a big resource for you then, the only thing that I did was implementing a tar program with Jython that creates/extracts from/to HDFS. It was painful, but not too painful, and it's not Jython's fault, it's just that using these clunky interfaces/classes is painful to a Python developer. Guess the same feeling will come from Ruby developers. (and that's not a problem of Hadoop, I think that most Java APIs feel clunky to people used to more powerful languages. :-P) Andreas
Re: Bean Scripting Framework?
> Why not use jruby? Indeed! I'm basically working from the JRuby wiki page on Java integration (http://wiki.jruby.org/wiki/Java_Integration). I'm taking this one step at a time and, while I would love tighter integration, the recommended way is through the scripting frameworks. Right now, I'm most interested in taking some baby steps before going more general. I welcome any and all feedback/suggestions. Especially if you have tried this. I will post any results if there is interest, but mostly I am trying to accomplish a pretty small task and am not yet thinking about a more general solution. -lincoln -- lincolnritter.com On Thu, Jul 24, 2008 at 1:58 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote: > On Thursday 24 July 2008 21:40:20 Lincoln Ritter wrote: >> Hello all. >> >> Has anybody ever tried/considered using the Bean Scripting Framework >> within Hadoop? BSF seems nice since it allows "two-way" communication >> between ruby and java. I'd love to hear your thoughts as I've been >> trying to make this work to allow using ruby in the m/r pipeline. For >> now, I don't need a fully general solution, I'd just like to call some >> ruby in my map or reduce tasks. > > Why not use jruby? AFAIK, there is a complete ruby implementation on top of > Java, and although I have not used it, I'd presume that it allows full usage > of Java classes, as Jython does. > > Andreas >
Re: Bean Scripting Framework?
On Thursday 24 July 2008 21:40:20 Lincoln Ritter wrote: > Hello all. > > Has anybody ever tried/considered using the Bean Scripting Framework > within Hadoop? BSF seems nice since it allows "two-way" communication > between ruby and java. I'd love to hear your thoughts as I've been > trying to make this work to allow using ruby in the m/r pipeline. For > now, I don't need a fully general solution, I'd just like to call some > ruby in my map or reduce tasks. Why not use jruby? AFAIK, there is a complete ruby implementation on top of Java, and although I have not used it, I'd presume that it allows full usage of Java classes, as Jython does. Andreas
Re: can hadoop read files backwards
never mind i got it. Elia Mazzawi wrote: [quoted thread trimmed]
Re: Hadoop and Ganglia Metrics
Ah, yeah, I found that one. :) Patching 'java/org/apache/hadoop/mapred/JobInProgress.java' on 0.17.1. -joe Jason Venner wrote: I have only applied this patch as far forward as 0.16.0 [earlier quoted messages trimmed]
Re: Hadoop and Ganglia Metrics
I have only applied this patch as far forward as 0.16.0 Joe Williams wrote: Sweet, thanks. [earlier quoted messages trimmed]
Hadoop DFS
Hi, I am new to Hadoop. Right now, I am only interested in working with the Hadoop DFS. Can someone guide me where to start? Does anyone have information about applications that have already integrated the Hadoop DFS? Any information regarding material about Hadoop DFS (case studies, articles, books, etc.) would be very nice. Thanks, Wasim
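As a concrete starting point, HDFS is usable directly from Java through the FileSystem API. A minimal sketch (the path is hypothetical, and the cluster's conf directory must be on the classpath so fs.default.name points at the namenode rather than the local filesystem):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsHello {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();  // reads hadoop-site.xml from the classpath
          FileSystem fs = FileSystem.get(conf);      // HDFS if fs.default.name is hdfs://...
          Path p = new Path("/user/wasim/hello.txt");
          FSDataOutputStream out = fs.create(p);
          out.writeUTF("hello, dfs");
          out.close();
          System.out.println("exists: " + fs.exists(p));
      }
  }

The same API backs the bin/hadoop fs -put / -get / -ls shell commands, which are the quickest way to poke at a running DFS.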
about the overhead
Hi all, Does hadoop provide a way to let users know the time spent in computation (the map/reduce functions) and the time spent in different types of overhead (such as startup, sorting, disk I/O, etc.) respectively? Thanks~~ Best regards, -- --- Wei
Re: Hadoop and Ganglia Metrics
Sweet, thanks. Jason Venner wrote: Once the patch is applied you should start seeing the ganglia metrics. We do. [earlier quoted messages trimmed]
Re: hadoop 0.17.1 reducer not fetching map output problem
On Thursday 24 July 2008 21:40:22 Devaraj Das wrote: > On 7/25/08 12:09 AM, "Andreas Kostyrka" <[EMAIL PROTECTED]> wrote: > > On Thursday 24 July 2008 15:19:22 Devaraj Das wrote: > >> Could you try to kill the tasktracker hosting the task the next time > >> when it happens? I just want to isolate the problem - whether it is a > >> problem in the TT-JT communication or in the Task-TT communication. From > >> your description it looks like the problem is between the JT-TT > >> communication. But pls run the experiment when it happens again and let > >> us know what happens. > > > > Well, I did restart the tasktracker where the reduce job was running, but > > that led only to a situation where the jobtracker did not restart the > > job, showed it as still running, and was not able to kill the reduce task > > via hadoop job -kill-task nor -fail-task. > > The reduce task would eventually be reexecuted (after some timeout, > defaulting to 10 minutes, the tasktracker would be assumed as lost and all > reducers that were running on that node would be reexecuted). > > > I hope to avoid a repeat, I'll be relapsing our cluster to 0.15 today. A > > peer at another startup confirmed the whole batch of problems I've been > > experiencing, and for him 0.15 works for production. > > > > > > No question, 0.17 is way better than 0.16, on the other hand I wonder how > > 0.16 could get released? (I'm using streaming.jar, and with 0.16.x I've > > introduced reducing to our workloads, and before 0.16 failed >80% of the > > jobs with reducers not being able to get their output. 0.17.0 improved > > that to a point where one can, with some pain, e.g. restarting the > > cluster daily, not storing anything important on HDFS, only temporary > > data, ..., use it somehow for production, at least for small jobs.) So > > one wonders how 0.16 got released? Or was it meant only as developer-only > > bug fixing series? > > > > Pls raise jiras for the specific problems. I know, that's why I bracketed it as rantmode. OTOH, many of these issues had either this creepy feeling where you wondered if you did something wrong or were issues where I had to react relatively quickly, which usually destroys the faulty state. (I know, as a developer having reproduced a bug is golden. As an admin asked about processing lag, it's rather the opposite.) Plus fixing the issue in the next release or even via a patch means that I have a non-working cluster till then. Now that means I would need to start debugging the cluster utility software instead of our apps. ;( Andreas
Re: Hadoop and Ganglia Metrics
Once the patch is applied you should start seeing the ganglia metrics. We do. Joe Williams wrote: Once I have the patch applied and have it running, should I see the metrics? Or do I need to do additional work? Thanks. -Joe [earlier quoted messages trimmed]
Re: hadoop 0.17.1 reducer not fetching map output problem
On 7/25/08 12:09 AM, "Andreas Kostyrka" <[EMAIL PROTECTED]> wrote: > On Thursday 24 July 2008 15:19:22 Devaraj Das wrote: >> Could you try to kill the tasktracker hosting the task the next time when >> it happens? I just want to isolate the problem - whether it is a problem in >> the TT-JT communication or in the Task-TT communication. From your >> description it looks like the problem is between the JT-TT communication. >> But pls run the experiment when it happens again and let us know what >> happens. > > Well, I did restart the tasktracker where the reduce job was running, but that > led only to a situation where the jobtracker did not restart the job, showed > it as still running, and was not able to kill the reduce task via hadoop > job -kill-task nor -fail-task. The reduce task would eventually be reexecuted (after some timeout, defaulting to 10 minutes, the tasktracker would be assumed as lost and all reducers that were running on that node would be reexecuted). > > I hope to avoid a repeat, I'll be relapsing our cluster to 0.15 today. A peer > at another startup confirmed the whole batch of problems I've been > experiencing, and for him 0.15 works for production. > > > No question, 0.17 is way better than 0.16, on the other hand I wonder how 0.16 > could get released? (I'm using streaming.jar, and with 0.16.x I've introduced > reducing to our workloads, and before 0.16 failed >80% of the jobs with > reducers not being able to get their output. 0.17.0 improved that to a point > where one can, with some pain, e.g. restarting the cluster daily, not storing > anything important on HDFS, only temporary data, ..., use it somehow for > production, at least for small jobs.) So one wonders how 0.16 got released? > Or was it meant only as developer-only bug fixing series? > > Pls raise jiras for the specific problems. > Sorry, this has been driving me up the walls into an asylum till I compared > notes with a colleague, and decided that I'm not crazy ;) > > Andreas > >> >> Thanks, >> Devaraj >> >> On 7/24/08 1:42 PM, "Andreas Kostyrka" <[EMAIL PROTECTED]> wrote: >>> Hi! >>> >>> I'm experiencing hung reducers, with the following symptoms: [quoted task logs trimmed]
Bean Scripting Framework?
Hello all. Has anybody ever tried/considered using the Bean Scripting Framework within Hadoop? BSF seems nice since it allows "two-way" communication between ruby and java. I'd love to hear your thoughts as I've been trying to make this work to allow using ruby in the m/r pipeline. For now, I don't need a fully general solution, I'd just like to call some ruby in my map or reduce tasks. Thanks! -lincoln -- lincolnritter.com
Re: Hadoop and Ganglia Metrics
Once I have the patch applied and have it running, should I see the metrics? Or do I need to do additional work? Thanks. -Joe Jason Venner wrote: I applied the patch in the jira to my distro [earlier quoted messages trimmed]
Re: hadoop 0.17.1 reducer not fetching map output problem
On Thursday 24 July 2008 15:19:22 Devaraj Das wrote: > Could you try to kill the tasktracker hosting the task the next time when > it happens? I just want to isolate the problem - whether it is a problem in > the TT-JT communication or in the Task-TT communication. From your > description it looks like the problem is between the JT-TT communication. > But pls run the experiment when it happens again and let us know what > happens. Well, I did restart the tasktracker where the reduce job was running, but that led only to a situation where the jobtracker did not restart the job, showed it as still running, and was not able to kill the reduce task via hadoop job -kill-task nor -fail-task. I hope to avoid a repeat, I'll be relapsing our cluster to 0.15 today. A peer at another startup confirmed the whole batch of problems I've been experiencing, and for him 0.15 works for production. No question, 0.17 is way better than 0.16, on the other hand I wonder how 0.16 could get released? (I'm using streaming.jar, and with 0.16.x I've introduced reducing to our workloads, and before 0.16 failed >80% of the jobs with reducers not being able to get their output. 0.17.0 improved that to a point where one can, with some pain, e.g. restarting the cluster daily, not storing anything important on HDFS, only temporary data, ..., use it somehow for production, at least for small jobs.) So one wonders how 0.16 got released? Or was it meant only as developer-only bug fixing series? Sorry, this has been driving me up the walls into an asylum till I compared notes with a colleague, and decided that I'm not crazy ;) Andreas > > Thanks, > Devaraj > > On 7/24/08 1:42 PM, "Andreas Kostyrka" <[EMAIL PROTECTED]> wrote: > > Hi! > > > > I'm experiencing hung reducers, with the following symptoms: [quoted task logs trimmed]
Re: can hadoop read files backwards
I need some help with the implementation, to have the mapper produce key=id, value = type,timestamp, which is essentially string, string. What do i give output.collect for the Value? i want to store type, timestamp, but it only takes a single Text and i want to store a pair of them, or what can i store in there. here is my mapper, which doesn't work because output.collect doesn't want the pair:

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    private Text Key = new Text();
    private Text Value = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        // line is parsed and now i have 2 strings
        // String S1; // contains the key
        // String S2; // contains the value
        Key.set(S1);
        Value.set(S2);
        output.collect(Key, Value);
    }
}

Miles Osborne wrote: unless you have a gigantic number of items with the same id, this is straightforward. have a mapper emit items of the form: key=id, value = type,timestamp and your reducer will then see all ids that have the same value together. it is then a simple matter to process all items with the same id. for example, you could simply read them into a list and work on them in any manner you see fit. (note that hadoop is perfectly fine at dealing with multi-line items. all you need do is make sure that the items you want to process together all share the same key) Miles 2008/7/18 Elia Mazzawi <[EMAIL PROTECTED]>: well here is the problem I'm trying to solve, I have a data set that looks like this:

ID   type   Timestamp
A1   X      1215647404
A2   X      1215647405
A3   X      1215647406
A1   Y      1215647409

I want to count how many A1 Y show up within 5 seconds of an A1 X. I was planning to have the data sorted by ID then timestamp, then read it backwards, (or have it sorted by reverse timestamp) go through it caching all Y's for the same ID for 5 seconds to either find a matching X or not. the results don't need to be 100% accurate. so if hadoop gives the same file with the same lines in order then this will work. seems hadoop is really good at solving problems that depend on 1 line at a time? but not multi lines? hadoop has to get data in order, and be able to work on multi lines, otherwise how can it be setting records in data sorts. I'd appreciate other suggestions to go about doing this. Jim R. Wilson wrote: does wordcount get the lines in order? or are they random? can i have hadoop return them in reverse order? You can't really depend on the order that the lines are given - it's best to think of them as random. The purpose of MapReduce/Hadoop is to distribute a problem among a number of cooperating nodes. The idea is that any given line can be interpreted separately, completely independent of any other line. So in wordcount, this makes sense. For example, say you and I are nodes. Each of us gets half the lines in a file and we can count the words we see and report on them - it doesn't matter what order we're given the lines, or which lines we're given, or even whether we get the same number of lines (if you're faster at it, or maybe you get shorter lines, you may get more lines to process in the interest of saving time). So if the project you're working on requires getting the lines in a particular order, then you probably need to rethink your approach. It may be that hadoop isn't right for your problem, or maybe that the problem just needs to be attacked in a different way. Without knowing more about what you're trying to achieve, I can't offer any specifics. Good luck!
-- Jim On Thu, Jul 17, 2008 at 4:41 PM, Elia Mazzawi <[EMAIL PROTECTED]> wrote: I have a program based on wordcount.java and I have files that are smaller than 64mb files (so i believe each file is one task ) does wordcount get the lines in order? or are they random? can i have hadoop return them in reverse order? Jim R. Wilson wrote: It sounds to me like you're talking about hadoop streaming (correct me if I'm wrong there). In that case, there's really no "order" to the lines being doled out as I understand it. Any given line could be handed to any given mapper task running on any given node. I may be wrong, of course, someone closer to the project could give you the right answer in that case. -- Jim R. Wilson (jimbojw) On Thu, Jul 17, 2008 at 4:06 PM, Elia Mazzawi <[EMAIL PROTECTED]> wrote: is there a way to have hadoop hand over the lines of a file backwards to my mapper ? as in give the last line first.
Anybody used AppNexus for hosting Hadoop app?
I discovered AppNexus yesterday. They offer hosting similar to Amazon EC2, with apparently more dedicated hardware and a better notion of where things are in the datacenter. Their web site says they are optimized for Hadoop applications. Anybody tried and could give some feedback? J.
Re: Name node heap space problem
Update on this one... I put some more memory in the machine running the name node. Now fsck is running. Unfortunately ls fails with a time-out. I identified one directory that causes the trouble. I can run fsck on it but not ls. What could be the problem? Gert Gert Pfeifer wrote: Hi, I am running a Hadoop DFS on a cluster of 5 data nodes with a name node and one secondary name node. I have 1788874 files and directories, 1465394 blocks = 3254268 total. Heap Size max is 3.47 GB. My problem is that I produce many small files. Therefore I have a cron job which just runs daily across the new files and copies them into bigger files and deletes the small files. Apart from this program, even a fsck kills the cluster. The problem is that, as soon as I start this program, the heap space of the name node reaches 100 %. What could be the problem? There are not many small files right now and still it doesn't work. I guess we have this problem since the upgrade to 0.17. Here is some additional data about the DFS: Capacity : 2 TB DFS Remaining : 1.19 TB DFS Used: 719.35 GB DFS Used% : 35.16 % Thanks for hints, Gert
Any way to order all the output folders?
Hi All, There are 30 output folders from Hadoop. Each folder is in ascending order, but the order is not ascending among folders; e.g. the values are 1, 5, 10 in folder A and 6, 8, 9 in folder B. My question is how to enforce the order among all the folders as well, e.g. output values 1, 5, 6 in folder A and 8, 9, 10 in folder B. I just started to learn Hadoop and hope you can help me. :) Thanks Shane
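One common approach to a total order across reduce outputs (assuming the 30 folders correspond to the job's reduce partitions) is a custom Partitioner that assigns contiguous key ranges to reducers, so partition i holds only keys smaller than those in partition i+1. A minimal sketch, assuming IntWritable keys with a known upper bound; in practice the range boundaries would come from sampling the data:

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class RangePartitioner implements Partitioner<IntWritable, Text> {
      private static final int MAX_KEY = 1000; // hypothetical upper bound on key values

      public void configure(JobConf job) { }

      // Split [0, MAX_KEY) into numPartitions contiguous ranges, smallest keys first.
      public int getPartition(IntWritable key, Text value, int numPartitions) {
          int p = (int) ((long) key.get() * numPartitions / MAX_KEY);
          return Math.min(Math.max(p, 0), numPartitions - 1);
      }
  }

Registered with conf.setPartitionerClass(RangePartitioner.class), each part file is then internally sorted by the framework and disjoint from the next, so their concatenation is globally ordered.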
Re: Hadoop and Ganglia Metrics
I applied the patch in the jira to my distro Joe Williams wrote: Thanks Jason, until this is implemented, how are you pulling stats from Hadoop? -joe [earlier quoted messages trimmed]
Re: Hadoop and Ganglia Metrics
Thanks Jason, until this is implemented, how are you pulling stats from Hadoop? -joe Jason Venner wrote: Check out https://issues.apache.org/jira/browse/HADOOP-3422 Joe Williams wrote: I have been attempting to get Hadoop metrics into Ganglia and have been unsuccessful thus far. I have seen this thread (http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200712.mbox/raw/[EMAIL PROTECTED]/) but it didn't help much. I have set up my properties file like so:

[EMAIL PROTECTED] current]# cat conf/hadoop-metrics.properties
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=127.0.0.1:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=127.0.0.1:8649

And if I 'telnet 127.0.0.1 8649' I receive the Ganglia XML metrics output without any hadoop specific metrics:

[EMAIL PROTECTED] current]# telnet 127.0.0.1 8649
Trying 127.0.0.1...
Connected to localhost (127.0.0.1).
Escape character is '^]'.
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
<!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>
<!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>
<!ATTLIST GANGLIA_XML SOURCE CDATA #REQUIRED>
--SNIP--

Is there more I need to do to get the metrics to show up in this output, am I doing something incorrectly? Do I need to have a gmetric script run in a cron to update the stats? If so, does anyone have a hadoop specific example of this? Any info would be helpful. Thanks. -Joe -- Name: Joseph A. Williams Email: [EMAIL PROTECTED]
Re: How to write one file per key as mapreduce output
On Tue, Jul 22, 2008 at 8:04 PM, Lincoln Ritter <[EMAIL PROTECTED]> wrote: > I have what I think is a pretty straight-forward, noobie question. I > would like to write one file per key in the reduce (or map) phase of a > mapreduce job. I have looked at the documentation for > FileOutputFormat and MultipleTextOutputFormat but am a bit unclear on > how to use it/them. Can anybody give me a quick pointer? Hi Lincoln, I do something like this to dump my records out, one per file, for debugging. This may not be "correct" because it writes the files as side-effects of the job, but hey, it works. It looks something like this:

public static class MyMap extends MapReduceBase
    implements Mapper<VIntWritable, Text, Text, Text> { // output types unused here; Text chosen for concreteness
  private JobConf conf;

  public void configure(JobConf conf) {
    this.conf = conf;
  }

  public void map(VIntWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Write each record as a side-effect file named after its key, under the
    // task's work output path (promoted to the job output directory on success).
    FileSystem fs = FileSystem.get(conf);
    Path workPath = FileOutputFormat.getWorkOutputPath(conf);
    Path filePath = new Path(workPath, key.toString());
    OutputStream out = fs.create(filePath);
    /* ... write value to out ... */
    out.close();
  }
}
Re: hadoop 0.17.1 reducer not fetching map output problem
Could you try to kill the tasktracker hosting the task the next time it happens? I just want to isolate the problem - whether it is a problem in the TT-JT communication or in the Task-TT communication. From your description it looks like the problem is in the JT-TT communication. But please run the experiment when it happens again and let us know what happens. Thanks, Devaraj

On 7/24/08 1:42 PM, "Andreas Kostyrka" <[EMAIL PROTECTED]> wrote:
> Hi!
>
> I'm experiencing hung reducers, with the following symptoms:
>
>> Task Logs: 'task_200807230647_0008_r_09_1'
>>
>> stdout logs
>>
>> stderr logs
>>
>> syslog logs
>>
>> red.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:11,064 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
>> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
>> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
>> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
>> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
>> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
>> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>
> Notice how it needs 6 map outputs, all map tasks have finished, and it still just hangs there.
Re: Using MapReduce to do table comparing.
Yes, I think this is the simplest method, but there are problems too:

1. The reduce stage wouldn't begin until the map stage ends, by which time we have already scanned both tables, and the comparing will take almost the same time, because about 90% of the intermediate keys will have two values. If I could specify a number n such that the reduce tasks start once there are n intermediate pairs with the same key, that would be better; in my case I would set the magic number to 2.

2. I am not sure how Hadoop stores intermediate pairs; we could not afford to keep them in memory as the data volume increases.

--
From: "James Moore" <[EMAIL PROTECTED]>
Sent: Thursday, July 24, 2008 1:12 AM
To: 
Subject: Re: Using MapReduce to do table comparing.

> On Wed, Jul 23, 2008 at 7:33 AM, Amber <[EMAIL PROTECTED]> wrote:
>> We have a 10 million row table exported from an AS400 mainframe every day. The table is exported as a csv text file, which is about 30GB in size; then the csv file is imported into a RDBMS table which is dropped and recreated every day. Now we want to find how many rows are updated during each export-import interval. The table has a primary key, so deletes and inserts can be found using RDBMS joins quickly, but we must do a column-to-column comparison in order to find the differences between rows (about 90%) with the same primary keys. Our goal is to find a comparing process which takes no more than 10 minutes with a 4-node cluster, each server in which has 4 4-core 3.0 GHz CPUs, 8GB memory and a 300G local RAID5 array.
>>
>> Below is our current solution:
>> The old data is kept in the RDBMS with an index created on the primary key; the new data is imported into HDFS as the input file of our Map-Reduce job. Every map task connects to the RDBMS database and selects old data from it for every row; map tasks will generate outputs if differences are found, and there are no reduce tasks.
>>
>> As you can see, with the number of concurrent map tasks increasing, the RDBMS database will become the bottleneck, so we want to kick out the RDBMS, but we have no idea how to retrieve the old row with a given key quickly from HDFS files. Any suggestion is welcome.
>
> Think of map/reduce as giving you a kind of key/value lookup for free - it just falls out of how the system works.
>
> You don't care about the RDBMS. It's a distraction - you're given a set of csv files with unique keys and dates, and you need to find the differences between them.
>
> Say the data looks like this:
>
> File for jul 10:
> 0x1,stuff
> 0x2,more stuff
>
> File for jul 11:
> 0x1,stuff
> 0x2,apples
> 0x3,parrot
>
> Preprocess the csv files to add dates to the values:
>
> File for jul 10:
> 0x1,20080710,stuff
> 0x2,20080710,more stuff
>
> File for jul 11:
> 0x1,20080711,stuff
> 0x2,20080711,apples
> 0x3,20080711,parrot
>
> Feed two days worth of these files into a hadoop job.
>
> The mapper splits these into k=0x1, v=20080710,stuff etc.
>
> The reducer gets one or two v's per key, and each v has the date embedded in it - that's essentially your lookup step.
>
> You'll end up with a system that can do compares for any two dates, and could easily be expanded to do all sorts of deltas across these files.
>
> The preprocess-the-files-to-add-a-date can probably be included as part of your mapper and isn't really a separate step - it just depends on how easy it is to use one of the off-the-shelf mappers with your data. If it turns out to be its own step, it can become a very simple hadoop job.
>
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
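A minimal sketch of the job James describes (assuming the 0.17-era mapred API and well-formed "key,date,value" input lines; the class names are illustrative, not code from this thread):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper: split "key,date,value" lines into (key, "date,value") pairs.
public class DiffMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        String s = line.toString();
        int comma = s.indexOf(',');
        out.collect(new Text(s.substring(0, comma)),
                    new Text(s.substring(comma + 1)));
    }
}

// Reducer: each key arrives with one or two "date,value" strings; strip
// the date prefix and compare the payloads.
public class DiffReduce extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        String first = values.next().toString();
        if (!values.hasNext()) {
            // Present on only one date: an insert or a delete, depending
            // on which date the surviving record carries.
            out.collect(key, new Text("ONLY:" + first));
        } else {
            String second = values.next().toString();
            String v1 = first.substring(first.indexOf(',') + 1);
            String v2 = second.substring(second.indexOf(',') + 1);
            if (!v1.equals(v2)) {
                out.collect(key, new Text("UPDATED"));
            }
        }
    }
}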
Re: Hadoop and Ganglia Metrics
Check out https://issues.apache.org/jira/browse/HADOOP-3422

Joe Williams wrote: I have been attempting to get Hadoop metrics in Ganglia and have been unsuccessful thus far. I have seen this thread (http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200712.mbox/raw/[EMAIL PROTECTED]/) but it didn't help much. I have set up my properties file like so:

[EMAIL PROTECTED] current]# cat conf/hadoop-metrics.properties
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=127.0.0.1:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=127.0.0.1:8649

And if I 'telnet 127.0.0.1 8649' I receive the Ganglia XML metrics output without any Hadoop-specific metrics:

[EMAIL PROTECTED] current]# telnet 127.0.0.1 8649
Trying 127.0.0.1...
Connected to localhost (127.0.0.1).
Escape character is '^]'.
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
<!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>
<!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>
<!ATTLIST GANGLIA_XML SOURCE CDATA #REQUIRED>
--SNIP--

Is there more I need to do to get the metrics to show up in this output, or am I doing something incorrectly? Do I need to have a gmetric script run in a cron job to update the stats? If so, does anyone have a Hadoop-specific example of this? Any info would be helpful. Thanks. -Joe
Re: Using MapReduce to do table comparing.
I agree that this is an acceptable method if the time spent exporting data from the RDBMS, importing the file into HDFS, and then importing the data into the RDBMS again is considered as well, but it is a single-process/single-thread method. BTW, can you tell me how long it takes your method to process those 130 million rows, how big the data volume is, and how powerful your physical machines are? Thanks a lot!

--
From: "Michael Lee" <[EMAIL PROTECTED]>
Sent: Thursday, July 24, 2008 11:51 AM
To: 
Subject: Re: Using MapReduce to do table comparing.

> Amber wrote:
>> We have a 10 million row table exported from an AS400 mainframe every day. The table is exported as a csv text file, which is about 30GB in size; then the csv file is imported into a RDBMS table which is dropped and recreated every day. Now we want to find how many rows are updated during each export-import interval. The table has a primary key, so deletes and inserts can be found using RDBMS joins quickly, but we must do a column-to-column comparison in order to find the differences between rows (about 90%) with the same primary keys. Our goal is to find a comparing process which takes no more than 10 minutes with a 4-node cluster, each server in which has 4 4-core 3.0 GHz CPUs, 8GB memory and a 300G local RAID5 array.
>>
>> Below is our current solution:
>> The old data is kept in the RDBMS with an index created on the primary key; the new data is imported into HDFS as the input file of our Map-Reduce job. Every map task connects to the RDBMS database and selects old data from it for every row; map tasks will generate outputs if differences are found, and there are no reduce tasks.
>>
>> As you can see, with the number of concurrent map tasks increasing, the RDBMS database will become the bottleneck, so we want to kick out the RDBMS, but we have no idea how to retrieve the old row with a given key quickly from HDFS files. Any suggestion is welcome.
>
> 10 million is not bad. I do this all the time in UDB 8.1 - multiple key columns and multiple value columns - and calculate deltas: insert, delete and update.
>
> What others have suggested works (I tried a very crude version of what James Moore suggested in Hadoop with 70+ million records) but you have to remember there are other costs (dumping out files, putting them into HDFS, etc.). It might be better to process straight in the database or do straight file processing. Also, the key is avoiding transactions.
>
> If you are doing it outside of the database...
>
> You have 'old.csv' and 'new.csv', both sorted by primary key (when you extract, make sure you do ORDER BY). In your application, you open two file handles and read one line at a time. Create the keys. If the keys are the same, you compare the two lines to see if they are the same. If the keys are not the same, you have to find out the natural order - it can be an insert or a delete. Once you decide, you read another line (for an insert/delete you only read one line, from one of the files).
>
> Here is the pseudo code:
>
> oldFile = File.new(oldFilename, "r")
> newFile = File.new(newFilename, "r")
> outFile = File.new(outFilename, "w")
>
> oldLine = oldFile.gets
> newLine = newFile.gets
>
> while ( true )
> {
>    oldKey = convertToKey(oldLine)
>    newKey = convertToKey(newLine)
>
>    if ( oldKey < newKey )
>    {
>      ## it is a deletion
>      outFile.puts oldLine + "," + "DELETE"
>      oldLine = oldFile.gets
>    }
>    elsif ( oldKey > newKey )
>    {
>      ## it is an insert
>      outFile.puts newLine + "," + "INSERT"
>      newLine = newFile.gets
>    }
>    else
>    {
>      ## compare
>      outFile.puts newLine + "," + "UPDATE" if ( oldLine != newLine )
>
>      oldLine = oldFile.gets
>      newLine = newFile.gets
>    }
> }
>
> Okay - I skipped the part where eof is reached for each file, but you get the point.
>
> If both old and new are in a database, you can open two database connections and just do the processing without dumping files.
>
> I journal about 130 million rows every day for a quant financial database...
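For completeness, a standalone sketch of the same sorted-merge with the skipped EOF handling filled in (hypothetical Java, assuming both extracts are sorted by primary key and the key is everything before the first comma):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

// Sorted-merge diff of two CSV extracts: old file, new file, output file.
public class CsvDiff {
    static String key(String line) {
        return line.substring(0, line.indexOf(','));
    }

    public static void main(String[] args) throws IOException {
        BufferedReader oldFile = new BufferedReader(new FileReader(args[0]));
        BufferedReader newFile = new BufferedReader(new FileReader(args[1]));
        PrintWriter out = new PrintWriter(args[2]);

        String oldLine = oldFile.readLine();
        String newLine = newFile.readLine();
        while (oldLine != null || newLine != null) {
            if (newLine == null) {         // old file has extra rows: deletes
                out.println(oldLine + ",DELETE");
                oldLine = oldFile.readLine();
            } else if (oldLine == null) {  // new file has extra rows: inserts
                out.println(newLine + ",INSERT");
                newLine = newFile.readLine();
            } else {
                int cmp = key(oldLine).compareTo(key(newLine));
                if (cmp < 0) {             // key vanished: a delete
                    out.println(oldLine + ",DELETE");
                    oldLine = oldFile.readLine();
                } else if (cmp > 0) {      // key appeared: an insert
                    out.println(newLine + ",INSERT");
                    newLine = newFile.readLine();
                } else {                   // same key: compare payloads
                    if (!oldLine.equals(newLine)) {
                        out.println(newLine + ",UPDATE");
                    }
                    oldLine = oldFile.readLine();
                    newLine = newFile.readLine();
                }
            }
        }
        oldFile.close();
        newFile.close();
        out.close();
    }
}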
RE: distcp skipping the file
Hi,

> The -update behavior is by design.

If I am right, -update is to overwrite the file at the destination if it is already there. But in this case it is overwriting the folder as a file at the destination, which seems to be a bug.

> Could you provide the command line, and the directory structure before
> and after issuing the copy? -C

The command is: hadoop distcp -update 'hftp://:50070/user//distcpsrc' distcp_dest

hadoop dfs -lsr distcpsrc
/user//distcpsrc/1        2008-07-24 05:53
/user//distcpsrc/1/t    4 2008-07-22 06:12

hadoop dfs -lsr distcp_dest
/user//distcp_dest/1    4 2008-07-24 06:03  << expected /user//distcp_dest/1/t; the file is copied as '1' instead of '1/t'

If I run without '-update', the destination dir is:

hadoop dfs -lsr distcp_dest_noupdate
/user//distcp_dest_noupdate/1    2008-07-24 06:08  << file 't' is not copied and '1' is a directory

Thanks, Murali

> On Jul 22, 2008, at 9:46 PM, Murali Krishna wrote:
> > Hi,
> > I am using 0.15.3 and the destination is empty. One more behavior that I am seeing is that if I pass the '-update' option, it writes the content of file '2' into folder 1 (it makes the folder '1' a file in the destination). So it looks like it is treating the destination for file distcpsrc/1/2 as distcpdest/1.
> >
> > Thanks,
> > Murali
> >
> >> -Original Message-
> >> From: Chris Douglas [mailto:[EMAIL PROTECTED]]
> >> Sent: Wednesday, July 23, 2008 1:13 AM
> >> To: core-user@hadoop.apache.org
> >> Subject: Re: distcp skipping the file
> >>
> >> There were many fixes and improvements to distcp in 0.16, but most of the critical fixes made it into 0.15.2 and 0.15.3. Is the destination empty? Anything already existing at the destination is skipped. -C
> >>
> >> On Jul 22, 2008, at 4:39 AM, Murali Krishna wrote:
> >>
> >>> Hi,
> >>>
> >>> My source folder has a single folder and a single file inside that.
> >>>
> >>> /user//distcpsrc/1/2    4 2008-07-22 04:22
> >>>
> >>> In the destination, it is creating the folder '1' but not the file '2'.
> >>>
> >>> The counters show 1 file has been skipped.
> >>>
> >>> 08/07/22 04:22:36 INFO mapred.JobClient: Files skipped=1
> >>>
> >>> If I create one more file in any directory under the distcpsrc folder, it copies both the files properly. Is this a bug?
> >>>
> >>> [I am using 15.3]
> >>>
> >>> Thanks,
> >>> Murali
hadoop 0.17.1 reducer not fetching map output problem
Hi!

I'm experiencing hung reducers, with the following symptoms:

> Task Logs: 'task_200807230647_0008_r_09_1'
>
> stdout logs
>
> stderr logs
>
> syslog logs
>
> red.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
> 2008-07-24 07:56:11,064 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2008-07-24 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2008-07-24 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2008-07-24 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2008-07-24 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2008-07-24 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2008-07-24 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)

Notice how it needs 6 map outputs, all map tasks have finished, and it still just hangs there. The second speculative copy of that reducer task needs 14 map outputs, with the same messages :(

Other observations: killing the reduce tasks via job -killtask ends up with the task being restarted on the same node, and curiously the new task gets jammed at the same position (6/14 maps needed). The only remedy to this problem seems to be a complete restart of the cluster and reprocessing. That gets really boring with jobs that took a day to process in the first place :(

Andreas