Re: running hadoop on heterogeneous hardware

2009-01-22 Thread Steve Loughran

Bill Au wrote:

Is Hadoop designed to run on homogeneous hardware only, or does it work just
as well on heterogeneous hardware?  If the datanodes have different disk
capacities, does HDFS still spread the data blocks equally among all the
datanodes, or will the datanodes with higher disk capacity end up storing
more data blocks?  Similarly, if the tasktrackers have different numbers of
CPUs, is there a way to configure Hadoop to run more tasks on the
tasktrackers that have more CPUs?  Is that simply a matter of setting
mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum differently on each tasktracker?

Bill



Life is simpler on homogeneous boxes. By setting the maximum tasks 
differently for the different machines, you can limit the amount of work 
that gets pushed out to each box. More troublesome are slower 
CPUs/HDDs: they aren't picked up directly, though speculative execution 
can handle some of this.
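
For example, a beefier tasktracker might carry higher per-node limits in 
its hadoop-site.xml (the values here are illustrative, not recommendations):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>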


One interesting bit of research would be something adaptive: something 
to monitor throughput and tune those values based on performance. That 
would detect variations in a cluster and work with it, rather than 
requiring you to know the capabilities of every machine.


-steve


Hadoop with many input/output files?

2009-01-22 Thread Zak, Richard [USA]
I see the MultiFileInputFormat and the MultipleOutputFormat
Input/Output formats for the Job configuration.  How can I use them
properly?  I had previously used the default Input and Output Format types,
which, for my PDF concatenation project, merely reduced Hadoop to a
scheduler.
 
The idea is, per directory, to concatenate all PDFs in that directory into
one PDF, and for this I'm using iText.
 
How can I use these Format types?  What would my input into the mapper be,
and what would my InputKeyValue and OutputKeyValue classes be?
Thank you!  I can't find documentation on these other than the Javadoc,
which doesn't help much.
 
Richard J. Zak


Re: Hadoop with many input/output files?

2009-01-22 Thread Mark Kerzner
I have a very similar question: how do I recursively list all files in a
given directory, so that all files are processed by MapReduce?  If I just
copy them to the output, let's say, is there any problem with dropping them
all in the same output directory in HDFS?  To use a bad example, Windows
chokes on many files in one directory.
Thank you,
Mark



Set the Order of the Keys in Reduce

2009-01-22 Thread Brian MacKay
Hello,

Any tips would be greatly appreciated.

Is there a way to set the order of the keys in reduce, as shown below,
regardless of the order in which they are collected in map?

Thanks, Brian

public void map(WritableComparable key, Text values,
                OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

    // collect many CAT_A and CAT_B in random order
    output.collect(CAT_A, details);
    output.collect(CAT_B, details);
}

public void reduce(Text key, Iterator<Text> values,
                   OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

    // always reduce CAT_A first, then reduce CAT_B
}




Re: Set the Order of the Keys in Reduce

2009-01-22 Thread Tom White
Hi Brian,

The CAT_A and CAT_B keys will be processed by different reducer
instances, so they run independently and may run in any order. What's
the output that you're trying to get?

Cheers,
Tom





Archive?

2009-01-22 Thread Mark Kerzner
Hi,
Is there an archive of the messages?  I am a newcomer, granted, but Google
Groups has all the discussion capabilities, and it has a searchable archive.
It seems strange to have just a mailing list.  Am I missing something?

Thank you,
Mark


Re: Archive?

2009-01-22 Thread Tom White
Hi Mark,

The archives are listed on http://wiki.apache.org/hadoop/MailingListArchives

Tom




RE: Set the Order of the Keys in Reduce

2009-01-22 Thread Brian MacKay
Hello Tom,

I'd like to apply some rules to CAT_A, then use the output of CAT_A to
reduce CAT_B.  I'd rather not run two jobs, so perhaps I need two
reducers?

The first reducer processes CAT_A; then, when it completes, the second
reducer does CAT_B?

I suppose this would accomplish the same thing?







Re: Set the Order of the Keys in Reduce

2009-01-22 Thread Tom White
Reducers run independently and without knowledge of one another, so
you can't get one reducer to depend on the output of another. I think
having two jobs is the simplest way to achieve what you're trying to
do.
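
For example, a driver along these lines would chain the two jobs (a rough
sketch against the old mapred API; the class names, paths, and the use of
the DistributedCache to ship job A's result are illustrative, not a
prescribed recipe):

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;

public class CatDriver {
  public static void main(String[] args) throws Exception {
    // Job A: reduce the CAT_A records first.
    JobConf jobA = new JobConf(CatDriver.class);
    jobA.setJobName("reduce CAT_A");
    // mapper/reducer classes and output types omitted for brevity
    FileInputFormat.setInputPaths(jobA, new Path("input"));
    FileOutputFormat.setOutputPath(jobA, new Path("catA-out"));
    JobClient.runJob(jobA);  // blocks until job A completes

    // Job B: reduce CAT_B, filtering against job A's output.
    JobConf jobB = new JobConf(CatDriver.class);
    jobB.setJobName("reduce CAT_B");
    FileInputFormat.setInputPaths(jobB, new Path("input"));
    FileOutputFormat.setOutputPath(jobB, new Path("catB-out"));
    // make job A's result available to the CAT_B reducers
    DistributedCache.addCacheFile(
        new Path("catA-out/part-00000").toUri(), jobB);
    JobClient.runJob(jobB);
  }
}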

Tom






Re: Set the Order of the Keys in Reduce

2009-01-22 Thread Owen O'Malley


On Jan 22, 2009, at 7:25 AM, Brian MacKay wrote:

Is there a way to set the order of the keys in reduce as shown below, no
matter what order the collection in MAP occurs in.


The keys to reduce are *always* sorted. If the default order is not  
correct, you can change the compare function.


As Tom points out, the critical thing is making sure that all of the  
keys that you need to group together go to the same reduce. So let's  
make it a little more concrete and say that you have:


public class TextPair implements Writable {
  public TextPair() {}
  public void set(String left, String right);
  public String getLeft();
  ...
}

And your map 0 does:
  key.set("CAT", "B");
  output.collect(key, value);
  key.set("DOG", "A");
  output.collect(key, value);

While map 1 does:
  key.set("CAT", "A");
  output.collect(key, value);
  key.set("DOG", "B");
  output.collect(key, value);

If you want to make sure that all of the cats go to the same reduce 
and that all of the dogs go to the same reduce, you need to set the 
partitioner. It would look like:

public class MyPartitioner<V> implements Partitioner<TextPair, V> {

  public void configure(JobConf job) {}

  public int getPartition(TextPair key, V value, int numReduceTasks) {
    return (key.getLeft().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

Then define a raw comparator that sorts based on both the left and  
right part of the TextPair, and you are set.
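
A minimal sketch of such a comparator (the simple deserializing form rather
than a true byte-level raw comparator, and assuming TextPair is made
WritableComparable with a getRight() accessor alongside getLeft()):

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class TextPairComparator extends WritableComparator {
  public TextPairComparator() {
    super(TextPair.class);
  }

  // sort on the left part first, then on the right part
  public int compare(WritableComparable a, WritableComparable b) {
    TextPair p1 = (TextPair) a;
    TextPair p2 = (TextPair) b;
    int cmp = p1.getLeft().compareTo(p2.getLeft());
    return cmp != 0 ? cmp : p1.getRight().compareTo(p2.getRight());
  }
}

You would register it with
conf.setOutputKeyComparatorClass(TextPairComparator.class); overriding the
byte-level compare as well would avoid deserializing keys during the sort.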


-- Owen


RE: Decommissioning Nodes

2009-01-22 Thread Rob Hamilton
I wasn't able to get decommissioning to work at all and found that just taking 
the node down got it out of the cluster. What version are you running and how 
are you initiating the decommissioning?

-Rob


Rob Hamilton - VP Network Operations 
P +1 (410) 379-2195 x 240   
E r...@lotame.com   
6085 Marshalee Drive, Suite 210   
Elkridge, MD 21075   


-Original Message-
From: Hargraves, Alyssa [mailto:aly...@wpi.edu] 
Sent: Wednesday, January 21, 2009 7:35 PM
To: core-user@hadoop.apache.org
Subject: Decommissioning Nodes

Hello Hadoop Users,

I was hoping someone would be able to answer a question about node 
decommissioning.  I have a test Hadoop cluster set up which only consists of my 
computer and a master node.  I am looking at the removal and addition of nodes. 
 Adding a node is nearly instant (only about 5 seconds), but removing a node by 
decommissioning it takes a while, and I don't understand why. Currently, the 
systems are running no map/reduce tasks and storing no data. DFS Health reports:

7 files and directories, 0 blocks = 7 total. Heap Size is 6.68 MB / 992.31 MB (0%)

Capacity        :  298.02 GB
DFS Remaining   :  245.79 GB
DFS Used        :  4 KB
DFS Used%       :  0 %
Live Nodes      :  2
Dead Nodes      :  0

Node    Last Contact  Admin State               Size (GB)  Used (%)  Remaining (GB)  Blocks
master  0             In Service                149.01     0         122.22          0
slave   82            Decommission In Progress  149.01     0         123.58          0

However, even with nothing stored and nothing running, the decommission process 
takes 3 to 5 minutes, and I'm not quite sure why. There isn't any data to move 
anywhere, and there aren't any jobs to worry about.  I am using 0.18.2.

Thank you for any help in solving this,
Alyssa Hargraves





RE: Set the Order of the Keys in Reduce

2009-01-22 Thread Brian MacKay
Owen, thanks for joining in.

I suppose what is needed is a new config setting called
SequenceReducer.  In it you would specify multiple reducer classes in
the order you would like them executed by the JobTracker.  When the map
phase completes, MyReducerA.class would run, and in it would be specified
the keys it should reduce, not all existing keys.  In Owen's example, this
could be CAT.  When all instances of MyReducerA complete reducing CAT, the
JobTracker would move on to the next reducer in the list.  MyReducerB could
then retrieve the values reduced down from CAT in HDFS as a filter to
reduce DOG.

List<Class> list = new ArrayList<Class>();

list.add(MyReducerA.class);  // reduces CAT
list.add(MyReducerB.class);  // reduces DOG

conf.setSequenceReducer(list);  // proposed API, does not exist today

I agree with the previous posts and appreciate everyone's insights and
participation.  What I proposed above is not simple.  But when one
considers the size of the job, running it twice doesn't make a lot of
sense.  Should one rerun a 40 GB job because the values reduced in
CAT are needed to filter the reduce of DOG?  A better way must exist!

Owen, maybe I misunderstood your message, but it seems that even with
the addition of a partitioner and raw comparator, Tom's point would still
prevent what I'm trying to do without something like what is suggested
above:

"you can't get one reducer to depend on the output of another."


Thanks, Brian




RE: Decommissioning Nodes

2009-01-22 Thread Hargraves, Alyssa
I was following the steps at http://wiki.apache.org/hadoop/FAQ#17 to do the 
decommission.  However, you have to be patient with it, since it seems to 
take a long time.  If it took 3-5 minutes with my nodes, which have no data 
and no jobs running, I can't imagine how long it would take for a real 
cluster.  One thing I had trouble with originally was that it doesn't seem 
to work if your replication is set to the same as your number of machines 
(since I was just testing things, I had replication set to 2 with 2 
machines, but that's not a good real-world example).
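
For reference, the FAQ's procedure amounts to pointing the namenode at an
excludes file and refreshing (the path below is illustrative):

<property>
  <name>dfs.hosts.exclude</name>
  <value>/path/to/excludes</value>
</property>

You then add the hostname of each node to retire to that file and run
hadoop dfsadmin -refreshNodes on the namenode to start the decommission.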

The problem I'm having, though (from Jeremy's reply earlier, it sounds like 
he misinterpreted it), isn't how long it takes for the node to go from 
decommissioned to being recognized by the master as dead.  Whether or not 
it's recognized as dead doesn't matter for what I'm doing.  The real 
problem is that going from the In Service state to the Decommissioned state 
takes forever.  Decommission In Progress lasts 3 to 5 minutes despite the 
fact that there aren't jobs or data on those nodes.  If anyone has any idea 
why that might be (I can see why it would take time if there were jobs or 
data, but not otherwise), please let me know.

- Alyssa





watch out: Hadoop and Linux kernel 2.6.27

2009-01-22 Thread Peter Romianowski

Hi,

we just came across a very serious problem with Hadoop (and any other 
NIO-intensive Java application) and kernel 2.6.27.


Short story:
Increase the epoll maximum-instances limit
(/proc/sys/fs/epoll/max_user_instances) to prevent "Too many open files"
errors, regardless of your ulimit -n settings.
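
For example (the value 4096 is illustrative; pick whatever suits your
workload):

# one-off, as root
echo 4096 > /proc/sys/fs/epoll/max_user_instances

# or persistently, via the matching key in /etc/sysctl.conf
fs.epoll.max_user_instances = 4096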


Long story:
http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/


I just wanted to drop this note since it took us 2 days to figure it 
out... :(


Regards
Peter



Re: Decommissioning Nodes

2009-01-22 Thread Kumar Pandey
Can you try setting the following in hadoop-site.xml at the name node and
see if the time comes down to around a minute?

<property>
  <name>heartbeat.recheck.interval</name>
  <value>1</value>
</property>

This effectively makes the namenode recheck datanode state (including
decommission progress) much more frequently.





-- 
Kumar Pandey
http://www.linkedin.com/in/kumarpandey


FileOutputFormat.getWorkOutputPath and map-to-reduce-only side-effect files

2009-01-22 Thread Craig Macdonald

Hello Hadoop Core,

I have a very brief question: our map tasks create side-effect files in 
the directory returned by FileOutputFormat.getWorkOutputPath().


This works fine for getting the side-effect files that can be 
accessed by the reducers.
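
For context, our map tasks do roughly the following (a sketch against the
0.18 mapred API; the helper and file names are illustrative):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class SideEffectHelper {
  // Open a side-effect file in the task's work output directory; files
  // written here are promoted to the job output directory only when the
  // task commits, which keeps speculative attempts from colliding.
  public static FSDataOutputStream openSideEffectFile(JobConf conf)
      throws IOException {
    Path workDir = FileOutputFormat.getWorkOutputPath(conf);
    Path side = new Path(workDir, "side-" + conf.get("mapred.task.id"));
    return side.getFileSystem(conf).create(side);
  }
}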


However, as these map-generated side-effect files are only of use to the 
reducers, it would be nice to have them deleted from the output 
directory. We can't delete them in a reducer.close(), as this 
would prevent them from being accessible to other reduce tasks (speculative 
or otherwise).


Any suggestions, short of deleting them after the job completes?

Craig


Re: using distcp for http source files

2009-01-22 Thread Doug Cutting

Aaron Kimball wrote:

Doesn't the WebDAV protocol use http for file transfer, and support reads /
writes / listings / etc?


Yes.  Getting a WebDAV-based FileSystem into Hadoop has long been a goal. 
It could replace libhdfs, since there are already WebDAV-based FUSE 
filesystems for Linux (wdfs, davfs2).  WebDAV is also mountable from 
Windows, etc.



Is anyone aware of an OSS WebDAV library that
could be wrapped in a FileSystem implementation?


Yes, Apache Slide does, but it's dead.  Apache Jackrabbit also does, and 
it is alive (http://jackrabbit.apache.org/).


Doug


Re: Distributed cache testing in local mode

2009-01-22 Thread Aaron Kimball
Hi Bhupesh,

I've noticed the same problem -- LocalJobRunner makes the DistributedCache
effectively not work, so my code often winds up with two codepaths to
retrieve the local data. :\
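
Something like this is what I mean (a sketch; the fallback path
config/lookup.dat is hypothetical):

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Called from configure(): prefer the copy localized by the cache on a
// real cluster, but fall back to reading the file in place when the cache
// returns nothing, as happens under LocalJobRunner.
private Path findLookupFile(JobConf conf) throws IOException {
  Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
  if (cacheFiles != null && cacheFiles.length > 0) {
    return cacheFiles[0];                  // cluster: localized copy
  }
  return new Path("config/lookup.dat");    // local mode: original path
}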

You could try running in pseudo-distributed mode to test, though then you
lose the ability to run a single-stepping debugger on the whole end-to-end
process.

- Aaron

On Thu, Jan 22, 2009 at 11:29 AM, Bhupesh Bansal bban...@linkedin.comwrote:

 Hey folks,

 I am trying to use the DistributedCache in Hadoop jobs to pass around
 configuration files, external jars (job specific), and some archive data.

 I want to test the job end-to-end in local mode, but I think the
 distributed caches are localized in TaskTracker code, which is not called
 in local mode through LocalJobRunner.

 I can do some fairly simple workarounds for this but was just wondering if
 folks have more ideas about it.

 Thanks
 Bhupesh