Re: How to add nodes to existing cluster?

2009-01-30 Thread Amandeep Khurana
Thanks Lohit


On Fri, Jan 30, 2009 at 7:13 PM, lohit  wrote:

> Just starting DataNode and TaskTracker would add it to cluster.
> http://wiki.apache.org/hadoop/FAQ#25
>
> Lohit
>
>
>
> - Original Message 
> From: Amandeep Khurana 
> To: core-user@hadoop.apache.org
> Sent: Friday, January 30, 2009 6:55:00 PM
> Subject: How to add nodes to existing cluster?
>
> I am trying to add nodes to an existing working cluster. Do I need to bring
> the entire cluster down, or would shutting down and restarting the namenode
> after adding the new machines to the slaves file be enough?
>
> Amandeep
>
>


Re: How to add nodes to existing cluster?

2009-01-30 Thread lohit
Just starting DataNode and TaskTracker would add it to cluster.
http://wiki.apache.org/hadoop/FAQ#25

Lohit



- Original Message 
From: Amandeep Khurana 
To: core-user@hadoop.apache.org
Sent: Friday, January 30, 2009 6:55:00 PM
Subject: How to add nodes to existing cluster?

I am trying to add nodes to an existing working cluster. Do I need to bring
the entire cluster down, or would shutting down and restarting the namenode
after adding the new machines to the slaves file be enough?

Amandeep
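
For reference, lohit's suggestion usually amounts to running the commands
below on the new machine (a sketch: it assumes the standard Hadoop tarball
layout and that the new node already carries the same configuration files as
the rest of the cluster; adjust the install path to your setup):

cd /path/to/hadoop                      # hypothetical install directory
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker

Adding the hostname to conf/slaves on the master only matters for the
start-all.sh/stop-all.sh convenience scripts; the running NameNode and
JobTracker pick the new node up as soon as its daemons report in.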



How to add nodes to existing cluster?

2009-01-30 Thread Amandeep Khurana
I am trying to add nodes to an existing working cluster. Do I need to bring
the entire cluster down, or would shutting down and restarting the namenode
after adding the new machines to the slaves file be enough?

Amandeep


Acquiring Node Info / Statistics

2009-01-30 Thread Crane, Patrick
Hello, 
 
  If I wanted to acquire information on a node (let's say Datanode)
similar to what is displayed on the dfshealth.jsp status page, how would
I go about doing so?  If I try to mimic the functions that it uses,
problems arise when initializing the classes which are mostly
"protected".  It seems like it should be a simpler matter to access this
kind of information other than taking it from the JSP page; am I just
missing something very basic here?
 
Thanks in advance,
 
Patrick
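
If shell access is acceptable, a simpler route than scraping dfshealth.jsp is
the dfsadmin report, which prints per-datanode capacity, used, remaining and
last-contact figures similar to what the status page shows (a sketch; it
assumes a client configured against the cluster):

bin/hadoop dfsadmin -report

Parsing that output is cruder than a real API, but it sidesteps the protected
constructors entirely.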


Re: settin JAVA_HOME...

2009-01-30 Thread Sandy
Hi Zander,

Do not use the default-jdk package. Horrific things happen. You must use Sun
Java in order to use Hadoop.

There are packages for sun java on the ubuntu repository. You can download
these directly using apt-get. This will install java 6 on your system.
Your JAVA_HOME line in hadoop-env.sh should look like:
export JAVA_HOME=/usr/lib/jvm/java-6-sun

Also, on the wiki, there is a guide for installing hadoop on ubuntu systems.
I think you may find this helpful.
http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)

All the best!
-SM


On Fri, Jan 30, 2009 at 4:33 PM, zander1013  wrote:

>
> i am installing "default-jdk" now. perhaps that was the problem. is this
> the
> right jdk?
>
>
>
> zander1013 wrote:
> >
> > cool!
> >
> > here is the output for those commands...
> >
> > a...@node0:~/Hadoop/hadoop-0.19.0$ which java
> > /usr/bin/java
> > a...@node0:~/Hadoop/hadoop-0.19.0$
> > a...@node0:~/Hadoop/hadoop-0.19.0$ ls -l /usr/bin/java
> > lrwxrwxrwx 1 root root 22 2009-01-29 18:03 /usr/bin/java ->
> > /etc/alternatives/java
> > a...@node0:~/Hadoop/hadoop-0.19.0$
> >
> > ... i will try and set JAVA_HOME=/etc/alternatives/java...
> >
> > thank you for helping...
> >
> > -zander
> >
> >
> > Mark Kerzner-2 wrote:
> >>
> >> Oh, you have used my path to JDK, you need yours
> >> do this
> >>
> >> which java
> >> something like /usr/bin/java will come back
> >>
> >> then do
> >> ls -l /usr/bin/java
> >>
> >> it will tell you where the link is to. There may be more redirections,
> >> get
> >> the real path to your JDK
> >>
> >> On Fri, Jan 30, 2009 at 4:09 PM, zander1013 
> wrote:
> >>
> >>>
> >>> okay,
> >>>
> >>> here is the section for conf/hadoop-env.sh...
> >>>
> >>> # Set Hadoop-specific environment variables here.
> >>>
> >>> # The only required environment variable is JAVA_HOME.  All others are
> >>> # optional.  When running a distributed configuration it is best to
> >>> # set JAVA_HOME in this file, so that it is correctly defined on
> >>> # remote nodes.
> >>>
> >>> # The java implementation to use.  Required.
> >>> # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
> >>> export JAVA_HOME=/usr/lib/jvm/default-java
> >>>
> >>> ...
> >>>
> >>> and here is what i got for output. i am trying to go through the
> >>> tutorial
> >>> at
> >>> http://hadoop.apache.org/core/docs/current/quickstart.html
> >>>
> >>> here is the output...
> >>>
> >>> a...@node0:~/Hadoop/hadoop-0.19.0$ bin/hadoop jar hadoop-*-examples.jar
> >>> grep
> >>> input output 'dfs[a-z.]+'
> >>> bin/hadoop: line 243: /usr/lib/jvm/default-java/bin/java: No such file
> >>> or
> >>> directory
> >>> bin/hadoop: line 273: /usr/lib/jvm/default-java/bin/java: No such file
> >>> or
> >>> directory
> >>> bin/hadoop: line 273: exec: /usr/lib/jvm/default-java/bin/java: cannot
> >>> execute: No such file or directory
> >>> a...@node0:~/Hadoop/hadoop-0.19.0$
> >>>
> >>> ...
> >>>
> >>> please advise...
> >>>
> >>>
> >>>
> >>>
> >>> Mark Kerzner-2 wrote:
> >>> >
> >>> > You set it in the conf/hadoop-env.sh file, with an entry like this
> >>> > export JAVA_HOME=/usr/lib/jvm/default-java
> >>> >
> >>> > Mark
> >>> >
> >>> > On Fri, Jan 30, 2009 at 3:49 PM, zander1013 
> >>> wrote:
> >>> >
> >>> >>
> >>> >> hi,
> >>> >>
> >>> >> i am new to hadoop. i am trying to set it up for the first time as a
> >>> >> single
> >>> >> node cluster. at present the snag is that i cannot seem to find the
> >>> >> correct
> >>> >> path for setting the JAVA_HOME variable.
> >>> >>
> >>> >> i am using ubuntu 8.10. i have tried using "whereis java" and tried
> >>> >> setting
> >>> >> the variable to point to those places (except the dir where i have
> >>> >> hadoop).
> >>> >>
> >>> >> please advise.
> >>> >>
> >>> >> -zander
> >>> >> --
> >>> >> View this message in context:
> >>> >> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756240.html
> >>> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>> >>
> >>> >>
> >>> >
> >>> >
> >>>
> >>> --
> >>> View this message in context:
> >>> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756569.html
> >>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>>
> >>>
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756916.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
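
For completeness, the route Sandy describes comes down to something like the
following on Ubuntu 8.10 (a sketch; the sun-java6 packages live in the
multiverse repository, which may need to be enabled first):

sudo apt-get install sun-java6-jdk

# then in conf/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-6-sun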


Re: extra documentation on how to write your own partitioner class

2009-01-30 Thread Sandy
Hi James,

Thank you very much! :-)

-SM

On Fri, Jan 30, 2009 at 4:17 PM, james warren  wrote:

> Hello Sandy -
> Your partitioner isn't using any information from the key/value pair - it's
> only using the value T which is read once from the job configuration.
>  getPartition() will always return the same value, so all of your data is
> being sent to one reducer. :P
>
> cheers,
> -James
>
> On Fri, Jan 30, 2009 at 1:32 PM, Sandy  wrote:
>
> > Hello,
> >
> > Could someone point me toward some more documentation on how to write
> > one's own partitioner class? I am having quite a bit of trouble getting
> > mine to work. So far, it looks something like this:
> >
> > public class myPartitioner extends MapReduceBase implements
> > Partitioner<IntWritable, IntWritable> {
> >
> >private int T;
> >
> >public void configure(JobConf job) {
> >super.configure(job);
> >String myT = job.get("tval");//this is user defined
> >T = Integer.parseInt(myT);
> >}
> >
> >public int getPartition(IntWritable key, IntWritable value, int
> > numReduceTasks) {
> >int newT = (T/numReduceTasks);
> >int id = (value.get() / T);
> >return (int)(id/newT);
> >}
> > }
> >
> > In the run() function of my M/R program I just set it using:
> >
> > conf.setPartitionerClass(myPartitioner.class);
> >
> > Is there anything else I need to set in the run() function?
> >
> >
> > The code compiles fine. When I run it, I know it is "using" the
> > partitioner,
> > since I get different output than if I just let it use HashPartitioner.
> > However, it is not splitting between the reducers at all! If I set the
> > number of reducers to 2, all the output shows up in part-0, while
> > part-1 has nothing.
> >
> > I am having trouble debugging this since I don't know how I can observe
> the
> > values of numReduceTasks (which I assume is being set by the system). Is
> > this a proper assumption?
> >
> > If I try to insert any println() statements in the function, it isn't
> > outputted to either my terminal or my log files. Could someone give me
> some
> > general advice on how best to debug pieces of code like this?
> >
>
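
For anyone hitting the same wall, below is a minimal sketch of a partitioner
that does spread records across reducers, written against the old mapred API
used in this thread. The class name is illustrative (not Sandy's code); the
essential points are that getPartition() must return a value in the range
[0, numReduceTasks) and that it should normally be derived from the key or
value of each record, not from configuration alone.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Partitioner;

// Illustrative partitioner: records whose key falls in [0,T) go to one
// bucket, [T,2T) to the next, and so on; the modulo keeps the result
// inside the valid partition range whatever T and the keys happen to be.
public class RangePartitioner extends MapReduceBase
    implements Partitioner<IntWritable, IntWritable> {

  private int T = 1;

  public void configure(JobConf job) {
    super.configure(job);
    // "tval" is the user-defined property from the thread; default to 1
    T = Integer.parseInt(job.get("tval", "1"));
  }

  public int getPartition(IntWritable key, IntWritable value,
                          int numReduceTasks) {
    int bucket = key.get() / T;               // derived from the record itself
    return Math.abs(bucket) % numReduceTasks; // always in [0, numReduceTasks)
  }
}

In the posted code, both divisions are integer divisions, so depending on T
and the range of the values every record can end up mapping to partition 0,
which matches the "everything in one part file" symptom.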


Re: job management in Hadoop

2009-01-30 Thread Bhupesh Bansal
Bill, 

Currently you can kill the job from the UI.
You have to enable the config in hadoop-default.xml

  webinterface.private.actions to be true

Best
Bhupesh


On 1/30/09 3:23 PM, "Bill Au"  wrote:

> Thanks.
> 
> Does anyone know if there is a plan to add this functionality to the web UI,
> like job priority, which can be changed from both the command line and the web UI?
> 
> Bill
> 
> On Fri, Jan 30, 2009 at 5:54 PM, Arun C Murthy  wrote:
> 
>> 
>> On Jan 30, 2009, at 2:41 PM, Bill Au wrote:
>> 
>>  Is there any way to cancel a job after it has been submitted?
>>> 
>>> 
>> bin/hadoop job -kill <jobid>
>> 
>> Arun
>> 
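
For reference, the switch Bhupesh mentions is an ordinary property override,
so the sketch below is what the entry in conf/hadoop-site.xml would look like
(the JobTracker generally needs a restart to pick it up):

<property>
  <name>webinterface.private.actions</name>
  <value>true</value>
</property>

With that set, the job pages in the web UI gain kill links for running jobs
and tasks.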



Re: job management in Hadoop

2009-01-30 Thread Bill Au
Thanks.

Does anyone know if there is a plan to add this functionality to the web UI,
like job priority, which can be changed from both the command line and the web UI?

Bill

On Fri, Jan 30, 2009 at 5:54 PM, Arun C Murthy  wrote:

>
> On Jan 30, 2009, at 2:41 PM, Bill Au wrote:
>
>  Is there any way to cancel a job after it has been submitted?
>>
>>
> bin/hadoop job -kill <jobid>
>
> Arun
>


Re: Setting up cluster

2009-01-30 Thread Amandeep Khurana
Here's the log from the datanode:

2009-01-30 14:54:18,019 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: rndpc1/171.69.102.51:9000. Already tried 8 time(s).
2009-01-30 14:54:19,022 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: rndpc1/171.69.102.51:9000. Already tried 9 time(s).
2009-01-30 14:54:19,026 ERROR org.apache.hadoop.dfs.DataNode:
java.io.IOException: Call failed on local exception
at org.apache.hadoop.ipc.Client.call(Client.java:718)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.dfs.$Proxy4.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343)
at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:288)
at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:277)
at org.apache.hadoop.dfs.DataNode.<init>(DataNode.java:223)
at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:3031)
at
org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:2986)
at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:2994)
at org.apache.hadoop.dfs.DataNode.main(DataNode.java:3116)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at sun.nio.ch.SocketAdaptor.connect(Unknown Source)
at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:300)
at
org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:177)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:789)
at org.apache.hadoop.ipc.Client.call(Client.java:704)
... 12 more

What do I need to do for this?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Jan 30, 2009 at 2:49 PM, Amandeep Khurana  wrote:

> Hi,
>
> I am a new user and was setting up the HDFS on 3 nodes as of now. I could
> get them to run individual pseudo distributed setups but am unable to get
> the cluster going together. The site localhost:50070 shows me that there are
> no datanodes.
>
> I kept the same hadoop-site.xml as the pseudodistributed setup on the
> master node and added the slaves to the list of slaves in the conf
> directory. Thereafter, I ran the start-dfs.sh and start-mapred.sh scripts.
>
> Am I missing something out?
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
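
A likely culprit for this pattern (repeated "Retrying connect" and then
Connection refused against the master on port 9000) is that the copied
pseudo-distributed hadoop-site.xml still points fs.default.name at localhost,
so the NameNode listens only on the loopback interface and the slaves cannot
reach it. A sketch of the relevant entries, assuming rndpc1 is the master
host from the log above and keeping the commonly used example ports:

<property>
  <name>fs.default.name</name>
  <value>hdfs://rndpc1:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>rndpc1:9001</value>
</property>

The same values need to be present on the master and on every slave, followed
by a restart of the daemons (and a check that no firewall blocks the ports).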


Re: job management in Hadoop

2009-01-30 Thread Arun C Murthy


On Jan 30, 2009, at 2:41 PM, Bill Au wrote:


Is there any way to cancel a job after it has been submitted?



bin/hadoop job -kill <jobid>

Arun
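
The id to pass to -kill comes from the job client as well; a short sketch,
where the job id shown is only a placeholder:

bin/hadoop job -list                          # prints the ids of running jobs
bin/hadoop job -kill job_200901301017_0001    # kill one of them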


Setting up cluster

2009-01-30 Thread Amandeep Khurana
Hi,

I am a new user and was setting up the HDFS on 3 nodes as of now. I could
get them to run individual pseudo distributed setups but am unable to get
the cluster going together. The site localhost:50070 shows me that there are
no datanodes.

I kept the same hadoop-site.xml as the pseudodistributed setup on the master
node and added the slaves to the list of slaves in the conf directory.
Thereafter, I ran the start-dfs.sh and start-mapred.sh scripts.

Am I missing something out?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


job management in Hadoop

2009-01-30 Thread Bill Au
Is there any way to cancel a job after it has been submitted?

Bill


Re: settin JAVA_HOME...

2009-01-30 Thread zander1013

i am installing "default-jdk" now. perhaps that was the problem. is this the
right jdk?



zander1013 wrote:
> 
> cool!
> 
> here is the output for those commands...
> 
> a...@node0:~/Hadoop/hadoop-0.19.0$ which java
> /usr/bin/java
> a...@node0:~/Hadoop/hadoop-0.19.0$ 
> a...@node0:~/Hadoop/hadoop-0.19.0$ ls -l /usr/bin/java 
> lrwxrwxrwx 1 root root 22 2009-01-29 18:03 /usr/bin/java ->
> /etc/alternatives/java
> a...@node0:~/Hadoop/hadoop-0.19.0$ 
> 
> ... i will try and set JAVA_HOME=/etc/alternatives/java...
> 
> thank you for helping...
> 
> -zander
> 
> 
> Mark Kerzner-2 wrote:
>> 
>> Oh, you have used my path to JDK, you need yours
>> do this
>> 
>> which java
>> something like /usr/bin/java will come back
>> 
>> then do
>> ls -l /usr/bin/java
>> 
>> it will tell you where the link is to. There may be more redirections,
>> get
>> the real path to your JDK
>> 
>> On Fri, Jan 30, 2009 at 4:09 PM, zander1013  wrote:
>> 
>>>
>>> okay,
>>>
>>> here is the section for conf/hadoop-env.sh...
>>>
>>> # Set Hadoop-specific environment variables here.
>>>
>>> # The only required environment variable is JAVA_HOME.  All others are
>>> # optional.  When running a distributed configuration it is best to
>>> # set JAVA_HOME in this file, so that it is correctly defined on
>>> # remote nodes.
>>>
>>> # The java implementation to use.  Required.
>>> # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
>>> export JAVA_HOME=/usr/lib/jvm/default-java
>>>
>>> ...
>>>
>>> and here is what i got for output. i am trying to go through the
>>> tutorial
>>> at
>>> http://hadoop.apache.org/core/docs/current/quickstart.html
>>>
>>> here is the output...
>>>
>>> a...@node0:~/Hadoop/hadoop-0.19.0$ bin/hadoop jar hadoop-*-examples.jar
>>> grep
>>> input output 'dfs[a-z.]+'
>>> bin/hadoop: line 243: /usr/lib/jvm/default-java/bin/java: No such file
>>> or
>>> directory
>>> bin/hadoop: line 273: /usr/lib/jvm/default-java/bin/java: No such file
>>> or
>>> directory
>>> bin/hadoop: line 273: exec: /usr/lib/jvm/default-java/bin/java: cannot
>>> execute: No such file or directory
>>> a...@node0:~/Hadoop/hadoop-0.19.0$
>>>
>>> ...
>>>
>>> please advise...
>>>
>>>
>>>
>>>
>>> Mark Kerzner-2 wrote:
>>> >
>>> > You set it in the conf/hadoop-env.sh file, with an entry like this
>>> > export JAVA_HOME=/usr/lib/jvm/default-java
>>> >
>>> > Mark
>>> >
>>> > On Fri, Jan 30, 2009 at 3:49 PM, zander1013 
>>> wrote:
>>> >
>>> >>
>>> >> hi,
>>> >>
>>> >> i am new to hadoop. i am trying to set it up for the first time as a
>>> >> single
>>> >> node cluster. at present the snag is that i cannot seem to find the
>>> >> correct
>>> >> path for setting the JAVA_HOME variable.
>>> >>
>>> >> i am using ubuntu 8.10. i have tried using "whereis java" and tried
>>> >> setting
>>> >> the variable to point to those places (except the dir where i have
>>> >> hadoop).
>>> >>
>>> >> please advise.
>>> >>
>>> >> -zander
>>> >> --
>>> >> View this message in context:
>>> >> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756240.html
>>> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>> >>
>>> >>
>>> >
>>> >
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756569.html
>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>>
>>>
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756916.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: settin JAVA_HOME...

2009-01-30 Thread zander1013

yes i am trying to locate java's location on  my machine... i installed
sun-java6-jre using the synaptic package manager...

here is the output from the tutorial after i tried to locate java using
"which java" "ls..." etc and put that into the .sh file...

a...@node0:~/Hadoop/hadoop-0.19.0$ bin/hadoop jar hadoop-*-examples.jar grep
input output 'dfs[a-z.]+'
bin/hadoop: line 243: /etc/alternatives/java/bin/java: Not a directory
bin/hadoop: line 273: /etc/alternatives/java/bin/java: Not a directory
bin/hadoop: line 273: exec: /etc/alternatives/java/bin/java: cannot execute:
Not a directory
a...@node0:~/Hadoop/hadoop-0.19.0$ 

please advise.


Bill Au wrote:
> 
> You actually have to set JAVA_HOME to where Java is actually installed on
> your system.  "/usr/lib/jvm/default-java" is just an example.  The error
> messages indicate that that's not where Java is installed on your system.
> 
> Bill
> 
> On Fri, Jan 30, 2009 at 5:09 PM, zander1013  wrote:
> 
>>
>> okay,
>>
>> here is the section for conf/hadoop-env.sh...
>>
>> # Set Hadoop-specific environment variables here.
>>
>> # The only required environment variable is JAVA_HOME.  All others are
>> # optional.  When running a distributed configuration it is best to
>> # set JAVA_HOME in this file, so that it is correctly defined on
>> # remote nodes.
>>
>> # The java implementation to use.  Required.
>> # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
>> export JAVA_HOME=/usr/lib/jvm/default-java
>>
>> ...
>>
>> and here is what i got for output. i am trying to go through the tutorial
>> at
>> http://hadoop.apache.org/core/docs/current/quickstart.html
>>
>> here is the output...
>>
>> a...@node0:~/Hadoop/hadoop-0.19.0$ bin/hadoop jar hadoop-*-examples.jar
>> grep
>> input output 'dfs[a-z.]+'
>> bin/hadoop: line 243: /usr/lib/jvm/default-java/bin/java: No such file or
>> directory
>> bin/hadoop: line 273: /usr/lib/jvm/default-java/bin/java: No such file or
>> directory
>> bin/hadoop: line 273: exec: /usr/lib/jvm/default-java/bin/java: cannot
>> execute: No such file or directory
>> a...@node0:~/Hadoop/hadoop-0.19.0$
>>
>> ...
>>
>> please advise...
>>
>>
>>
>>
>> Mark Kerzner-2 wrote:
>> >
>> > You set it in the conf/hadoop-env.sh file, with an entry like this
>> > export JAVA_HOME=/usr/lib/jvm/default-java
>> >
>> > Mark
>> >
>> > On Fri, Jan 30, 2009 at 3:49 PM, zander1013 
>> wrote:
>> >
>> >>
>> >> hi,
>> >>
>> >> i am new to hadoop. i am trying to set it up for the first time as a
>> >> single
>> >> node cluster. at present the snag is that i cannot seem to find the
>> >> correct
>> >> path for setting the JAVA_HOME variable.
>> >>
>> >> i am using ubuntu 8.10. i have tried using "whereis java" and tried
>> >> setting
>> >> the variable to point to those places (except the dir where i have
>> >> hadoop).
>> >>
>> >> please advise.
>> >>
>> >> -zander
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756240.html
>> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>> >>
>> >>
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756569.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756798.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
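
The "Not a directory" errors are the clue: /etc/alternatives/java is the java
binary itself (via a further symlink), not a JDK directory, so appending
bin/java to it cannot work. JAVA_HOME has to be a directory that contains
bin/java. A sketch of how to find it on Ubuntu - the java-6-sun path is
typical for the sun-java6 packages, but verify against what readlink prints
on your machine:

readlink -f /etc/alternatives/java
# prints something like /usr/lib/jvm/java-6-sun-1.6.0.10/jre/bin/java

# then, in conf/hadoop-env.sh, point at a directory holding bin/java, e.g.:
export JAVA_HOME=/usr/lib/jvm/java-6-sun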



Re: settin JAVA_HOME...

2009-01-30 Thread zander1013

cool!

here is the output for those commands...

a...@node0:~/Hadoop/hadoop-0.19.0$ which java
/usr/bin/java
a...@node0:~/Hadoop/hadoop-0.19.0$ 
a...@node0:~/Hadoop/hadoop-0.19.0$ ls -l /usr/bin/java 
lrwxrwxrwx 1 root root 22 2009-01-29 18:03 /usr/bin/java ->
/etc/alternatives/java
a...@node0:~/Hadoop/hadoop-0.19.0$ 

... i will try and set JAVA_HOME=/etc/alternatives/java...

thank you for helping...

-zander


Mark Kerzner-2 wrote:
> 
> Oh, you have used my path to JDK, you need yours
> do this
> 
> which java
> something like /usr/bin/java will come back
> 
> then do
> ls -l /usr/bin/java
> 
> it will tell you where the link is to. There may be more redirections, get
> the real path to your JDK
> 
> On Fri, Jan 30, 2009 at 4:09 PM, zander1013  wrote:
> 
>>
>> okay,
>>
>> here is the section for conf/hadoop-env.sh...
>>
>> # Set Hadoop-specific environment variables here.
>>
>> # The only required environment variable is JAVA_HOME.  All others are
>> # optional.  When running a distributed configuration it is best to
>> # set JAVA_HOME in this file, so that it is correctly defined on
>> # remote nodes.
>>
>> # The java implementation to use.  Required.
>> # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
>> export JAVA_HOME=/usr/lib/jvm/default-java
>>
>> ...
>>
>> and here is what i got for output. i am trying to go through the tutorial
>> at
>> http://hadoop.apache.org/core/docs/current/quickstart.html
>>
>> here is the output...
>>
>> a...@node0:~/Hadoop/hadoop-0.19.0$ bin/hadoop jar hadoop-*-examples.jar
>> grep
>> input output 'dfs[a-z.]+'
>> bin/hadoop: line 243: /usr/lib/jvm/default-java/bin/java: No such file or
>> directory
>> bin/hadoop: line 273: /usr/lib/jvm/default-java/bin/java: No such file or
>> directory
>> bin/hadoop: line 273: exec: /usr/lib/jvm/default-java/bin/java: cannot
>> execute: No such file or directory
>> a...@node0:~/Hadoop/hadoop-0.19.0$
>>
>> ...
>>
>> please advise...
>>
>>
>>
>>
>> Mark Kerzner-2 wrote:
>> >
>> > You set it in the conf/hadoop-env.sh file, with an entry like this
>> > export JAVA_HOME=/usr/lib/jvm/default-java
>> >
>> > Mark
>> >
>> > On Fri, Jan 30, 2009 at 3:49 PM, zander1013 
>> wrote:
>> >
>> >>
>> >> hi,
>> >>
>> >> i am new to hadoop. i am trying to set it up for the first time as a
>> >> single
>> >> node cluster. at present the snag is that i cannot seem to find the
>> >> correct
>> >> path for setting the JAVA_HOME variable.
>> >>
>> >> i am using ubuntu 8.10. i have tried using "whereis java" and tried
>> >> setting
>> >> the variable to point to those places (except the dir where i have
>> >> hadoop).
>> >>
>> >> please advise.
>> >>
>> >> -zander
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756240.html
>> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>> >>
>> >>
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756569.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756710.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: settin JAVA_HOME...

2009-01-30 Thread Bill Au
You actually have to set JAVA_HOME to where Java is actually installed on
your system.  "/usr/lib/jvm/default-java" is just an example.  The error
messages indicate that that's not where Java is installed on your system.

Bill

On Fri, Jan 30, 2009 at 5:09 PM, zander1013  wrote:

>
> okay,
>
> here is the section for conf/hadoop-env.sh...
>
> # Set Hadoop-specific environment variables here.
>
> # The only required environment variable is JAVA_HOME.  All others are
> # optional.  When running a distributed configuration it is best to
> # set JAVA_HOME in this file, so that it is correctly defined on
> # remote nodes.
>
> # The java implementation to use.  Required.
> # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
> export JAVA_HOME=/usr/lib/jvm/default-java
>
> ...
>
> and here is what i got for output. i am trying to go through the tutorial
> at
> http://hadoop.apache.org/core/docs/current/quickstart.html
>
> here is the output...
>
> a...@node0:~/Hadoop/hadoop-0.19.0$ bin/hadoop jar hadoop-*-examples.jar grep
> input output 'dfs[a-z.]+'
> bin/hadoop: line 243: /usr/lib/jvm/default-java/bin/java: No such file or
> directory
> bin/hadoop: line 273: /usr/lib/jvm/default-java/bin/java: No such file or
> directory
> bin/hadoop: line 273: exec: /usr/lib/jvm/default-java/bin/java: cannot
> execute: No such file or directory
> a...@node0:~/Hadoop/hadoop-0.19.0$
>
> ...
>
> please advise...
>
>
>
>
> Mark Kerzner-2 wrote:
> >
> > You set it in the conf/hadoop-env.sh file, with an entry like this
> > export JAVA_HOME=/usr/lib/jvm/default-java
> >
> > Mark
> >
> > On Fri, Jan 30, 2009 at 3:49 PM, zander1013 
> wrote:
> >
> >>
> >> hi,
> >>
> >> i am new to hadoop. i am trying to set it up for the first time as a
> >> single
> >> node cluster. at present the snag is that i cannot seem to find the
> >> correct
> >> path for setting the JAVA_HOME variable.
> >>
> >> i am using ubuntu 8.10. i have tried using "whereis java" and tried
> >> setting
> >> the variable to point to those places (except the dir where i have
> >> hadoop).
> >>
> >> please advise.
> >>
> >> -zander
> >> --
> >> View this message in context:
> >> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756240.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756569.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: extra documentation on how to write your own partitioner class

2009-01-30 Thread james warren
Hello Sandy -
Your partitioner isn't using any information from the key/value pair - it's
only using the value T which is read once from the job configuration.
 getPartition() will always return the same value, so all of your data is
being sent to one reducer. :P

cheers,
-James

On Fri, Jan 30, 2009 at 1:32 PM, Sandy  wrote:

> Hello,
>
> Could someone point me toward some more documentation on how to write one's
> own partitioner class? I am having quite a bit of trouble getting mine to
> work. So far, it looks something like this:
>
> public class myPartitioner extends MapReduceBase implements
> Partitioner<IntWritable, IntWritable> {
>
>private int T;
>
>public void configure(JobConf job) {
>super.configure(job);
>String myT = job.get("tval");//this is user defined
>T = Integer.parseInt(myT);
>}
>
>public int getPartition(IntWritable key, IntWritable value, int
> numReduceTasks) {
>int newT = (T/numReduceTasks);
>int id = (value.get() / T);
>return (int)(id/newT);
>}
> }
>
> In the run() function of my M/R program I just set it using:
>
> conf.setPartitionerClass(myPartitioner.class);
>
> Is there anything else I need to set in the run() function?
>
>
> The code compiles fine. When I run it, I know it is "using" the
> partitioner,
> since I get different output than if I just let it use HashPartitioner.
> However, it is not splitting between the reducers at all! If I set the
> number of reducers to 2, all the output shows up in part-0, while
> part-1 has nothing.
>
> I am having trouble debugging this since I don't know how I can observe the
> values of numReduceTasks (which I assume is being set by the system). Is
> this a proper assumption?
>
> If I try to insert any println() statements in the function, it isn't
> outputted to either my terminal or my log files. Could someone give me some
> general advice on how best to debug pieces of code like this?
>


Re: settin JAVA_HOME...

2009-01-30 Thread Mark Kerzner
Oh, you have used my path to JDK, you need yours
do this

which java
something like /usr/bin/java will come back

then do
ls -l /usr/bin/java

it will tell you where the link is to. There may be more redirections, get
the real path to your JDK

On Fri, Jan 30, 2009 at 4:09 PM, zander1013  wrote:

>
> okay,
>
> here is the section for conf/hadoop-env.sh...
>
> # Set Hadoop-specific environment variables here.
>
> # The only required environment variable is JAVA_HOME.  All others are
> # optional.  When running a distributed configuration it is best to
> # set JAVA_HOME in this file, so that it is correctly defined on
> # remote nodes.
>
> # The java implementation to use.  Required.
> # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
> export JAVA_HOME=/usr/lib/jvm/default-java
>
> ...
>
> and here is what i got for output. i am trying to go through the tutorial
> at
> http://hadoop.apache.org/core/docs/current/quickstart.html
>
> here is the output...
>
> a...@node0:~/Hadoop/hadoop-0.19.0$ bin/hadoop jar hadoop-*-examples.jar grep
> input output 'dfs[a-z.]+'
> bin/hadoop: line 243: /usr/lib/jvm/default-java/bin/java: No such file or
> directory
> bin/hadoop: line 273: /usr/lib/jvm/default-java/bin/java: No such file or
> directory
> bin/hadoop: line 273: exec: /usr/lib/jvm/default-java/bin/java: cannot
> execute: No such file or directory
> a...@node0:~/Hadoop/hadoop-0.19.0$
>
> ...
>
> please advise...
>
>
>
>
> Mark Kerzner-2 wrote:
> >
> > You set it in the conf/hadoop-env.sh file, with an entry like this
> > export JAVA_HOME=/usr/lib/jvm/default-java
> >
> > Mark
> >
> > On Fri, Jan 30, 2009 at 3:49 PM, zander1013 
> wrote:
> >
> >>
> >> hi,
> >>
> >> i am new to hadoop. i am trying to set it up for the first time as a
> >> single
> >> node cluster. at present the snag is that i cannot seem to find the
> >> correct
> >> path for setting the JAVA_HOME variable.
> >>
> >> i am using ubuntu 8.10. i have tried using "whereis java" and tried
> >> setting
> >> the variable to point to those places (except the dir where i have
> >> hadoop).
> >>
> >> please advise.
> >>
> >> -zander
> >> --
> >> View this message in context:
> >> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756240.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756569.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: settin JAVA_HOME...

2009-01-30 Thread zander1013

okay,

here is the section for conf/hadoop-env.sh...

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/default-java

...

and here is what i got for output. i am trying to go through the tutorial at
http://hadoop.apache.org/core/docs/current/quickstart.html

here is the output...

a...@node0:~/Hadoop/hadoop-0.19.0$ bin/hadoop jar hadoop-*-examples.jar grep
input output 'dfs[a-z.]+'
bin/hadoop: line 243: /usr/lib/jvm/default-java/bin/java: No such file or
directory
bin/hadoop: line 273: /usr/lib/jvm/default-java/bin/java: No such file or
directory
bin/hadoop: line 273: exec: /usr/lib/jvm/default-java/bin/java: cannot
execute: No such file or directory
a...@node0:~/Hadoop/hadoop-0.19.0$ 

...

please advise...




Mark Kerzner-2 wrote:
> 
> You set it in the conf/hadoop-env.sh file, with an entry like this
> export JAVA_HOME=/usr/lib/jvm/default-java
> 
> Mark
> 
> On Fri, Jan 30, 2009 at 3:49 PM, zander1013  wrote:
> 
>>
>> hi,
>>
>> i am new to hadoop. i am trying to set it up for the first time as a
>> single
>> node cluster. at present the snag is that i cannot seem to find the
>> correct
>> path for setting the JAVA_HOME variable.
>>
>> i am using ubuntu 8.10. i have tried using "whereis java" and tried
>> setting
>> the variable to point to those places (except the dir where i have
>> hadoop).
>>
>> please advise.
>>
>> -zander
>> --
>> View this message in context:
>> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756240.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756569.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: settin JAVA_HOME...

2009-01-30 Thread Mark Kerzner
You set it in the conf/hadoop-env.sh file, with an entry like this
export JAVA_HOME=/usr/lib/jvm/default-java

Mark

On Fri, Jan 30, 2009 at 3:49 PM, zander1013  wrote:

>
> hi,
>
> i am new to hadoop. i am trying to set it up for the first time as a single
> node cluster. at present the snag is that i cannot seem to find the correct
> path for setting the JAVA_HOME variable.
>
> i am using ubuntu 8.10. i have tried using "whereis java" and tried setting
> the variable to point to those places (except the dir where i have hadoop).
>
> please advise.
>
> -zander
> --
> View this message in context:
> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756240.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


settin JAVA_HOME...

2009-01-30 Thread zander1013

hi,

i am new to hadoop. i am trying to set it up for the first time as a single
node cluster. at present the snag is that i cannot seem to find the correct
path for setting the JAVA_HOME variable.

i am using ubuntu 8.10. i have tried using "whereis java" and tried setting
the variable to point to those places (except the dir where i have hadoop).

please advise.

-zander
-- 
View this message in context: 
http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756240.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Hadoop Streaming Semantics

2009-01-30 Thread S D
Thanks for your response, Amareshwari. I'm unclear on how to take advantage
of NLineInputFormat with Hadoop Streaming. Is the idea that I modify the
streaming jar file (contrib/streaming/hadoop--streaming.jar) to
include the NLineInputFormat class and then pass a command line
configuration param to indicate that NLineInputFormat should be used? If
this is the proper approach, can you point me to an example of what kind of
param should be specified? I appreciate your help.

Thanks,
SD

On Thu, Jan 29, 2009 at 10:49 PM, Amareshwari Sriramadasu <
amar...@yahoo-inc.com> wrote:

> You can use NLineInputFormat for this, which splits one line (N=1, by
> default) as one split.
> So, each map task processes one line.
> See
> http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>
> -Amareshwari
>
> S D wrote:
>
>> Hello,
>>
>> I have a clarifying question about Hadoop streaming. I'm new to the list
>> and
>> didn't see anything posted that covers my questions - my apologies if I
>> overlooked a relevant post.
>>
>> I have an input file consisting of a list of files (one per line) that
>> need
>> to be processed independently of each other. The duration for processing
>> each file is significant - perhaps an hour each. I'm using Hadoop
>> streaming
>> without a reduce function to process each file and save the results (back
>> to
>> S3 native in my case). To handle the long processing time of each file I've
>> set mapred.task.timeout=0 and I have a pretty straightforward Ruby script
>> reading from STDIN:
>>
>> STDIN.each_line do |line|
>>   # Get file from contents of line
>>   # Process file (long running)
>> end
>>
>> Currently I'm using a cluster of 3 workers in which each worker can have
>> up
>> to 2 tasks running simultaneously. I've noticed that if I have a single
>> input file with many lines (more than 6 given my cluster), then not all
>> workers will be allocated tasks; I've noticed two workers being allocated
>> one task each and the other worker sitting idly. If I split my input file
>> into multiple files (at least 6) then all workers will be immediately
>> allocated the maximum number of tasks that they can handle.
>>
>> My interpretation on this is fuzzy. It seems that Hadoop streaming will
>> take
>> separate input files and allocate a new task per file (up to the maximum
>> constraint) but if given a single input file it is unclear as to whether a
>> new task is allocated per file or line. My understanding of Hadoop Java is
>> that (unlike Hadoop streaming) when given a single input file, the file
>> will
>> be broken up into separate lines and the maximum number of map tasks will
>> automagically be allocated to handle the lines of the file (assuming the
>> use
>> of TextInputFormat).
>>
>> Can someone clarify this?
>>
>> Thanks,
>> SD
>>
>>
>>
>
>
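
No change to the streaming jar should be needed: the streaming driver accepts
a fully qualified class name for -inputformat, so NLineInputFormat can simply
be named on the command line. A sketch against 0.19, where the jar name,
input/output paths and the Ruby mapper are placeholders, and
mapred.line.input.format.linespermap is the property NLineInputFormat reads
(it already defaults to 1, so that line is optional):

bin/hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar \
  -jobconf mapred.task.timeout=0 \
  -jobconf mapred.line.input.format.linespermap=1 \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -input file-list.txt \
  -output results \
  -mapper process_file.rb \
  -reducer NONE

Each map task then receives exactly one line of the file list, which also
gives the scheduler one task per file to spread across the workers.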


Re: 2009 Hadoop Summit?

2009-01-30 Thread Bill Au
JavaOne is scheduled for the first week of June this year.  Please keep that
in mind, since I am guessing I am not the only one who is interested in
both.

Bill

On Thu, Jan 29, 2009 at 7:45 PM, Ajay Anand  wrote:

> Yes! We are planning one for the first week of June. I will be sending
> out a note inviting talks and reaching out to presenters over the next
> couple of days. You read my mind :)
> Stay tuned.
>
> Ajay
>
> -Original Message-
> From: Bradford Stephens [mailto:bradfordsteph...@gmail.com]
> Sent: Thursday, January 29, 2009 4:34 PM
> To: core-user@hadoop.apache.org
> Subject: 2009 Hadoop Summit?
>
> Hey there,
>
> I was just wondering if there's plans for another Hadoop Summit this
> year? I went last March and learned quite a bit -- I'm excited to see
> what new things people have done since then.
>
> Cheers,
> Bradford
>


Re: How To Encrypt Hadoop Socket Connections

2009-01-30 Thread Allen Wittenauer



On 1/30/09 6:24 AM, "Brian MacKay"  wrote:
> https://issues.apache.org/jira/browse/HADOOP-2239

This ended up getting turned into "encrypting distcp" and not actually
encrypting intra-grid socket connections. (Can we "rename" a JIRA?)  If you
need that capability today, your best bet is likely IPsec or something
similar.

Hadoop has lots and lots of security holes and this is but one.



Re: Cannot run program "chmod": error=12, Not enough space

2009-01-30 Thread Allen Wittenauer
On 1/28/09 7:42 PM, "Andy Liu"  wrote:
> I'm running Hadoop 0.19.0 on Solaris (SunOS 5.10 on x86) and many jobs are
> failing with this exception:
> 
> Error initializing attempt_200901281655_0004_m_25_0:
> java.io.IOException: Cannot run program "chmod": error=12, Not enough space
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
...
> at java.lang.UNIXProcess.forkAndExec(Native Method)
> at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
> at java.lang.ProcessImpl.start(ProcessImpl.java:65)
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
> ... 20 more
> 
> However, all the disks have plenty of disk space left (over 800 gigs).  Can
> somebody point me in the right direction?

"Not enough space" is usually SysV kernel speak for "not enough virtual
memory to swap".  See how much mem you have free.
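
On Solaris 10 the quickest checks are the standard swap accounting tools
(nothing Hadoop-specific):

swap -s    # summary of virtual memory allocated, reserved and available
swap -l    # per-device swap usage

One plausible mechanism, offered as a guess rather than a diagnosis: error=12
is ENOMEM, and when the JVM fork()s to run "chmod" Solaris reserves swap for a
copy of the whole JVM address space, so a task JVM with a large heap can fail
to spawn even tiny commands while the disks sit nearly empty. Adding swap or
trimming the heap sizes (mapred.child.java.opts and the daemon heaps) are the
usual workarounds.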




Re: How does Hadoop choose machines for Reducers?

2009-01-30 Thread Nathan Marz
This is a huge problem for my application. I tried setting  
mapred.tasktracker.reduce.tasks.maximum to 1 in the job's JobConf, but  
that didn't have any effect. I'm using a custom output format and it's  
essential that Hadoop distribute the reduce tasks to make use of all  
the machines, as there is contention when multiple reduce tasks run on
one machine. Since my number of reduce tasks is guaranteed to be less  
than the number of machines in the cluster, there's no reason for  
Hadoop not to make use of the full cluster.


Does anyone know of a way to force Hadoop to distribute reduce tasks  
evenly across all the machines?



On Jan 30, 2009, at 7:32 AM, jason hadoop wrote:

Hadoop just distributes to the available reduce execution slots. I don't
believe it pays attention to what machine they are on.
I believe the plan is to take account data locality in future (ie:
distribute tasks to machines that are considered more topologically close to
their input split first, but I don't think this is available to most users.)



On Thu, Jan 29, 2009 at 7:05 PM, Nathan Marz   
wrote:


I have a MapReduce application in which I configure 16 reducers to run on
15 machines. My mappers output exactly 16 keys, IntWritable's from 0 to 15.
However, only 12 out of the 15 machines are used to run the 16 reducers (4
machines have 2 reducers running on each). Is there a way to get Hadoop to
use all the machines for reducing?
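
One detail that may explain why the JobConf setting had no effect:
mapred.tasktracker.reduce.tasks.maximum is read by each TaskTracker from its
own configuration when the daemon starts, so it is a per-node setting rather
than a per-job one. A sketch of the override, placed in conf/hadoop-site.xml
on every slave and followed by a tasktracker restart:

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>

That caps each node at one concurrent reduce, which forces the 16 reduces to
spread across the 15 machines, at the cost of the 16th waiting for a free
slot.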





Re: Question about HDFS capacity and remaining

2009-01-30 Thread Bryan Duxbury

Did you publish those results anywhere?

On Jan 30, 2009, at 9:56 AM, Brian Bockelman wrote:

For what it's worth, our organization did extensive tests on many  
filesystems benchmarking their performance when they are 90 - 95%  
full.


Only XFS retained most of its performance when it was "mostly  
full" (ext4 was not tested)... so, if you are thinking of pushing  
things to the limits, that might be something worth considering.


Brian

On Jan 30, 2009, at 11:18 AM, stephen mulcahy wrote:



Bryan Duxbury wrote:
Hm, very interesting. Didn't know about that. What's the purpose  
of the reservation? Just to give root preference or leave wiggle  
room? If it's not strictly necessary it seems like it would make  
sense to reduce it to essentially 0%.


AFAIK It is needed for defragmentation / fsck to work properly and  
your filesystem performance will degrade a lot if you reduce this  
to 0% (but I'd love to hear otherwise :)


-stephen






Reducers stuck in Shuffle ...

2009-01-30 Thread Miles Osborne
i've been seeing a lot of jobs where large numbers of reducers keep
failing at the shuffle phase due to timeouts (see a sample reducer
syslog entry below).  our setup consists of 8-core machines, with one
box acting as both a slave and a namenode.  the load on the namenode
is not at full capacity so that doesn't appear to be the problem.  we
also run 0.18.1

reducers which run on the namenode are fine, it is only those running
on slaves which seem affected.

note that i seem to get this if i vary the number of reducers run, so
it doesn't appear to be a function of the shard size

is there some flag i should modify to increase the timeout value?  or,
is this fixed in the latest release?

(i found one thread on this which talked about DNS entries and another
which mentioned HADOOP-3155)

thanks

Miles
>
2009-01-30 10:26:14,085 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=SHUFFLE, sessionId=
2009-01-30 10:26:14,229 INFO org.apache.hadoop.streaming.PipeMapRed:
PipeMapRed exec
[/disk2/hadoop/mapred/local/taskTracker/jobcache/job_200901301017_0001/attempt_200901301017_0001_r_11_0/work/./r-compute-ngram-counts]
2009-01-30 10:26:14,368 INFO org.apache.hadoop.mapred.ReduceTask:
ShuffleRamManager: MemoryLimit=78643200,
MaxSingleShuffleLimit=19660800
2009-01-30 10:26:14,488 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_200901301017_0001_r_11_0 Thread started: Thread for
merging on-disk files
2009-01-30 10:26:14,488 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_200901301017_0001_r_11_0 Thread waiting: Thread for
merging on-disk files
2009-01-30 10:26:14,488 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_200901301017_0001_r_11_0 Thread started: Thread for
merging in memory files
2009-01-30 10:26:14,489 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_200901301017_0001_r_11_0 Need another 3895 map output(s)
where 0 is already in progress
2009-01-30 10:26:14,495 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_200901301017_0001_r_11_0: Got 6 new map-outputs & number
of known map outputs is 6
2009-01-30 10:26:14,496 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_200901301017_0001_r_11_0 Scheduled 1 of 6 known outputs (0
slow hosts and 5 dup hosts)
2009-01-30 10:26:44,566 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_200901301017_0001_r_11_0 copy failed:
attempt_200901301017_0001_m_03_0 from crom.inf.ed.ac.uk
2009-01-30 10:26:44,567 WARN org.apache.hadoop.mapred.ReduceTask:
java.net.SocketTimeoutException: connect timed out
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at 
sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1296)
at java.security.AccessController.doPrivileged(Native Method)
at 
sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1290)
at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:944)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1143)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1084)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:997)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:946)
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.Socket.connect(Socket.java:519)
at sun.net.NetworkClient.doConnect(NetworkClient.java:152)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
at sun.net.www.http.HttpClient.New(HttpClient.java:306)
at sun.net.www.http.HttpClient.New(HttpClient.java:323)
at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:788)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:729)
at 
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:654)
at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:977)
... 4 more

2009-01-30 10:26:45,493 INFO org.apache.hadoop.mapred.ReduceTask: Task
att

Re: Question about HDFS capacity and remaining

2009-01-30 Thread Edward Capriolo
Very interesting note for a new cluster checklist. Good to tune the
file system down from 5%.

On a related note some operating systems ::cough:: FreeBSD will report
negative disk space when you go over the quota. What does that mean?

We run nagios with NRPE to run remote disk checks. We configure our
alarms to warn at 5%. Imagine a process keeps using a disk.

Alarm 5%, Alarm 4%, Alarm 3%, Alarm 2%, Alarm 1%, Alarm cleared 103%.
NRPE assumes disk % is 0-100%; comically, if you drop below 100% the
monitor thinks the disk is fine again. This is like a signed/unsigned
bug.

I mention this not to show off, As people deploy hadoop on other
platforms issues like this might crop up. I am not saying it is an
issue, but it could be.


Re: Question about HDFS capacity and remaining

2009-01-30 Thread Brian Bockelman
For what it's worth, our organization did extensive tests on many  
filesystems benchmarking their performance when they are 90 - 95% full.


Only XFS retained most of its performance when it was "mostly  
full" (ext4 was not tested)... so, if you are thinking of pushing  
things to the limits, that might be something worth considering.


Brian

On Jan 30, 2009, at 11:18 AM, stephen mulcahy wrote:



Bryan Duxbury wrote:
Hm, very interesting. Didn't know about that. What's the purpose of  
the reservation? Just to give root preference or leave wiggle room?  
If it's not strictly necessary it seems like it would make sense to  
reduce it to essentially 0%.


AFAIK It is needed for defragmentation / fsck to work properly and  
your filesystem performance will degrade a lot if you reduce this to  
0% (but I'd love to hear otherwise :)


-stephen




Re: Question about HDFS capacity and remaining

2009-01-30 Thread Doug Cutting

Bryan Duxbury wrote:
Hm, very interesting. Didn't know about that. What's the purpose of the 
reservation? Just to give root preference or leave wiggle room?


I think it's so that, when the disk is full, root processes don't fail, 
only user processes.  So you don't lose, e.g., syslog.  With modern 
disks, 5% is too much, especially for volumes that are only used for 
user data.  You can safely set this to 1%.


Doug
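
For an ext2/ext3 data volume that is the tune2fs -m option; a sketch, where
the device name is a placeholder for whichever partition holds dfs.data.dir:

tune2fs -l /dev/sdb1 | grep -i 'reserved block'   # show the current reservation
tune2fs -m 1 /dev/sdb1                            # drop reserved blocks to 1%

tune2fs can normally be applied to a mounted filesystem, but as with any
filesystem tuning it is worth trying on a single node first.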


Seeing DiskErrorException, but no real error appears to be happening

2009-01-30 Thread John Lee
Folks,

I'm seeing a lot of the following exceptions in my Hadoop logs when I
run jobs under Hadoop 0.19. I don't recall seeing this in Hadoop 0.18:

org.apache.hadoop.mapred.TaskTracker:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_200901131804_0215/attempt_200901131804_0215_r_09_0/output/file.out
in any of the configured local directories

My understanding is that this means the reduce outputs can't be found
for some reason. However, the jobs seem to complete successfully and
the output is fine. I've double checked my configuration and can't
find any errors or problems. Is this pretty normal behavior? Is there
anything that might cause this other than misconfiguration? I'm trying
to decide if a bug needs to be filed.

Thanks,
John


Re: Question about HDFS capacity and remaining

2009-01-30 Thread stephen mulcahy


Bryan Duxbury wrote:
Hm, very interesting. Didn't know about that. What's the purpose of the 
reservation? Just to give root preference or leave wiggle room? If it's 
not strictly necessary it seems like it would make sense to reduce it to 
essentially 0%.


AFAIK It is needed for defragmentation / fsck to work properly and your 
filesystem performance will degrade a lot if you reduce this to 0% (but 
I'd love to hear otherwise :)


-stephen



Re: Question about HDFS capacity and remaining

2009-01-30 Thread Bryan Duxbury
Hm, very interesting. Didn't know about that. What's the purpose of  
the reservation? Just to give root preference or leave wiggle room?  
If it's not strictly necessary it seems like it would make sense to  
reduce it to essentially 0%.


-Bryan

On Jan 29, 2009, at 6:18 PM, Doug Cutting wrote:

Ext2 by default reserves 5% of the drive for use by root only.   
That'd be 45GB of your 907GB capacity which would account for most
of the discrepancy.  You can adjust this with tune2fs.


Doug

Bryan Duxbury wrote:

There are no non-dfs files on the partitions in question.
df -h indicates that there is 907GB capacity, but only 853GB  
remaining, with 200M used. The only thing I can think of is the  
filesystem overhead.

-Bryan
On Jan 29, 2009, at 4:06 PM, Hairong Kuang wrote:

It's taken by non-dfs files.

Hairong


On 1/29/09 3:23 PM, "Bryan Duxbury"  wrote:


Hey all,

I'm currently installing a new cluster, and noticed something a
little confusing. My DFS is *completely* empty - 0 files in DFS.
However, in the namenode web interface, the reported "capacity" is
3.49 TB, but the "remaining" is 3.25TB. Where'd that .24TB go? There
are literally zero other files on the partitions hosting the DFS data
directories. Where am I losing 240GB?

-Bryan






problem with completion notification from block movement

2009-01-30 Thread Karl Kleinpaste
We have a small test cluster, a double master (NameNode+JobTracker) plus
2 slaves, running 0.18.1.  We are seeing an intermittent problem where
our application logs failures out of DFSClient, thus:

2009-01-30 01:59:42,072 WARN org.apache.hadoop.dfs.DFSClient:
DFSOutputStream ResponseProcessor exception  for block
blk_7603130349014268849_2349933 java.net.SocketTimeoutException: 66000
millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.0.10.102:54700
remote=/10.0.10.108:50010]
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:162)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
at java.io.DataInputStream.readFully(DataInputStream.java:176)
at java.io.DataInputStream.readLong(DataInputStream.java:380)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream
$ResponseProcessor.run(DFSClient.java:2044)

(Apologies for paste formatting.  I hate Evolution.)

Our application here is our "JobConsole," which is responsible for
taking notifications from an external data-generating application: The
external app scribbles files into DFS and then tells JobConsole about
them.  JobConsole submits jobs to crunch that data in response to the
external app's notifications of data availability.  JobConsole runs on
the master node.

Chasing that block identifier through our JobConsole log plus the
DataNode logs on the slaves, we have an odd timeline, which is this:
01:58:32  slave (.108, above): receiving blk from master (.102)
01:58:35  other slave (.107): receiving blk from .108
01:58:36  .107: received blk
01:58:38  .107: terminate PacketResponder
01:59:42  JobConsole (.102): 66s t.o. + Error Recovery (above)
01:59:42  .107: invoke recoverBlock on that blk
02:01:15  .108: received blk + terminate PacketResponder
03:03:24  .108: deleting blk, from Linux pathname in DFS storage

What's clear from this is that .108 got the block quickly, because it
was in a position immediately to send a copy to .107, which responded
promptly enough to say that it was in possession.  But .108's DataNode
sat on the block for a full 3 minutes before announcing what appears to
have been ordinary completion and responder termination.  After the
first minute-plus of that long period, JobConsole gave up and did a
recovery operation, which appears to work.  If .108's DataNode sent a
notification when it finally logged completed reception, no doubt there
was nobody listening for it any more.

What's particularly of interest to us is that the NameNode log shows us
that the data being moved is job.jar:

2009-01-30 01:58:32,353 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.allocateBlock: 
/usr/local/rinera/hadoop/hadoop-runtime/0/mapred/system/job_200901291752_3021/job.jar.
 blk_7603130349014268849_2349933

Note block name and timestamp.

Does anyone else have knowledge or history with such glitches?  We've
recently begun seeing a number of problems in communication between task
management processes and DFS that previously had not been seen, and
we're trying to nail down where they're coming from, without success.



Re: decommissioned node showing up ad dead node in web based interface to namenode (dfshealth.jsp)

2009-01-30 Thread Bill Au
Alyssa,
 I am not trying to revive the dead node.  I want to permanently remove
a node from the cluster.  But after decommissioning it, it shows up as a
dead node until I restart the cluster.  I am looking for a way to get rid of
it from the dfshealth.jsp page without having to restart the cluster.

Bill

On Thu, Jan 29, 2009 at 5:45 PM, Hargraves, Alyssa  wrote:

> Bill-
>
> I believe once the node is decommissioned you'll also have to run
> bin/hadoop-daemon.sh start datanode and bin/hadoop-daemon.sh start
> tasktracker (both run on the slave node, not master) to revive the dead
> node.  Just removing it from exclude and refreshing doesn't work for me
> either, but with those two additional commands it does.
>
> - Alyssa
> 
> From: Bill Au [bill.w...@gmail.com]
> Sent: Thursday, January 29, 2009 5:40 PM
> To: core-user@hadoop.apache.org
> Subject: Re: decommissioned node showing up ad dead node in web based
> interface to namenode (dfshealth.jsp)
>
> Not sure why but this does not work for me.  I am running 0.18.2.  I ran
> hadoop dfsadmin -refreshNodes after removing the decommissioned node from
> the exclude file.  It still shows up as a dead node.  I also removed it
> from
> the slaves file and ran the refresh nodes command again.  It still shows up
> as a dead node after that.
>
> I am going to upgrade to 0.19.0 to see if it makes any difference.
>
> Bill
>
> On Tue, Jan 27, 2009 at 7:01 PM, paul  wrote:
>
> > Once the nodes are listed as dead, if you still have the host names in
> your
> > conf/exclude file, remove the entries and then run hadoop dfsadmin
> > -refreshNodes.
> >
> >
> > This works for us on our cluster.
> >
> >
> >
> > -paul
> >
> >
> > On Tue, Jan 27, 2009 at 5:08 PM, Bill Au  wrote:
> >
> > > I was able to decommission a datanode successfully without having to
> stop
> > > my
> > > cluster.  But I noticed that after a node has been decommissioned, it
> > shows
> > > up as a dead node in the web base interface to the namenode (ie
> > > dfshealth.jsp).  My cluster is relatively small and losing a datanode
> > will
> > > have performance impact.  So I have a need to monitor the health of my
> > > cluster and take steps to revive any dead datanode in a timely fashion.
> >  So
> > > is there any way to altogether "get rid of" any decommissioned datanode
> > > from
> > > the web interace of the namenode?  Or is there a better way to monitor
> > the
> > > health of the cluster?
> > >
> > > Bill
> > >
> >
>
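
For reference, the live/dead datanode view that dfshealth.jsp renders can
also be pulled programmatically, which sidesteps scraping the web page for
cluster-health monitoring.  A minimal sketch against the 0.18-era API
(package and method names are from org.apache.hadoop.dfs and may differ in
later releases; the 10-minute staleness threshold is illustrative, not the
namenode's exact value):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.dfs.DatanodeInfo;
import org.apache.hadoop.dfs.DistributedFileSystem;
import org.apache.hadoop.fs.FileSystem;

public class DatanodeStatus {
  public static void main(String[] args) throws Exception {
    // Reads fs.default.name from hadoop-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // Same per-datanode report that backs dfshealth.jsp and "hadoop dfsadmin -report".
    DatanodeInfo[] nodes = dfs.getDataNodeStats();
    long now = System.currentTimeMillis();
    for (DatanodeInfo node : nodes) {
      // A node with no recent heartbeat is what the UI marks as dead;
      // 10 minutes is an illustrative cutoff.
      boolean stale = (now - node.getLastUpdate()) > 10L * 60 * 1000;
      System.out.println(node.getHostName()
          + (node.isDecommissioned() ? " [decommissioned]" : "")
          + (stale ? " [no recent heartbeat]" : " [alive]"));
    }
    fs.close();
  }
}

If the namenode still lists a removed node until restart (as Bill
describes), it will show up in this report too, but filtering on
last-contact age and decommission state at least keeps external monitoring
from treating an intentionally removed machine as an unexpected failure.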


Re: How does Hadoop choose machines for Reducers?

2009-01-30 Thread jason hadoop
Hadoop just distributes reduce tasks to the available reduce execution
slots; I don't believe it pays attention to which machines those slots are on.
I believe the plan is to take data locality into account in the future (i.e.,
distribute tasks first to machines that are topologically closer to their
input split), but I don't think this is available to most users yet.
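
If the goal is simply to avoid doubling up reducers on some nodes, one
workaround is to size the job's reducer count from the cluster instead of
hard-coding 16.  A rough sketch with the old JobConf/JobClient API (the
surrounding job setup is assumed, and note that with exactly 16 distinct
keys a 15-reducer job necessarily puts two keys on one reducer):

import java.io.IOException;

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ReducerSizing {
  public static void sizeReducers(JobConf job) throws IOException {
    JobClient client = new JobClient(job);
    ClusterStatus cluster = client.getClusterStatus();

    // One reducer per live tasktracker: with 15 nodes this gives 15 reducers,
    // so no node is forced to run two.  The scheduler still makes no placement
    // guarantee; it only hands tasks to whichever slots ask for work.
    job.setNumReduceTasks(cluster.getTaskTrackers());

    // Alternatively, cluster.getMaxReduceTasks() is the total reduce-slot
    // capacity, if the aim is to fill every slot instead.
  }
}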


On Thu, Jan 29, 2009 at 7:05 PM, Nathan Marz  wrote:

> I have a MapReduce application in which I configure 16 reducers to run on
> 15 machines. My mappers output exactly 16 keys, IntWritable's from 0 to 15.
> However, only 12 out of the 15 machines are used to run the 16 reducers (4
> machines have 2 reducers running on each). Is there a way to get Hadoop to
> use all the machines for reducing?
>


Re: Hadoop & perl

2009-01-30 Thread Keita Higashi

Hello!!

Are you aware of hadoop-streaming.jar?
If not, I recommend that you use it.
Its usage is documented at
http://hadoop.apache.org/core/docs/r0.18.3/streaming.html.


ex:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar  \
-input httpd_logs \
-output logc_output \
-mapper /home/hadoop/work/hadoop/analog/map.pl \
-reducer /home/hadoop/work/hadoop/analog/reduce.pl \
-inputformat TextInputFormat \
-outputformat TextOutputFormat


Thank you.


- Original Message - 
From: "Daren" 

To: 
Sent: Friday, January 30, 2009 7:53 PM
Subject: Hadoop & perl



Just started using Hadoop and want to use Perl to interface with it.

Thriftfs has some Perl modules which claim to be able to work with the
Thrift server!

Unfortunately I haven't been able to get this to work and was wondering if
anyone out there can give me some advice as to how to get a Perl interface
to work, if indeed it's possible?


da...@adestra.com 




Re: How To Encrypt Hadoop Socket Connections

2009-01-30 Thread Darren Govoni
One alternative might be to use openvpn and bind the hadoop services to
the private VPN interface address openvpn assigns the machine. All
traffic over that IP address is thus encrypted and secured.

On Fri, 2009-01-30 at 09:24 -0500, Brian MacKay wrote:
> Hello,
> 
> Found some archive posts regarding "encrypt Hadoop socket connections"
> 
> https://issues.apache.org/jira/browse/HADOOP-2239
> 
> http://markmail.org/message/pmn23y4b3gdxcpif
> 
> Couldn't find any documentation or Junit tests.  Does anyone know the
> proper configuration changes to make?
> 
> It seems like the following are needed in hadoop-site.xml?
> 
> https.keystore.info.rsrc  = should reference an external config file, in
> this example called sslinfo.xml ?
> 
> https.keystore.password  =  ?
> https.keystore.keypassword  = ?
> 
> ---
> Snippet from org.apache.hadoop.dfs.DataNode
> 
>   void startDataNode(Configuration conf, 
>  AbstractList dataDirs
>  ) throws IOException {
> 
>...
>sslConf.addResource(conf.get("https.keystore.info.rsrc",
> "sslinfo.xml"));
> String keyloc = sslConf.get("https.keystore.location");
> if (null != keyloc) {
>   this.infoServer.addSslListener(secInfoSocAddr, keyloc,
>   sslConf.get("https.keystore.password", ""),
>   sslConf.get("https.keystore.keypassword", ""));
> --



How To Encrypt Hadoop Socket Connections

2009-01-30 Thread Brian MacKay
Hello,

Found some archive posts regarding "encrypt Hadoop socket connections"

https://issues.apache.org/jira/browse/HADOOP-2239

http://markmail.org/message/pmn23y4b3gdxcpif

Couldn't find any documentation or Junit tests.  Does anyone know the
proper configuration changes to make?

It seems like the following are needed in hadoop-site.xml?

https.keystore.info.rsrc  = should reference an external config file, in
this example called sslinfo.xml ?

https.keystore.password  =  ?
https.keystore.keypassword  = ?

---
Snippet from org.apache.hadoop.dfs.DataNode

  void startDataNode(Configuration conf, 
 AbstractList dataDirs
 ) throws IOException {

   ...
   sslConf.addResource(conf.get("https.keystore.info.rsrc",
"sslinfo.xml"));
String keyloc = sslConf.get("https.keystore.location");
if (null != keyloc) {
  this.infoServer.addSslListener(secInfoSocAddr, keyloc,
  sslConf.get("https.keystore.password", ""),
  sslConf.get("https.keystore.keypassword", ""));
--




local path

2009-01-30 Thread Hakan Kocakulak
Hello,
How can I read from and write directly to the datanode's local path?

Thanks,
Hakan
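
There is no reply in this thread; for what it's worth, the files under a
datanode's local storage directory (dfs.data.dir) are block files managed by
the datanode itself and aren't meant to be read or written directly.  The
supported route is the FileSystem client API, which writes and reads through
the namenode and datanodes.  A minimal sketch (the path below is made up for
illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // picks up fs.default.name
    FileSystem fs = FileSystem.get(conf);

    Path p = new Path("/tmp/example.txt");        // illustrative path
    FSDataOutputStream out = fs.create(p, true);  // write: blocks land on datanodes
    out.writeUTF("hello dfs");
    out.close();

    FSDataInputStream in = fs.open(p);            // read: client streams from datanodes
    System.out.println(in.readUTF());
    in.close();
    fs.close();
  }
}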


Hadoop & perl

2009-01-30 Thread Daren

Just started using Hadoop and want to use Perl to interface with it.

Thriftfs has some Perl modules which claim to be able to work with the
Thrift server!

Unfortunately I haven't been able to get this to work and was wondering
if anyone out there can give me some advice as to how to get a Perl
interface to work, if indeed it's possible?


da...@adestra.com


Re: [ANNOUNCE] Hadoop release 0.18.3 available

2009-01-30 Thread Amareshwari Sriramadasu

Anum Ali wrote:

Hi,


I need some guidance on getting started with Hadoop installation and
system setup. I am a newbie to Hadoop. Our system OS is Fedora 8;
should I start from a stable release of Hadoop or get it from the svn
development version (from the contribute site)?



Thank You



  

Download a stable release from http://hadoop.apache.org/core/releases.html
For installation and setup, you can see 
http://hadoop.apache.org/core/docs/current/quickstart.html and 
http://hadoop.apache.org/core/docs/current/cluster_setup.html


-Amareshwari

On Thu, Jan 29, 2009 at 7:38 PM, Nigel Daley  wrote:

  

Release 0.18.3 fixes many critical bugs in 0.18.2.

For Hadoop release details and downloads, visit:
http://hadoop.apache.org/core/releases.html

Hadoop 0.18.3 Release Notes are at
http://hadoop.apache.org/core/docs/r0.18.3/releasenotes.html

Thanks to all who contributed to this release!

Nigel