Re: Reducer stuck at pending state

2010-02-17 Thread Song Liu
Hi Todd,

I'm using Hadoop 0.20.1, the Apache distribution.
I didn't set the property you mentioned, so I think it should remain at the
default (1G?).
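
For reference, if that property did turn out to need tuning, a minimal way to
set it per job from a Java driver is sketched below. This is only an
illustration: the property name is the one Todd asks about, but the 1 GB heap
value is a placeholder, not a recommendation for this cluster.

  import org.apache.hadoop.mapred.JobConf;

  public class ReducerHeapExample {
    public static void main(String[] args) {
      JobConf conf = new JobConf(ReducerHeapExample.class);
      // Raise the child JVM heap for this job's map and reduce tasks.
      // The value is illustrative only.
      conf.set("mapred.child.java.opts", "-Xmx1024m");
      // ... set input/output paths, mapper/reducer classes, then submit
      // with JobClient.runJob(conf).
    }
  }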

The cluster I'm playing with has four master nodes and 96 slave nodes
physically. Hadoop uses one master node for the namenode and jobtracker, and
picks 12 nodes for its datanodes and tasktrackers.

Interestingly, I noticed the hardware specification is a little different
between the master and slave machines. So I moved the namenode and
jobtracker to one of the slaves, and the problem seems solved. (My program runs
normally SO FAR.)

However, I cannot find the concrete hardware configuration for each node,
but I guess the differences are mainly in the CPUs or RAM.

These are copied from the cluster's specification manual:

Slaves:

each with two 2.6 GHz dual-core Opteron processors, 8 GB RAM, 16 GB swap
space and 50 GB of local scratch space

Masters:

each with four 2.6 GHz dual-core Opteron processors, 32 GB RAM, 64 GB swap
space, 64 GB of local scratch space

Can you see what the problem is?

Thanks a lot.
Regards
Song Liu

On Wed, Feb 17, 2010 at 4:18 AM, Todd Lipcon t...@cloudera.com wrote:

 Hi Song,

 What version are you running? How much memory have you allocated to
 the reducers in mapred.child.java.opts?

 -Todd

 On Tue, Feb 16, 2010 at 4:01 PM, Song Liu lamfeeli...@gmail.com wrote:
  Sorry, it seems no attachments are allowed, so I'll paste it here:
 
   Jobid     Priority  User    Name    Map %     Map    Maps       Reduce %  Reduce  Reduces    Job Scheduling
                                       Complete  Total  Completed  Complete  Total   Completed  Information
   job_2...  NORMAL    sl9885  TF/IDF  100.00%   26     26         0.00%     1       0          NA
   job_2...  NORMAL    sl9885  Rank    100.00%   22     22         0.00%     1       0          NA
   job_2...  NORMAL    sl9885  TF/IDF  100.00%   20     20         0.00%     1       0          NA
 
  The format is horrible, sorry for that, but it's the best I can do :(
 
  BTW, I guess it should not be my program's problem, since I have tested
 it
  on some other clusters before.
 
  Regards
  Song Liu
 
  On Tue, Feb 16, 2010 at 11:51 PM, Song Liu lamfeeli...@gmail.com
 wrote:
 
  Hi all, I recently met a problem where, sometimes, the reducers hang at
  the pending state, with 0% complete.
 
  It seems all the mappers are completely done, but just when it is about to
  start the reducer, the reducer gets stuck, staying in the pending state
  without any warnings or errors.
 
  I have a cluster with 12 nodes, but this situation only appears when the
  scale of the data is large (2 GB or more); smaller cases never hit this
  problem.
 
  Has anyone met this issue before? I searched JIRA; someone reported this
  issue before, but no solution was given. (
  https://issues.apache.org/jira/browse/MAPREDUCE-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647230#action_12647230
  )
 
  The typical case of this issue is captured in the attachment.
 
  Regards
  Song Liu
 
 



Need your Help sir

2010-02-17 Thread tiru murugan
Dear sir,

I want your help: I want to build hadoop core using Eclipse. hadoop-core is
now divided into hadoop-common, hadoop-hdfs, and hadoop-mapreduce. I have
tried many times; hadoop-common and hadoop-mapreduce build successfully, and
hadoop-hdfs also builds successfully.
My doubt is this: when I build the hadoop-mapreduce project, it automatically
creates the jar file hadoop-mapred-0.22.0-SNAPSHOT.jar, but when I build the
hadoop-hdfs project, it does not create a jar file inside the build folder.
Why is that?

please help me

Thank you


with Regards
VTM


Issue with Hadoop cluster on Amazon ec2

2010-02-17 Thread viral shah
Hi,

We have deployed a Hadoop cluster on EC2, Hadoop version 0.20.1.
We have a couple of data nodes.
We want to get some files from a data node running on an Amazon EC2 instance
to our local machine using a Java application, which in turn uses
SequenceFile.Reader to read the files.
The problem is that Amazon uses private IPs for host communication, but to
connect from an environment outside Amazon we have to use the public IPs.
So when we try to connect to the data nodes via the name node, it reports the
data nodes' private IPs, and using those we are not able to reach the data
nodes.
Is there any way we can set the name node to report the data nodes' public
NAT IPs rather than the internal IPs, or is there any other workaround for
this problem?

Thanks
Viral.


Re: Issue with Hadoop cluster on Amazon ec2

2010-02-17 Thread Steve Loughran

viral shah wrote:

Hi,

We have deployed a Hadoop cluster on EC2, Hadoop version 0.20.1.
We have a couple of data nodes.
We want to get some files from a data node running on an Amazon EC2 instance
to our local machine using a Java application, which in turn uses
SequenceFile.Reader to read the files.
The problem is that Amazon uses private IPs for host communication, but to
connect from an environment outside Amazon we have to use the public IPs.
So when we try to connect to the data nodes via the name node, it reports the
data nodes' private IPs, and using those we are not able to reach the data
nodes.


That's a feature to stop you accidentally exporting your entire HDFS 
filesystem to the rest of the world.



Is there any way we can set the name node to report the data nodes' public
NAT IPs rather than the internal IPs, or is there any other workaround for
this problem?


- Push the data up to the S3 filestore first, and have the job sequence start
from S3 and finish there too.
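
A minimal sketch of what that can look like in a job driver, using the old
mapred API that appears elsewhere in this digest. The bucket name, paths and
the s3n credential properties shown are assumptions for illustration only:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;

  public class S3JobSketch {
    public static void main(String[] args) {
      JobConf conf = new JobConf(S3JobSketch.class);
      // Credentials for the S3 native filesystem (0.20-era property names).
      conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");      // placeholder
      conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");  // placeholder
      // Read input from, and write output back to, S3 so nothing has to be
      // pulled out of HDFS through the EC2-internal datanode addresses.
      FileInputFormat.setInputPaths(conf, new Path("s3n://example-bucket/input"));
      FileOutputFormat.setOutputPath(conf, new Path("s3n://example-bucket/output"));
      // ... set mapper/reducer classes, then submit with JobClient.runJob(conf).
    }
  }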




Difficulty connecting Hadoop JMX service

2010-02-17 Thread viral shah
I want to monitor my hadoop cluster services using check_jmx nagios plugin.
I use following env. variables in the hadoop-env.sh file
export HADOOP_OPTS="-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false"
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote
$HADOOP_NAMENODE_OPTS -Dcom.sun.management.jmxremote.port=8004"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote
$HADOOP_SECONDARYNAMENODE_OPTS -Dcom.sun.management.jmxremote.port=8005"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote
$HADOOP_DATANODE_OPTS -Dcom.sun.management.jmxremote.port=8006"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote
$HADOOP_BALANCER_OPTS -Dcom.sun.management.jmxremote.port=8007"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote
$HADOOP_JOBTRACKER_OPTS -Dcom.sun.management.jmxremote.port=8008"
export HADOOP_TASKTRACKER_OPTS="-Dcom.sun.management.jmxremote.port=8009"

But the problem I am facing is that my Hadoop machine is behind a firewall and
I can't open multiple ports. The JMX RMI connector opens two ports: one is
for the RMI registry, and it's the port that you usually supply with the
-Dcom.sun.management.jmxremote.port=<port> property. The other port is used
to export JMX RMI connection objects. This second port is usually
dynamically allocated at random. So I am not able to connect using JConsole
or the check_jmx plugin.

I tried using the example provided at
http://blogs.sun.com/jmxetc/entry/connecting_through_firewall_using_jmx, by
changing the env. variable like this:

export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote
-Dexample.rmi.agent.port=3000
-javaagent:/root/install/asl-hadoop-0.20.1/lib/CustomAgent.jar
$HADOOP_NAMENODE_OPTS"

I created the CustomAgent.jar file using the above-mentioned blog entry.
Then when I start the Hadoop cluster using bin/start-all.sh, I get the
following error.

/root/install/asl-hadoop-0.20.1/bin/hadoop-daemon.sh: line 96:  8983
Aborted                 nohup nice -n $HADOOP_NICENESS
"$HADOOP_HOME"/bin/hadoop --config $HADOOP_CONF_DIR $command "$@" > "$log"
2>&1 < /dev/null
Create RMI registry on port 3000
Get the platform's MBean server
Initialize the environment map
Create an RMI connector server
Start the RMI connector server on port 3000
service:jmx:rmi://domU-12-31-38-00-B4-F8:3000/jndi/rmi://domU-12-31-38-00-B4-F8:3000/jmxrmi
Create RMI registry on port 3000
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
localhost: starting datanode, logging to
/root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-datanode-domU-12-31-38-00-B4-F8.out
localhost: starting secondarynamenode, logging to
/root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-secondarynamenode-domU-12-31-38-00-B4-F8.out
starting jobtracker, logging to
/root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-jobtracker-domU-12-31-38-00-B4-F8.out
localhost: starting tasktracker, logging to
/root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-tasktracker-domU-12-31-38-00-B4-F8.out

Can someone help me with what I am doing wrong?

Thanks,
Viral.


Re: Difficulty connecting Hadoop JMX service

2010-02-17 Thread Edward Capriolo
On Wed, Feb 17, 2010 at 11:22 AM, viral shah viral21...@gmail.com wrote:
 I want to monitor my hadoop cluster services using check_jmx nagios plugin.
 I use following env. variables in the hadoop-env.sh file
 export HADOOP_OPTS="-Dcom.sun.management.jmxremote.authenticate=false
 -Dcom.sun.management.jmxremote.ssl=false"
 # Command specific options appended to HADOOP_OPTS when specified
 export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote
 $HADOOP_NAMENODE_OPTS -Dcom.sun.management.jmxremote.port=8004"
 export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote
 $HADOOP_SECONDARYNAMENODE_OPTS -Dcom.sun.management.jmxremote.port=8005"
 export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote
 $HADOOP_DATANODE_OPTS -Dcom.sun.management.jmxremote.port=8006"
 export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote
 $HADOOP_BALANCER_OPTS -Dcom.sun.management.jmxremote.port=8007"
 export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote
 $HADOOP_JOBTRACKER_OPTS -Dcom.sun.management.jmxremote.port=8008"
 export HADOOP_TASKTRACKER_OPTS="-Dcom.sun.management.jmxremote.port=8009"

 but the problem I am facing is that my hadoop machine is behind firewall and
 I can't open multiple ports. The JMX RMI connector opens two ports: one is
 for the RMI registry, and it's the port that you usually supply with the
 -Dcom.sun.management.jmxremote.port=port property. The other port is used
 to export JMX RMI connection objects. This second port is usually
 dynamically allocated at random. So I am not able to connect using Jconsole
 or check_jmx plugin.

 I tried using example provided at *
 http://blogs.sun.com/jmxetc/entry/connecting_through_firewall_using_jmx*, by
 changing env. variable like this export
 HADOOP_NAMENODE_OPTS=-Dcom.sun.management.jmxremote
 -Dexample.rmi.agent.port=3000
 -javaagent:/root/install/asl-hadoop-0.20.1/lib/CustomAgent.jar
 $HADOOP_NAMENODE_OPTS
 The CustomAgent.jar file I created using above mentioned blog entry.
 Then when I start hadoop cluster using bin/start-all.sh I get following
 error.

 /root/install/asl-hadoop-0.20.1/bin/hadoop-daemon.sh: line 96:  8983
 Aborted                 nohup nice -n $HADOOP_NICENESS
 "$HADOOP_HOME"/bin/hadoop --config $HADOOP_CONF_DIR $command "$@" > "$log"
 2>&1 < /dev/null
 Create RMI registry on port 3000
 Get the platform's MBean server
 Initialize the environment map
 Create an RMI connector server
 Start the RMI connector server on port 3000
 service:jmx:rmi://domU-12-31-38-00-B4-F8:3000/jndi/rmi://domU-12-31-38-00-B4-F8:3000/jmxrmi
 Create RMI registry on port 3000
 Exception in thread main java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 localhost: starting datanode, logging to
 /root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-datanode-domU-12-31-38-00-B4-F8.out
 localhost: starting secondarynamenode, logging to
 /root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-secondarynamenode-domU-12-31-38-00-B4-F8.out
 starting jobtracker, logging to
 /root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-jobtracker-domU-12-31-38-00-B4-F8.out
 localhost: starting tasktracker, logging to
 /root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-tasktracker-domU-12-31-38-00-B4-F8.out

 can some help me, what I am doing wrong.

 Thanks,
 Viral.


Yikes,

That is a rather hairy problem.
One possible workaround: if you use SSL, you might sidestep the RMI
issues. (Do not quote me on that.)
http://java.sun.com/j2se/1.5.0/docs/guide/management/agent.html

I have always chosen to do my monitoring from the same subnet as the
cluster to avoid the NAT issues. Please do post your findings, as this
would be helpful to everyone doing JMX monitoring.
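
For reference, the fixed-port agent approach in the Sun blog Viral linked
usually boils down to something like the sketch below: it pins both the RMI
registry and the exported RMI server objects to one known port, so only a
single hole has to be opened in the firewall. This is a reconstruction under
that assumption, not the actual CustomAgent.jar from the blog; the
example.rmi.agent.port property name is taken from Viral's command line.

  import java.lang.management.ManagementFactory;
  import java.net.InetAddress;
  import java.rmi.registry.LocateRegistry;
  import java.util.HashMap;
  import javax.management.MBeanServer;
  import javax.management.remote.JMXConnectorServer;
  import javax.management.remote.JMXConnectorServerFactory;
  import javax.management.remote.JMXServiceURL;

  public class FixedPortJmxAgent {
    // Loaded via -javaagent:...; Premain-Class must point here in the manifest.
    public static void premain(String agentArgs) throws Exception {
      int port = Integer.parseInt(System.getProperty("example.rmi.agent.port", "3000"));
      String host = InetAddress.getLocalHost().getHostName();

      // RMI registry on the fixed port.
      LocateRegistry.createRegistry(port);

      // Export the platform MBeanServer through a connector whose server
      // objects are bound to the same fixed port.
      MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
      JMXServiceURL url = new JMXServiceURL(
          "service:jmx:rmi://" + host + ":" + port
          + "/jndi/rmi://" + host + ":" + port + "/jmxrmi");
      JMXConnectorServer cs = JMXConnectorServerFactory.newJMXConnectorServer(
          url, new HashMap<String, Object>(), mbs);
      cs.start();
      System.out.println("JMX connector server started at " + url);
    }
  }

With something like this in place, jconsole or check_jmx only needs to reach
that one port per daemon.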


Re: Hadoop automatic job status check and notification?

2010-02-17 Thread jiang licht
Amogh, this really helps me a lot! Thanks!

So, in summary, I guess the following are the options for job notification
or, more generally, job management. I also guess Oozie / Cascading is the
better choice when we need to handle these things externally. Anyway, without
deep exploration of all these options, I certainly may have misunderstandings.
Please correct me :)

- Prepare an external script that polls job status by calling hadoop job
[-list | -status | etc.] at a regular pace and takes actions accordingly.
(pros: simple; cons: needs to poll status, not event-driven)

- Within a Hadoop job written in Java, make calls to the appropriate job
control functions to send out job status messages if wanted. (pros:
straightforward; cons: only for jobs in Java)

- Use Oozie / Cascading to organize the flow of Hadoop jobs and other
housekeeping jobs (e.g. pull back results, clean up, shut down clusters,
re-execute jobs on failure, etc.). (pros: powerful, can handle job control
outside of jobs written in Java/Pig; cons: learning curve?)

- Embedded pig (pros: works for jobs in pig scripts, cons: works for jobs in 
pig scripts)

- What else?

--
Michael

--- On Wed, 2/17/10, Amogh Vasekar am...@yahoo-inc.com wrote:

From: Amogh Vasekar am...@yahoo-inc.com
Subject: Re: Hadoop automatic job status check and notification?
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Date: Wednesday, February 17, 2010, 2:45 AM

Hi,
In our case we launched Pig from perl script and handled re-execution, clean-up 
etc. from there. If you need to implement a workflow or DAG like model, 
consider looking at Oozie / cascading. If you are interested in diving little 
deeper, you can try embedded pig.

Amogh


On 2/17/10 1:53 PM, jiang licht licht_ji...@yahoo.com wrote:

Thanks Amogh.

So, I think the following will do the job:
public void setJobEndNotificationURI(String uri)

But what about hadoop jobs written in PIG scripts? Since PIG will take
control, is there some convenient way to do the same thing as well?

Thanks!
--
Michael

--- On Wed, 2/17/10, Amogh Vasekar am...@yahoo-inc.com wrote:

From: Amogh Vasekar am...@yahoo-inc.com
Subject: Re: Hadoop automatic job status check and notification?
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Date: Wednesday, February 17, 2010, 12:44 AM

Hi,
When you submit a job to the cluster, you can control the behavior for blocking 
/ return using JobClient's submitJob, runJob methods. It will also let you know 
if the job was successful or failed, so you can design your follow up scripts 
accordingly.


Amogh


On 2/17/10 11:01 AM, jiang licht licht_ji...@yahoo.com wrote:

New to Hadoop (now using 0.20.1), I want to do the following:

Automatic status check and notification of hadoop jobs such that e.g. when a 
job is finished, a script can be trigged so that job results can be 
automatically pulled back to local machines and expensive hadoop cluster can be 
released or shutdown.

So, what is the best way to do this?

Thanks!
--
Michael












  

Re: LZO compression for Map output in Hadoop 0.20+?

2010-02-17 Thread himanshu chandola
I haven't seen a part 2; I think that post was complete on its own.


 Morpheus: Do you believe in fate, Neo?
Neo: No.
Morpheus: Why Not?
Neo: Because I don't like the idea that I'm not in control of my life.



- Original Message 
From: jiang licht licht_ji...@yahoo.com
To: common-user@hadoop.apache.org
Sent: Wed, February 17, 2010 3:26:26 AM
Subject: Re: LZO compression for Map output in Hadoop 0.20+?

Thanks Himanshu. Is there a part 2?

--
Michael

--- On Tue, 2/16/10, himanshu chandola himanshu_cool...@yahoo.com wrote:

From: himanshu chandola himanshu_cool...@yahoo.com
Subject: Re: LZO compression for Map output in Hadoop 0.20+?
To: common-user@hadoop.apache.org
Date: Tuesday, February 16, 2010, 11:35 PM

You might want to check out this:

http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

Morpheus: Do you believe in fate, Neo?
Neo: No.
Morpheus: Why Not?
Neo: Because I don't like the idea that I'm not in control of my life.



- Original Message 
From: jiang licht licht_ji...@yahoo.com
To: common-user@hadoop.apache.org
Sent: Wed, February 17, 2010 12:26:48 AM
Subject: LZO compression for Map output in Hadoop 0.20+?

   New to Hadoop (now using 0.20.1), I want to know how to choose and set up 
compression methods for Map output, especially how to configure and use LZO 
compression?

  Specifically, please share your experience for the following 2 scenarios. 
Thanks!

   (1) Is there a global setting in some hadoop configuration files for naming 
a compression method (e.g. LZO) such that it will be used to compress Map 
output by default? and how?

   (2) How to use a compression method (e.g. LZO) in Java code (I noticed that 
in the javadoc, org.apache.hadoop.mapred is labeled Deprecated)?

Thanks!
--
Michael


  


Re: Hadoop automatic job status check and notification?

2010-02-17 Thread Edward Capriolo
On Wed, Feb 17, 2010 at 1:03 PM, jiang licht licht_ji...@yahoo.com wrote:
 Amogh, this really helps me a lot! Thanks!

 So, in summary, I guess there are the following options to do job 
 notification or more generally job management stuff. I also guess Oozie / 
 cascading is the better choice when we need to handle these externally. 
 Anyway, without deep exploration of all these options, I certainly may have 
 misunderstandings. Correct me please :)

 - Prepare some external script and poll job status by communicating with 
 hadoop job [-list | -status | etc.] at a regular pace and take actions 
 accordingly. (pros: simple, cons: need to poll status, not event-driven )

 - Within a hadoop job written in java, make calls to appropriate job control 
 functions to send out job status message if want. (pros: straightforward, 
 cons: only for jobs in java)

 - Use Oozie / cascading to organize flow of hadoop jobs and other 
 housekeeping job (e.g. pull back results, cleanup, shutdown clusters, and 
 re-execute jobs against failure, etc.) (pros: powerful, can handle job 
 control outside of jobs written in java/pig, cons: learning curve?)

 - Embedded pig (pros: works for jobs in pig scripts, cons: works for jobs in 
 pig scripts)

 - What else?

 --
 Michael

 --- On Wed, 2/17/10, Amogh Vasekar am...@yahoo-inc.com wrote:

 From: Amogh Vasekar am...@yahoo-inc.com
 Subject: Re: Hadoop automatic job status check and notification?
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Date: Wednesday, February 17, 2010, 2:45 AM

 Hi,
 In our case we launched Pig from perl script and handled re-execution, 
 clean-up etc. from there. If you need to implement a workflow or DAG like 
 model, consider looking at Oozie / cascading. If you are interested in diving 
 little deeper, you can try embedded pig.

 Amogh


 On 2/17/10 1:53 PM, jiang licht licht_ji...@yahoo.com wrote:

 Thanks Amogh.

 So, I think the following will do the job:
 public void setJobEndNotificationURI(String uri)

 But what about hadoop jobs written in PIG scripts? Since PIG will take
 control, is there some convenient way to do the same thing as well?

 Thanks!
 --
 Michael

 --- On Wed, 2/17/10, Amogh Vasekar am...@yahoo-inc.com wrote:

 From: Amogh Vasekar am...@yahoo-inc.com
 Subject: Re: Hadoop automatic job status check and notification?
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Date: Wednesday, February 17, 2010, 12:44 AM

 Hi,
 When you submit a job to the cluster, you can control the behavior for 
 blocking / return using JobClient's submitJob, runJob methods. It will also 
 let you know if the job was successful or failed, so you can design your 
 follow up scripts accordingly.


 Amogh


 On 2/17/10 11:01 AM, jiang licht licht_ji...@yahoo.com wrote:

 New to Hadoop (now using 0.20.1), I want to do the following:

 Automatic status check and notification of hadoop jobs such that e.g. when a 
 job is finished, a script can be trigged so that job results can be 
 automatically pulled back to local machines and expensive hadoop cluster can 
 be released or shutdown.

 So, what is the best way to do this?

 Thanks!
 --
 Michael

Michael,

That is a pretty good summary.

Oozie and Cascading are much more advanced workflow schedulers.

For reference, I use the JobClient object
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobClient.html
to poll the jobtracker and gather the information for these graphs.

http://www.jointhegrid.com/hadoop-cacti-jtg-walk/running_job.jsp
http://www.jointhegrid.com/hadoop-cacti-jtg-walk/maps_v_reduces.jsp

This is fairly easy to do. After you get connected, you have methods
like getAllJobs() or getJobById(String s) and can further interrogate
the return objects for the information you want. In my case I am
determining what state the jobs are in to draw a graph.
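
A minimal sketch of that kind of polling, using the same JobClient API as in
the links above; the jobtracker address is an assumption for illustration:

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.JobStatus;

  public class JobStatusPoller {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf();
      // Assumption: jobtracker host/port shown for illustration only.
      conf.set("mapred.job.tracker", "jobtracker.example.com:8021");

      JobClient client = new JobClient(conf);
      for (JobStatus status : client.getAllJobs()) {
        // Run states are integers (PREP, RUNNING, SUCCEEDED, FAILED, KILLED).
        System.out.println(status.getJobID() + " state=" + status.getRunState()
            + " map=" + status.mapProgress() + " reduce=" + status.reduceProgress());
      }
      client.close();
    }
  }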

Automatic status check and notification of hadoop jobs such that e.g. when a
job is finished, a script can be triggered so that job results can be
automatically pulled back to local machines and the expensive hadoop cluster
can be released or shut down.

Based on this requirement, you could also just handle the return code
in the driver of your map reduce program and take action from there
(javax.mail, a message broker, etc.).
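
And a rough sketch of that driver-side variant: submit, wait, then act on the
outcome. The notification step is just a placeholder comment, not a
recommendation of any particular mechanism:

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RunningJob;

  public class DriverWithNotification {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(DriverWithNotification.class);
      // ... configure mapper/reducer, input/output paths here ...

      JobClient client = new JobClient(conf);
      RunningJob job = client.submitJob(conf);
      job.waitForCompletion();  // block until the job finishes

      if (job.isSuccessful()) {
        // Placeholder: pull results back, send mail, shut the cluster down, etc.
        System.out.println("Job " + job.getID() + " succeeded");
      } else {
        System.out.println("Job " + job.getID() + " failed");
      }
      client.close();
    }
  }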


Re: Why is $JAVA_HOME/lib/tools.jar in the classpath?

2010-02-17 Thread Aaron Kimball
Thomas,

What version of Hadoop are you building Debian packages for? If you're
taking Cloudera's existing debs and modifying them, these include a backport
of Sqoop (from Apache's trunk) which uses the rt tools.jar to compile
auto-generated code at runtime. Later versions of Sqoop (including the one
in the most recently-released CDH2: 0.20.1+169.56-1) include MAPREDUCE-1146
which eliminates that dependency.

- Aaron

On Tue, Feb 16, 2010 at 3:19 AM, Steve Loughran ste...@apache.org wrote:

 Thomas Koch wrote:

 Hi,

 I'm working on the Debian package for hadoop (the first version is already
 in the new queue for Debian unstable).
  Now I stumbled over $JAVA_HOME/lib/tools.jar in the classpath. Since
  Debian supports different Java runtimes, it's not that easy to know which
  one the user currently uses, and therefore it would make things easier if
  this jar were not necessary.
  From searching and inspecting the SVN history I got the impression that
  this is an ancient legacy that's not necessary (anymore)?


 I don't think hadoop core/hdfs/mapred needs it. The only place where it
 would be needed is JSP-to-Java-to-binary work, but as the JSPs are
 precompiled you can probably get away without it. Just add tests for all
 the JSPs to make sure they work.


 -steve



Question about Join.java example

2010-02-17 Thread Raymond Jennings III
Is there a typo in the Join.java example that comes with hadoop?  It has the 
line:

JobConf jobConf = new JobConf(getConf(), Sort.class);

Shouldn't that be Join.class?  Is there an equivalent example that uses the 
newer API instead of the deprecated calls?


  


Re: LZO compression for Map output in Hadoop 0.20+?

2010-02-17 Thread Arun C Murthy

Use the following knobs:

mapred.compress.map.output = true
mapred.map.output.compression.codec = org.apache.hadoop.io.compress.LzoCodec


or call

jobConf.setMapOutputCompressorClass(LzoCodec.class);


You will need the native hadoop-gpl-compression library installed on  
all machines from http://code.google.com/p/hadoop-gpl-compression/
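
A slightly fuller, hedged version of the same thing in a driver. One caveat:
depending on which build of the GPL LZO library is installed, the codec class
may be com.hadoop.compression.lzo.LzoCodec rather than the
org.apache.hadoop.io.compress name above, so check the jar you actually
deploy. For the "global default" case in question (1), the same two property
names can go into mapred-site.xml instead.

  import org.apache.hadoop.mapred.JobConf;

  public class LzoMapOutputSketch {
    public static void main(String[] args) {
      JobConf conf = new JobConf(LzoMapOutputSketch.class);

      // Compress intermediate map output.
      conf.setCompressMapOutput(true);
      // Same as setting mapred.map.output.compression.codec; the class name
      // here is an assumption that must match the installed LZO library.
      conf.set("mapred.map.output.compression.codec",
               "com.hadoop.compression.lzo.LzoCodec");

      // ... configure mapper/reducer and paths, then submit with
      // JobClient.runJob(conf).
    }
  }

Setting the codec by name rather than via setMapOutputCompressorClass() avoids
a compile-time dependency on the LZO jar, which then only needs to be present
on the cluster nodes.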


Arun


On Feb 16, 2010, at 9:26 PM, jiang licht wrote:

   New to Hadoop (now using 0.20.1), I want to know how to choose  
and set up compression methods for Map output, especially how to  
configure and use LZO compression?


  Specifically, please share your experience for the following 2  
scenarios. Thanks!


   (1) Is there a global setting in some hadoop configuration files  
for naming a compression method (e.g. LZO) such that it will be used  
to compress Map output by default? and how?


   (2) How to use a compression method (e.g. LZO) in Java code (I
noticed that in the javadoc, org.apache.hadoop.mapred is labeled
Deprecated)?


Thanks!
--
Michael







Re: MiniDFSCluster accessed via hdfs:// URL

2010-02-17 Thread Jason Rutherglen
Philip,

Thanks... I examined your patch, however I don't see the difference
between it and what I've got currently which is:

Configuration conf = new Configuration();
MiniDFSCluster dfs = new MiniDFSCluster(conf, 1, true, null);
URI uri = dfs.getFileSystem().getUri();
System.out.println("uri: " + uri);

What could be the difference?

Jason

On Tue, Feb 16, 2010 at 5:42 PM, Philip Zeyliger phi...@cloudera.com wrote:
 It is, though you have to ask it what port it's running.  See the patch in
 https://issues.apache.org/jira/browse/MAPREDUCE-987 for some code that does
 that.

 -- Philip

 On Tue, Feb 16, 2010 at 5:30 PM, Jason Rutherglen 
 jason.rutherg...@gmail.com wrote:

 Is it possible to access a MiniDFSCluster via an hdfs:// URL?  I ask
 because it seems to not work...
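
For anyone hitting the same thing: the namenode in MiniDFSCluster binds an
ephemeral port, so the hdfs:// URI has to be built from the port the cluster
reports. The exact fix Jason applied is not spelled out in this thread, so the
following is only one plausible shape of it (assumes the Hadoop test jar,
which contains MiniDFSCluster, is on the classpath):

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.MiniDFSCluster;

  public class MiniDfsUriSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
      try {
        // Ask the cluster which port the namenode actually got.
        int port = cluster.getNameNodePort();
        URI uri = URI.create("hdfs://localhost:" + port);
        FileSystem fs = FileSystem.get(uri, conf);
        System.out.println("connected to " + fs.getUri()
            + ", root exists: " + fs.exists(new Path("/")));
      } finally {
        cluster.shutdown();
      }
    }
  }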




Re: MiniDFSCluster accessed via hdfs:// URL

2010-02-17 Thread Jason Rutherglen
Ok, I got this working... Thanks Philip!

On Wed, Feb 17, 2010 at 4:01 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 Philip,

 Thanks... I examined your patch, however I don't see the difference
 between it and what I've got currently which is:

 Configuration conf = new Configuration();
 MiniDFSCluster dfs = new MiniDFSCluster(conf, 1, true, null);
 URI uri = dfs.getFileSystem().getUri();
 System.out.println("uri: " + uri);

 What could be the difference?

 Jason

 On Tue, Feb 16, 2010 at 5:42 PM, Philip Zeyliger phi...@cloudera.com wrote:
 It is, though you have to ask it what port it's running.  See the patch in
 https://issues.apache.org/jira/browse/MAPREDUCE-987 for some code that does
 that.

 -- Philip

 On Tue, Feb 16, 2010 at 5:30 PM, Jason Rutherglen 
 jason.rutherg...@gmail.com wrote:

 Is it possible to access a MiniDFSCluster via an hdfs:// URL?  I ask
 because it seems to not work...





Re: MiniDFSCluster accessed via hdfs:// URL

2010-02-17 Thread Philip Zeyliger
Out of curiosity, what was the crux of the problem?

-- Philip

On Wed, Feb 17, 2010 at 4:17 PM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 Ok, I got this working... Thanks Philip!

 On Wed, Feb 17, 2010 at 4:01 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
  Philip,
 
  Thanks... I examined your patch, however I don't see the difference
  between it and what I've got currently which is:
 
  Configuration conf = new Configuration();
  MiniDFSCluster dfs = new MiniDFSCluster(conf, 1, true, null);
  URI uri = dfs.getFileSystem().getUri();
   System.out.println("uri: " + uri);
 
  What could be the difference?
 
  Jason
 
  On Tue, Feb 16, 2010 at 5:42 PM, Philip Zeyliger phi...@cloudera.com
 wrote:
  It is, though you have to ask it what port it's running.  See the patch
 in
  https://issues.apache.org/jira/browse/MAPREDUCE-987 for some code that
 does
  that.
 
  -- Philip
 
  On Tue, Feb 16, 2010 at 5:30 PM, Jason Rutherglen 
  jason.rutherg...@gmail.com wrote:
 
  Is it possible to access a MiniDFSCluster via an hdfs:// URL?  I ask
  because it seems to not work...
 
 
 



Re: Pass the TaskId from map to Reduce

2010-02-17 Thread Don Bosco

Hi Ankit,
For your problem, you can use getJobId() in reduce(); then you will have
the unique name and can process the file in the MapReduce job.
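
A related trick that is sometimes used for this, sketched below as an
alternative (it is not what Don describes above, so treat it as an
assumption): each task can read its own attempt id from the JobConf under
mapred.task.id in configure(), which gives every map attempt a unique,
reproducible name for its side file.

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class TaskIdNamedFileMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private String sideFileName;

    public void configure(JobConf job) {
      // e.g. attempt_201002170000_0001_m_000003_0 -- unique per map attempt.
      String attemptId = job.get("mapred.task.id");
      // Hypothetical naming scheme; the reduce side would typically list the
      // containing directory rather than recompute individual names.
      sideFileName = "side-output-" + attemptId;
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // ... normal map work; the side file named above would be written in
      // close(), as in Ankit's setup.
      output.collect(value, new Text(sideFileName));
    }
  }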

ANKITBHATNAGAR wrote:
 
 Hi,
 
 I was working on a scenario in which I am generating a file in the close()
 function of my Map implementation.
 
 Since map tasks execute concurrently, this file is overwritten.
 
 I was wondering how to name this file uniquely per map execution and
 then read it in the configure() function of the reducer.
 
 I could give a task id as the name of the file, but I don't know how I will
 read the same file in configure(), as the task id will have changed.
 
 Ankit
 

-- 
View this message in context: 
http://old.nabble.com/Pass-the-TaskId-from-map-to-Reduce-tp27575531p27633914.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Pass the TaskId from map to Reduce

2010-02-17 Thread ANKITBHATNAGAR

Hi Don,
Thanks for your reply.
I already tried this approach; however, the issue I am facing is that I was
expecting all the maps to finish before any reduce starts. This is not
happening for me.
It looks like reduce starts as soon as one map finishes.
That's why I used close().
Could you tell me when the close() function is called: after every map or
after all the maps?

Am I doing something wrong?


Thanks
Ankit
-- 
View this message in context: 
http://old.nabble.com/Pass-the-TaskId-from-map-to-Reduce-tp27575531p27634001.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Hadoop Streaming File-not-found error on Cloudera's training VM

2010-02-17 Thread Dan Starr
Hi, I've tried posting this to Cloudera's community support site, but
the community website getsatisfaction.com returns various server
errors at the moment.  I believe the following is an issue related to
my environment within Cloudera's Training virtual machine.

Despite having success running Hadoop streaming on other Hadoop
clusters and on Cloudera's Training VM in local mode, I'm currently
getting an error when attempting to run a simple Hadoop streaming job
in the normal queue based mode on the Training VM.  I'm thinking the
error described below is an issue related to the worker node not
recognizing the python reference in the script's top shebang line.

The hadoop command I am executing is:

hadoop jar 
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar
-mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer
-input test_input/* -output output

Where the test_input directory contains 3 UNIX formatted, single line files:

training-vm: 3$ hadoop dfs -ls /user/training/test_input/
Found 3 items
-rw-r--r--   1 training supergroup         11 2010-02-17 10:48
/user/training/test_input/file1
-rw-r--r--   1 training supergroup         11 2010-02-17 10:48
/user/training/test_input/file2
-rw-r--r--   1 training supergroup         11 2010-02-17 10:48
/user/training/test_input/file3

training-vm: 3$ hadoop dfs -cat /user/training/test_input/*
test_line1
test_line2
test_line3

And where blah.py looks like (UNIX formatted):

#!/usr/bin/python
import sys
for line in sys.stdin:
   print line

The resulting Hadoop-Streaming error is:

java.io.IOException: Cannot run program blah.py:
java.io.IOException: error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
   ...


I get the same error when placing the python script on the HDFS, and
then using this in the hadoop command:

... -mapper hdfs:///user/training/blah.py ...


One suggestion found online, which may not be relevant to Cloudera's
distribution, mentions that the first line of the hadoop-streaming
python script (the shebang line) may not describe an applicable path
for the system.  The solution mentioned is to use: ... -mapper python
blah.py  ... in the Hadoop streaming command.  This doesn't seem to
work correctly for me, since I find that the lines from the input data
files are also parsed by the Python interpreter.  But this does reveal
that python is available on the worker node when using this technique.
 I have also tried without success the '-mapper blah.py' technique
using shebang lines: #!/usr/bin/env python, although on the training
VM Python is installed under /usr/bin/python.

Maybe the issue is something else.  Any suggestions or insights will be helpful.


Re: Hadoop Streaming File-not-found error on Cloudera's training VM

2010-02-17 Thread Todd Lipcon
Are you passing the python script to the cluster using the -file
option? eg -mapper foo.py -file foo.py

Thanks
-Todd

On Wed, Feb 17, 2010 at 7:45 PM, Dan Starr dsta...@gmail.com wrote:
 Hi, I've tried posting this to Cloudera's community support site, but
 the community website getsatisfaction.com returns various server
 errors at the moment.  I believe the following is an issue related to
 my environment within Cloudera's Training virtual machine.

 Despite having success running Hadoop streaming on other Hadoop
 clusters and on Cloudera's Training VM in local mode, I'm currently
 getting an error when attempting to run a simple Hadoop streaming job
 in the normal queue based mode on the Training VM.  I'm thinking the
 error described below is an issue related to the worker node not
 recognizing the python reference in the script's top shebang line.

 The hadoop command I am executing is:

 hadoop jar 
 /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar
 -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer
 -input test_input/* -output output

 Where the test_input directory contains 3 UNIX formatted, single line files:

 training-vm: 3$ hadoop dfs -ls /user/training/test_input/
 Found 3 items
 -rw-r--r--   1 training supergroup         11 2010-02-17 10:48
 /user/training/test_input/file1
 -rw-r--r--   1 training supergroup         11 2010-02-17 10:48
 /user/training/test_input/file2
 -rw-r--r--   1 training supergroup         11 2010-02-17 10:48
 /user/training/test_input/file3

 training-vm: 3$ hadoop dfs -cat /user/training/test_input/*
 test_line1
 test_line2
 test_line3

 And where blah.py looks like (UNIX formatted):

 #!/usr/bin/python
 import sys
 for line in sys.stdin:
    print line

 The resulting Hadoop-Streaming error is:

 java.io.IOException: Cannot run program blah.py:
 java.io.IOException: error=2, No such file or directory
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
 at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
    ...


 I get the same error when placing the python script on the HDFS, and
 then using this in the hadoop command:

 ... -mapper hdfs:///user/training/blah.py ...


 One suggestion found online, which may not be relevant to Cloudera's
 distribution, mentions that the first line of the hadoop-streaming
 python script (the shebang line) may not describe an applicable path
 for the system.  The solution mentioned is to use: ... -mapper python
 blah.py  ... in the Hadoop streaming command.  This doesn't seem to
 work correctly for me, since I find that the lines from the input data
 files are also parsed by the Python interpreter.  But this does reveal
 that python is available on the worker node when using this technique.
  I have also tried without success the '-mapper blah.py' technique
 using shebang lines: #!/usr/bin/env python, although on the training
 VM Python is installed under /usr/bin/python.

 Maybe the issue is something else.  Any suggestions or insights will be 
 helpful.



Re: Hadoop Streaming File-not-found error on Cloudera's training VM

2010-02-17 Thread Dan Starr
Yes, I have tried that when passing the script.  Just now I tried:

hadoop jar 
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar
-mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer
-input test_input/* -output output -file blah.py

And got this error for a map task:

java.io.IOException: Cannot run program blah.py:
java.io.IOException: error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
...

-Dan


On Wed, Feb 17, 2010 at 7:47 PM, Todd Lipcon t...@cloudera.com wrote:
 Are you passing the python script to the cluster using the -file
 option? eg -mapper foo.py -file foo.py

 Thanks
 -Todd

 On Wed, Feb 17, 2010 at 7:45 PM, Dan Starr dsta...@gmail.com wrote:
 Hi, I've tried posting this to Cloudera's community support site, but
 the community website getsatisfaction.com returns various server
 errors at the moment.  I believe the following is an issue related to
 my environment within Cloudera's Training virtual machine.

 Despite having success running Hadoop streaming on other Hadoop
 clusters and on Cloudera's Training VM in local mode, I'm currently
 getting an error when attempting to run a simple Hadoop streaming job
 in the normal queue based mode on the Training VM.  I'm thinking the
 error described below is an issue related to the worker node not
 recognizing the python reference in the script's top shebang line.

 The hadoop command I am executing is:

 hadoop jar 
 /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar
 -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer
 -input test_input/* -output output

 Where the test_input directory contains 3 UNIX formatted, single line files:

 training-vm: 3$ hadoop dfs -ls /user/training/test_input/
 Found 3 items
 -rw-r--r--   1 training supergroup         11 2010-02-17 10:48
 /user/training/test_input/file1
 -rw-r--r--   1 training supergroup         11 2010-02-17 10:48
 /user/training/test_input/file2
 -rw-r--r--   1 training supergroup         11 2010-02-17 10:48
 /user/training/test_input/file3

 training-vm: 3$ hadoop dfs -cat /user/training/test_input/*
 test_line1
 test_line2
 test_line3

 And where blah.py looks like (UNIX formatted):

 #!/usr/bin/python
 import sys
 for line in sys.stdin:
    print line

 The resulting Hadoop-Streaming error is:

 java.io.IOException: Cannot run program blah.py:
 java.io.IOException: error=2, No such file or directory
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
 at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
    ...


 I get the same error when placing the python script on the HDFS, and
 then using this in the hadoop command:

 ... -mapper hdfs:///user/training/blah.py ...


 One suggestion found online, which may not be relevant to Cloudera's
 distribution, mentions that the first line of the hadoop-streaming
 python script (the shebang line) may not describe an applicable path
 for the system.  The solution mentioned is to use: ... -mapper python
 blah.py  ... in the Hadoop streaming command.  This doesn't seem to
 work correctly for me, since I find that the lines from the input data
 files are also parsed by the Python interpreter.  But this does reveal
 that python is available on the worker node when using this technique.
  I have also tried without success the '-mapper blah.py' technique
 using shebang lines: #!/usr/bin/env python, although on the training
 VM Python is installed under /usr/bin/python.

 Maybe the issue is something else.  Any suggestions or insights will be 
 helpful.




Re: Hadoop Streaming File-not-found error on Cloudera's training VM

2010-02-17 Thread Dan Starr
Todd, Thanks!
This solved it.

-Dan

On Wed, Feb 17, 2010 at 8:00 PM, Todd Lipcon t...@cloudera.com wrote:
 Hi Dan,

 This is actually a bug in the release you're using. Please run:

 $ sudo apt-get update
 $ sudo apt-get install hadoop-0.20

 Then restart the daemons (or the entire VM) and give it another go.

 Thanks
 -Todd

 On Wed, Feb 17, 2010 at 7:56 PM, Dan Starr dsta...@gmail.com wrote:
 Yes, I have tried that when passing the script.  Just now I tried:

 hadoop jar 
 /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar
 -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer
 -input test_input/* -output output -file blah.py

 And got this error for a map task:

 java.io.IOException: Cannot run program blah.py:
 java.io.IOException: error=2, No such file or directory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
        at 
 org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
        at 
 org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
        ...

 -Dan


 On Wed, Feb 17, 2010 at 7:47 PM, Todd Lipcon t...@cloudera.com wrote:
 Are you passing the python script to the cluster using the -file
 option? eg -mapper foo.py -file foo.py

 Thanks
 -Todd

 On Wed, Feb 17, 2010 at 7:45 PM, Dan Starr dsta...@gmail.com wrote:
 Hi, I've tried posting this to Cloudera's community support site, but
 the community website getsatisfaction.com returns various server
 errors at the moment.  I believe the following is an issue related to
 my environment within Cloudera's Training virtual machine.

 Despite having success running Hadoop streaming on other Hadoop
 clusters and on Cloudera's Training VM in local mode, I'm currently
 getting an error when attempting to run a simple Hadoop streaming job
 in the normal queue based mode on the Training VM.  I'm thinking the
 error described below is an issue related to the worker node not
 recognizing the python reference in the script's top shebang line.

 The hadoop command I am executing is:

 hadoop jar 
 /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar
 -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer
 -input test_input/* -output output

 Where the test_input directory contains 3 UNIX formatted, single line 
 files:

 training-vm: 3$ hadoop dfs -ls /user/training/test_input/
 Found 3 items
 -rw-r--r--   1 training supergroup         11 2010-02-17 10:48
 /user/training/test_input/file1
 -rw-r--r--   1 training supergroup         11 2010-02-17 10:48
 /user/training/test_input/file2
 -rw-r--r--   1 training supergroup         11 2010-02-17 10:48
 /user/training/test_input/file3

 training-vm: 3$ hadoop dfs -cat /user/training/test_input/*
 test_line1
 test_line2
 test_line3

 And where blah.py looks like (UNIX formatted):

 #!/usr/bin/python
 import sys
 for line in sys.stdin:
    print line

 The resulting Hadoop-Streaming error is:

 java.io.IOException: Cannot run program blah.py:
 java.io.IOException: error=2, No such file or directory
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
 at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
    ...


 I get the same error when placing the python script on the HDFS, and
 then using this in the hadoop command:

 ... -mapper hdfs:///user/training/blah.py ...


 One suggestion found online, which may not be relevant to Cloudera's
 distribution, mentions that the first line of the hadoop-streaming
 python script (the shebang line) may not describe an applicable path
 for the system.  The solution mentioned is to use: ... -mapper python
 blah.py  ... in the Hadoop streaming command.  This doesn't seem to
 work correctly for me, since I find that the lines from the input data
 files are also parsed by the Python interpreter.  But this does reveal
 that python is available on the worker node when using this technique.
  I have also tried without success the '-mapper blah.py' technique
 using shebang lines: #!/usr/bin/env python, although on the training
 VM Python is installed under /usr/bin/python.

 Maybe the issue is something else.  Any suggestions or insights will be 
 helpful.






Developing cross-component patches post-split

2010-02-17 Thread tiru


-- 
View this message in context: 
http://old.nabble.com/Developing-cross-component-patches-post-split-tp27634796p27634796.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



JavaDocs for DistCp (or similar)

2010-02-17 Thread Balu Vellanki
 Hi Folks

Currently we use distCp to transfer files between two Hadoop clusters. I
have a Perl script which calls the system command "hadoop distcp" to
achieve this.

Is there a Java API to do distCp, so that we can avoid system calls from our
Java code?

Thanks
Balu


Re: JavaDocs for DistCp (or similar)

2010-02-17 Thread Tsz Wo (Nicholas), Sze
Oops, DistCp.main(..) calls System.exit(..) at the end, so it would also
terminate your Java program, which is probably not desirable.  You may still
use code similar to DistCp.main(..), as shown below.  However, these are not
stable APIs.


//DistCp.main
  public static void main(String[] args) throws Exception {
JobConf job = new JobConf(DistCp.class);
DistCp distcp = new DistCp(job);
int res = ToolRunner.run(distcp, args);
System.exit(res);
  }
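
A hedged variant that avoids the System.exit(..) problem by driving DistCp
through ToolRunner directly and handling the return code in-process (same
caveat as above: these are not stable APIs; the source and destination paths
are placeholders):

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.tools.DistCp;
  import org.apache.hadoop.util.ToolRunner;

  public class DistCpFromJava {
    public static void main(String[] args) throws Exception {
      // Same arguments you would pass on the command line.
      String[] distcpArgs = {
          "hdfs://source-namenode:8020/data/in",   // placeholder paths
          "hdfs://dest-namenode:8020/data/in"
      };

      JobConf job = new JobConf(DistCp.class);
      int res = ToolRunner.run(new DistCp(job), distcpArgs);
      if (res != 0) {
        throw new RuntimeException("distcp failed with exit code " + res);
      }
    }
  }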

Nicholas



- Original Message 
 From: Tsz Wo (Nicholas), Sze s29752-hadoopu...@yahoo.com
 To: common-user@hadoop.apache.org
 Sent: Wed, February 17, 2010 10:58:58 PM
 Subject: Re: JavaDocs for DistCp (or similar)
 
 Hi Balu,
 
 Unfortunately, DistCp does not have a public Java API.  One simple way is to 
 invoke DistCp.main(args) in your java program, where args is an array of the 
 string arguments you would pass in the command line.
 
 Hope this helps.
 Nicholas Sze
 
 
 
 
 - Original Message 
  From: Balu Vellanki 
  To: common-user@hadoop.apache.org 
  Sent: Wed, February 17, 2010 5:43:11 PM
  Subject: JavaDocs for DistCp (or similar)
  
  Hi Folks
  
  Currently we use distCp to transfer files between two hadoop clusters. I
  have a perl script which calls a system command "hadoop distcp" to
  achieve this.
  
  Is there a Java Api to do distCp, so that we can avoid system calls from our
  java code?
  
  Thanks
  Balu