Re: Reducer stuck at pending state
Hi Todd, I'm using Hadoop 0.20.1, Apache distribution. I didn't set the property you mentioned, so I think it should remain at the default (1G?). The cluster I'm playing with has four master nodes and 96 slave nodes physically. Hadoop uses one master node for the namenode and jobtracker, and picks 12 nodes for its datanodes and tasktrackers. Interestingly, I noticed the hardware specification is a little different between the master and slave machines, so I changed the namenode and jobtracker to one of the slaves. The problem seems solved (my program runs normally SO FAR). However, I cannot find the concrete hardware configuration for each node, but I guess the differences should exist mainly in the CPUs or RAM. These are copied from the cluster's specification manual:

Slaves: each with two 2.6 GHz dual-core Opteron processors, 8 GB RAM, 16 GB swap space and 50 GB of local scratch space
Masters: each with four 2.6 GHz dual-core Opteron processors, 32 GB RAM, 64 GB swap space, 64 GB of local scratch space

Can you see what the problem is? Thanks a lot. Regards, Song Liu

On Wed, Feb 17, 2010 at 4:18 AM, Todd Lipcon t...@cloudera.com wrote: Hi Song, What version are you running? How much memory have you allocated to the reducers in mapred.child.java.opts? -Todd

On Tue, Feb 16, 2010 at 4:01 PM, Song Liu lamfeeli...@gmail.com wrote: Sorry, it seems no attachment is allowed, so I paste it here:

Jobid     Priority  User    Name    Map % Complete  Map Total  Maps Completed  Reduce % Complete  Reduce Total  Reduces Completed  Job Scheduling Information
job_2...  NORMAL    sl9885  TF/IDF  100.00%         26         26              0.00%              1             0                  NA
job_2...  NORMAL    sl9885  Rank    100.00%         22         22              0.00%              1             0                  NA
job_2...  NORMAL    sl9885  TF/IDF  100.00%         20         20              0.00%              1             0                  NA

The format is horrible, sorry for that, but it's the best I can do :( BTW, I guess it should not be my program's problem, since I have tested it on some other clusters before. Regards, Song Liu

On Tue, Feb 16, 2010 at 11:51 PM, Song Liu lamfeeli...@gmail.com wrote: Hi all, I recently met a problem where, sometimes, a reducer hangs in the pending state at 0% complete. It seems all the mappers are completely done, and just when the reducer is about to start, it gets stuck, staying in the pending state without any warnings or errors. I have a cluster with 12 nodes, but this situation only appears when the scale of data is large (2 GB or more); smaller cases never hit this problem. Has anyone met this issue before? I searched JIRA; someone reported this issue before, but no solution was given. ( https://issues.apache.org/jira/browse/MAPREDUCE-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12647230#action_12647230 ) The typical case of this issue is captured in the attachment. Regards, Song Liu
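As a point of reference for the memory question above, here is a minimal sketch (assuming the stock 0.20 mapred API) of setting the child-task heap in a job driver. The class name and the -Xmx value are only illustrations, not recommendations from this thread.

import org.apache.hadoop.mapred.JobConf;

public class ChildHeapExample {
    public static void main(String[] args) {
        // "mapred.child.java.opts" sets the JVM options for map and reduce
        // child tasks in 0.20; the -Xmx value below is only an example.
        JobConf conf = new JobConf(ChildHeapExample.class);
        conf.set("mapred.child.java.opts", "-Xmx1024m");
        System.out.println(conf.get("mapred.child.java.opts"));
    }
}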
Need your Help sir
Dear sir, I want your help. I want to deploy Hadoop core using Eclipse. Hadoop core is now divided into hadoop-common, hadoop-hdfs, and hadoop-mapreduce. I have tried many times: hadoop-common and hadoop-mapreduce build successfully, and hadoop-hdfs also builds successfully. My doubt is this: when I build the hadoop-mapreduce project, it automatically creates the jar file hadoop-mapred-0.22.0-SNAPSHOT.jar, but when I build the hadoop-hdfs project, it does not create a jar file inside the build folder. Why? Please help me. Thank you. With regards, VTM
Issue with Hadoop cluster on Amazon ec2
Hi, We have deployed a Hadoop cluster on EC2, Hadoop version 0.20.1, with a couple of datanodes. We want to get some files from a datanode on an Amazon EC2 instance to our local machine using a Java application, which in turn uses SequenceFile.Reader to read the file. The problem is that Amazon uses private IPs for host communication, but to connect from an environment other than Amazon we have to use the public IPs. So when we try to reach the datanodes via the namenode, it reports the datanodes' private IPs, and using those we are not able to reach the datanodes. Is there any way we can set the namenode to send the datanodes' public NAT IPs rather than the internal IPs, or is there any other workaround to overcome this problem? Thanks, Viral.
Re: Issue with Hadoop cluster on Amazon ec2
viral shah wrote: Hi, We have deployed a Hadoop cluster on EC2, Hadoop version 0.20.1, with a couple of datanodes. We want to get some files from a datanode on an Amazon EC2 instance to our local machine using a Java application, which in turn uses SequenceFile.Reader to read the file. The problem is that Amazon uses private IPs for host communication, but to connect from an environment other than Amazon we have to use the public IPs. So when we try to reach the datanodes via the namenode, it reports the datanodes' private IPs, and using those we are not able to reach the datanodes.

That's a feature to stop you accidentally exporting your entire HDFS filesystem to the rest of the world.

Is there any way we can set the namenode to send the datanodes' public NAT IPs rather than the internal IPs, or is there any other workaround to overcome this problem?

Push the data up to the S3 filestore first, and have the job sequence start from S3 and finish there too.
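For the S3 route suggested above, a minimal sketch of pulling a result file down through Hadoop's built-in s3n filesystem (the bucket name, credentials, and paths are placeholders, not values from this thread):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3CopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials; normally these live in core-site.xml.
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        // Open the bucket through the S3 native filesystem and copy one
        // result file down to the local machine outside EC2.
        FileSystem s3 = FileSystem.get(URI.create("s3n://my-bucket/"), conf);
        s3.copyToLocalFile(new Path("s3n://my-bucket/output/part-00000"),
                           new Path("/tmp/part-00000"));
    }
}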
Difficulty connecting Hadoop JMX service
I want to monitor my Hadoop cluster services using the check_jmx Nagios plugin. I use the following environment variables in the hadoop-env.sh file:

export HADOOP_OPTS="-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS -Dcom.sun.management.jmxremote.port=8004"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS -Dcom.sun.management.jmxremote.port=8005"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS -Dcom.sun.management.jmxremote.port=8006"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS -Dcom.sun.management.jmxremote.port=8007"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS -Dcom.sun.management.jmxremote.port=8008"
export HADOOP_TASKTRACKER_OPTS="-Dcom.sun.management.jmxremote.port=8009"

The problem I am facing is that my Hadoop machine is behind a firewall and I can't open multiple ports. The JMX RMI connector opens two ports: one is for the RMI registry, and it's the port that you usually supply with the -Dcom.sun.management.jmxremote.port=<port> property. The other port is used to export JMX RMI connection objects, and it is usually allocated dynamically at random. So I am not able to connect using JConsole or the check_jmx plugin. I tried the example provided at http://blogs.sun.com/jmxetc/entry/connecting_through_firewall_using_jmx by changing the environment variable like this:

export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote -Dexample.rmi.agent.port=3000 -javaagent:/root/install/asl-hadoop-0.20.1/lib/CustomAgent.jar $HADOOP_NAMENODE_OPTS"

I created the CustomAgent.jar file following the above-mentioned blog entry. Then, when I start the Hadoop cluster using bin/start-all.sh, I get the following error:

root/install/asl-hadoop-0.20.1/bin/hadoop-daemon.sh: line 96: 8983 Aborted nohup nice -n $HADOOP_NICENESS $HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR $command "$@" > $log 2>&1 < /dev/null
Create RMI registry on port 3000
Get the platform's MBean server
Initialize the environment map
Create an RMI connector server
Start the RMI connector server on port 3000
service:jmx:rmi://domU-12-31-38-00-B4-F8:3000/jndi/rmi://domU-12-31-38-00-B4-F8:3000/jmxrmi
Create RMI registry on port 3000
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
localhost: starting datanode, logging to /root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-datanode-domU-12-31-38-00-B4-F8.out
localhost: starting secondarynamenode, logging to /root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-secondarynamenode-domU-12-31-38-00-B4-F8.out
starting jobtracker, logging to /root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-jobtracker-domU-12-31-38-00-B4-F8.out
localhost: starting tasktracker, logging to /root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-tasktracker-domU-12-31-38-00-B4-F8.out

Can someone help me with what I am doing wrong? Thanks, Viral.
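For reference, a minimal sketch of the kind of premain agent the blog entry above describes, pinning both the RMI registry and the exported connector objects to one fixed port. The class name and the example.rmi.agent.port property are the ones used in this thread; the code itself is only an assumption-based sketch, not a verified copy of the blog's agent.

import java.lang.management.ManagementFactory;
import java.net.InetAddress;
import java.rmi.registry.LocateRegistry;
import javax.management.MBeanServer;
import javax.management.remote.JMXConnectorServer;
import javax.management.remote.JMXConnectorServerFactory;
import javax.management.remote.JMXServiceURL;

public class CustomAgent {
    public static void premain(String agentArgs) throws Exception {
        // Port comes from -Dexample.rmi.agent.port=3000 on the daemon's
        // command line; default to 3000 if the property is absent.
        int port = Integer.parseInt(System.getProperty("example.rmi.agent.port", "3000"));
        String host = InetAddress.getLocalHost().getHostName();

        System.out.println("Create RMI registry on port " + port);
        LocateRegistry.createRegistry(port);

        System.out.println("Get the platform's MBean server");
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();

        // Both the registry lookup and the exported connector objects use
        // the same fixed port, so only one port has to be opened in the firewall.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi://" + host + ":" + port +
            "/jndi/rmi://" + host + ":" + port + "/jmxrmi");

        System.out.println("Create an RMI connector server");
        JMXConnectorServer cs =
            JMXConnectorServerFactory.newJMXConnectorServer(url, null, mbs);

        System.out.println("Start the RMI connector server on port " + port);
        cs.start();
        System.out.println(cs.getAddress());
    }
}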
Re: Difficulty connecting Hadoop JMX service
On Wed, Feb 17, 2010 at 11:22 AM, viral shah viral21...@gmail.com wrote: I want to monitor my hadoop cluster services using check_jmx nagios plugin. I use following env. variables in the hadoop-env.sh file export HADOOP_OPTS=”-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false” # Command specific options appended to HADOOP_OPTS when specified export HADOOP_NAMENODE_OPTS=”-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS -Dcom.sun.management.jmxremote.port=8004″ export HADOOP_SECONDARYNAMENODE_OPTS=”-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS -Dcom.sun.management.jmxremote.port=8005″ export HADOOP_DATANODE_OPTS=”-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS -Dcom.sun.management.jmxremote.port=8006″ export HADOOP_BALANCER_OPTS=”-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS -Dcom.sun.management.jmxremote.port=8007″ export HADOOP_JOBTRACKER_OPTS=”-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS -Dcom.sun.management.jmxremote.port=8008″ export HADOOP_TASKTRACKER_OPTS=”-Dcom.sun.management.jmxremote.port=8009″ but the problem I am facing is that my hadoop machine is behind firewall and I can't open multiple ports. The JMX RMI connector opens two ports: one is for the RMI registry, and it's the port that you usually supply with the -Dcom.sun.management.jmxremote.port=port property. The other port is used to export JMX RMI connection objects. This second port is usually dynamically allocated at random. So I am not able to connect using Jconsole or check_jmx plugin. I tried using example provided at * http://blogs.sun.com/jmxetc/entry/connecting_through_firewall_using_jmx*, by changing env. variable like this export HADOOP_NAMENODE_OPTS=-Dcom.sun.management.jmxremote -Dexample.rmi.agent.port=3000 -javaagent:/root/install/asl-hadoop-0.20.1/lib/CustomAgent.jar $HADOOP_NAMENODE_OPTS The CustomAgent.jar file I created using above mentioned blog entry. Then when I start hadoop cluster using bin/start-all.sh I get following error. root/install/asl-hadoop-0.20.1/bin/hadoop-daemon.sh: line 96: 8983 Aborted nohup nice -n $HADOOP_NICENESS $HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR $command $@ $log 21 /dev/null Create RMI registry on port 3000 Get the platform's MBean server Initialize the environment map Create an RMI connector server Start the RMI connector server on port 3000 service:jmx:rmi://domU-12-31-38-00-B4-F8:3000/jndi/rmi://domU-12-31-38-00-B4-F8:3000/jmxrmi Create RMI registry on port 3000 Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) localhost: starting datanode, logging to /root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-datanode-domU-12-31-38-00-B4-F8.out localhost: starting secondarynamenode, logging to /root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-secondarynamenode-domU-12-31-38-00-B4-F8.out starting jobtracker, logging to /root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-jobtracker-domU-12-31-38-00-B4-F8.out localhost: starting tasktracker, logging to /root/install/asl-hadoop-0.20.1/bin/../logs/hadoop-root-tasktracker-domU-12-31-38-00-B4-F8.out can some help me, what I am doing wrong. Thanks, Viral. Yikes, That is a rather hairy problem. One possible work around, if you use SSL you might sidestep the RMI issues. 
(Do not quote me on that.) http://java.sun.com/j2se/1.5.0/docs/guide/management/agent.html I always chose to do my monitoring from the same subnet as the cluster to forgo the NAT issues. Please do post your findings, as this would be helpful to everyone doing JMX monitoring.
Re: Hadoop automatic job status check and notification?
Amogh, this really helps me a lot! Thanks! So, in summary, I guess there are the following options for job notification or, more generally, job management. I also guess Oozie / cascading is the better choice when we need to handle these externally. Anyway, without deep exploration of all these options, I may well have misunderstandings. Correct me please :)

- Prepare some external script and poll job status by communicating with hadoop job [-list | -status | etc.] at a regular pace, and take actions accordingly. (pros: simple; cons: need to poll status, not event-driven)
- Within a Hadoop job written in Java, make calls to appropriate job control functions to send out job status messages if wanted. (pros: straightforward; cons: only for jobs in Java)
- Use Oozie / cascading to organize the flow of Hadoop jobs and other housekeeping jobs (e.g. pull back results, clean up, shut down clusters, re-execute jobs on failure, etc.). (pros: powerful, can handle job control outside of jobs written in Java/Pig; cons: learning curve?)
- Embedded Pig. (pros: works for jobs in Pig scripts; cons: works only for jobs in Pig scripts)
- What else?

-- Michael

--- On Wed, 2/17/10, Amogh Vasekar am...@yahoo-inc.com wrote: From: Amogh Vasekar am...@yahoo-inc.com Subject: Re: Hadoop automatic job status check and notification? To: common-user@hadoop.apache.org Date: Wednesday, February 17, 2010, 2:45 AM Hi, In our case we launched Pig from a Perl script and handled re-execution, clean-up, etc. from there. If you need to implement a workflow or DAG-like model, consider looking at Oozie / cascading. If you are interested in diving a little deeper, you can try embedded Pig. Amogh

On 2/17/10 1:53 PM, jiang licht licht_ji...@yahoo.com wrote: Thanks Amogh. So, I think the following will do the job: public void setJobEndNotificationURI(String uri). But what about Hadoop jobs written in Pig scripts? Since Pig takes control, is there some convenient way to do the same thing as well? Thanks! -- Michael

--- On Wed, 2/17/10, Amogh Vasekar am...@yahoo-inc.com wrote: From: Amogh Vasekar am...@yahoo-inc.com Subject: Re: Hadoop automatic job status check and notification? To: common-user@hadoop.apache.org Date: Wednesday, February 17, 2010, 12:44 AM Hi, When you submit a job to the cluster, you can control the blocking / return behavior using JobClient's submitJob and runJob methods. It will also let you know whether the job succeeded or failed, so you can design your follow-up scripts accordingly. Amogh

On 2/17/10 11:01 AM, jiang licht licht_ji...@yahoo.com wrote: New to Hadoop (now using 0.20.1), I want to do the following: automatic status check and notification of Hadoop jobs such that, e.g., when a job is finished, a script can be triggered so that job results can be automatically pulled back to local machines and the expensive Hadoop cluster can be released or shut down. So, what is the best way to do this? Thanks! -- Michael
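To make the notification option above concrete, a minimal driver sketch using the 0.20 mapred API's setJobEndNotificationURI mentioned in the thread. The class name and the URL are placeholders; the $jobId / $jobStatus substitution is how the notification URL is documented to behave, but treat the details as something to verify against your version.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class NotifyingDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(NotifyingDriver.class);
        // Placeholder endpoint: the JobTracker issues an HTTP GET to this
        // URL when the job finishes; $jobId and $jobStatus are expanded.
        conf.setJobEndNotificationURI(
            "http://example.com/jobdone?id=$jobId&status=$jobStatus");
        // ... set input/output paths, mapper, and reducer here ...
        JobClient.runJob(conf);  // blocks until the job completes
    }
}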
Re: LZO compression for Map output in Hadoop 0.20+?
Haven't seen the part 2. I think this was complete. Morpheus: Do you believe in fate, Neo? Neo: No. Morpheus: Why Not? Neo: Because I don't like the idea that I'm not in control of my life. - Original Message From: jiang licht licht_ji...@yahoo.com To: common-user@hadoop.apache.org Sent: Wed, February 17, 2010 3:26:26 AM Subject: Re: LZO compression for Map output in Hadoop 0.20+? Thanks Himanshu. Is there a part 2? -- Michael --- On Tue, 2/16/10, himanshu chandola himanshu_cool...@yahoo.com wrote: From: himanshu chandola himanshu_cool...@yahoo.com Subject: Re: LZO compression for Map output in Hadoop 0.20+? To: common-user@hadoop.apache.org Date: Tuesday, February 16, 2010, 11:35 PM You might want to check out this: http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/ Morpheus: Do you believe in fate, Neo? Neo: No. Morpheus: Why Not? Neo: Because I don't like the idea that I'm not in control of my life. - Original Message From: jiang licht licht_ji...@yahoo.com To: common-user@hadoop.apache.org Sent: Wed, February 17, 2010 12:26:48 AM Subject: LZO compression for Map output in Hadoop 0.20+? New to Hadoop (now using 0.20.1), I want to know how to choose and set up compression methods for Map output, especially how to configure and use LZO compression? Specifically, please share your experience for the following 2 scenarios. Thanks! (1) Is there a global setting in some hadoop configuration files for naming a compression method (e.g. LZO) such that it will be used to compress Map output by default? and how? (2) How to use a compression method (e.g. LZO) in java code (I noticed that in javadoc, org.apache.hadoop.mapred is labeld Deprecated)? Thanks! -- Michael
Re: Hadoop automatic job status check and notification?
On Wed, Feb 17, 2010 at 1:03 PM, jiang licht licht_ji...@yahoo.com wrote: Amogh, this really helps me a lot! Thanks! So, in summary, I guess there are the following options to do job notification or more generally job management stuff. I also guess Oozie / cascading is the better choice when we need to handle these externally. Anyway, without deep exploration of all these options, I certainly may have misunderstandings. Correct me please :) - Prepare some external script and poll job status by communicating with hadoop job [-list | -status | etc.] at a regular pace and take actions accordingly. (pros: simple, cons: need to poll status, not event-driven ) - Within a hadoop job written in java, make calls to appropriate job control functions to send out job status message if want. (pros: straightforward, cons: only for jobs in java) - Use Oozie / cascading to organize flow of hadoop jobs and other housekeeping job (e.g. pull back results, cleanup, shutdown clusters, and re-execute jobs against failure, etc.) (pros: powerful, can handle job control outside of jobs written in java/pig, cons: learning curve?) - Embedded pig (pros: works for jobs in pig scripts, cons: works for jobs in pig scripts) - What else? -- Michael --- On Wed, 2/17/10, Amogh Vasekar am...@yahoo-inc.com wrote: From: Amogh Vasekar am...@yahoo-inc.com Subject: Re: Hadoop automatic job status check and notification? To: common-user@hadoop.apache.org common-user@hadoop.apache.org Date: Wednesday, February 17, 2010, 2:45 AM Hi, In our case we launched Pig from perl script and handled re-execution, clean-up etc. from there. If you need to implement a workflow or DAG like model, consider looking at Oozie / cascading. If you are interested in diving little deeper, you can try embedded pig. Amogh On 2/17/10 1:53 PM, jiang licht licht_ji...@yahoo.com wrote: Thanks Amogh. So, I think the following will do the job: public void setJobEndNotificationURI(String uri)But what about hadoop jobs written in PIG scripts? Since PIG will take control, is there some convenient way to do the same thing as well? Thanks! -- Michael --- On Wed, 2/17/10, Amogh Vasekar am...@yahoo-inc.com wrote: From: Amogh Vasekar am...@yahoo-inc.com Subject: Re: Hadoop automatic job status check and notification? To: common-user@hadoop.apache.org common-user@hadoop.apache.org Date: Wednesday, February 17, 2010, 12:44 AM Hi, When you submit a job to the cluster, you can control the behavior for blocking / return using JobClient's submitJob, runJob methods. It will also let you know if the job was successful or failed, so you can design your follow up scripts accordingly. Amogh On 2/17/10 11:01 AM, jiang licht licht_ji...@yahoo.com wrote: New to Hadoop (now using 0.20.1), I want to do the following: Automatic status check and notification of hadoop jobs such that e.g. when a job is finished, a script can be trigged so that job results can be automatically pulled back to local machines and expensive hadoop cluster can be released or shutdown. So, what is the best way to do this? Thanks! -- Michael Michael, That is a pretty good summary. Ozzie, cascading, are much more advanced work flow schedulers. For reference, I use the JobClient object http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobClient.html to poll the jobtracker and gather the information for these graphs. http://www.jointhegrid.com/hadoop-cacti-jtg-walk/running_job.jsp http://www.jointhegrid.com/hadoop-cacti-jtg-walk/maps_v_reduces.jsp This is fairly easy to do. 
After you get connected, you have methods like getAllJobs() or getJobById(String s) and can further interrogate the returned objects for the information you want. In my case I am determining what state the jobs are in so I can draw a graph.

You wrote: "Automatic status check and notification of hadoop jobs such that e.g. when a job is finished, a script can be triggered so that job results can be automatically pulled back to local machines and the expensive hadoop cluster can be released or shut down." Based on this requirement, you could also just handle the return code in the driver of your MapReduce program and take action from there (javax.mail, a message broker, etc.).
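A minimal sketch of that polling approach with the 0.20 mapred API. The JobTracker host/port and what is done with each state are placeholders, not values from this thread.

import java.net.InetSocketAddress;

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;

public class JobPoller {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // Placeholder JobTracker address; normally taken from mapred.job.tracker.
        JobClient client =
            new JobClient(new InetSocketAddress("jobtracker.example.com", 9001), conf);

        // Ask the JobTracker for every job it knows about and bucket each
        // one by run state (the same information the graphs above use).
        for (JobStatus status : client.getAllJobs()) {
            switch (status.getRunState()) {
                case JobStatus.RUNNING:
                    System.out.println(status.getJobID() + " is running");
                    break;
                case JobStatus.SUCCEEDED:
                    System.out.println(status.getJobID() + " finished: a copy-back script could be triggered here");
                    break;
                case JobStatus.FAILED:
                    System.out.println(status.getJobID() + " failed");
                    break;
                default:
                    System.out.println(status.getJobID() + " is pending");
            }
        }
        client.close();
    }
}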
Re: Why is $JAVA_HOME/lib/tools.jar in the classpath?
Thomas, What version of Hadoop are you building Debian packages for? If you're taking Cloudera's existing debs and modifying them, these include a backport of Sqoop (from Apache's trunk) which uses tools.jar to compile auto-generated code at runtime. Later versions of Sqoop (including the one in the most recently released CDH2: 0.20.1+169.56-1) include MAPREDUCE-1146, which eliminates that dependency. - Aaron

On Tue, Feb 16, 2010 at 3:19 AM, Steve Loughran ste...@apache.org wrote: Thomas Koch wrote: Hi, I'm working on the Debian package for Hadoop (the first version is already in the NEW queue for Debian unstable). Now I stumbled over $JAVA_HOME/lib/tools.jar in the classpath. Since Debian supports different Java runtimes, it's not that easy to know which one the user currently uses, and therefore it would make things easier if this jar were not necessary. From searching and inspecting the SVN history I got the impression that this is an ancient legacy that's not necessary (anymore)? I don't think Hadoop core/hdfs/mapred needs it. The only place where it would be needed is JSP-to-Java compilation work, but as the JSPs are precompiled you can probably get away without it. Just add tests for all the JSPs to make sure they work. -steve
Question about Join.java example
Is there a typo in the Join.java example that comes with hadoop? It has the line: JobConf jobConf = new JobConf(getConf(), Sort.class); Shouldn't that be Join.class ? Is there an equivalent example that uses the later API instead of the deprecated calls?
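For what it's worth, a minimal sketch of how that driver line is usually written, with the job's own class passed in so Hadoop can locate the containing jar. This assumes the deprecated 0.20 mapred API that the bundled example uses, and the surrounding driver code is omitted.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;

public class Join extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // Passing Join.class (rather than Sort.class) tells Hadoop which jar
        // to ship to the cluster for this job.
        JobConf jobConf = new JobConf(getConf(), Join.class);
        // ... configure input/output formats, mapper, reducer, then submit ...
        return 0;
    }
}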
Re: LZO compression for Map output in Hadoop 0.20+?
Use the following knobs:

mapred.compress.map.output = true
mapred.map.output.compression.codec = org.apache.hadoop.io.compress.LzoCodec

or call jobConf.setMapOutputCompressorClass(LzoCodec.class); You will need the native hadoop-gpl-compression library, installed on all machines, from http://code.google.com/p/hadoop-gpl-compression/ Arun

On Feb 16, 2010, at 9:26 PM, jiang licht wrote: New to Hadoop (now using 0.20.1), I want to know how to choose and set up compression methods for Map output, especially how to configure and use LZO compression? Specifically, please share your experience for the following 2 scenarios. Thanks! (1) Is there a global setting in some hadoop configuration files for naming a compression method (e.g. LZO) such that it will be used to compress Map output by default? And how? (2) How do I use a compression method (e.g. LZO) in Java code (I noticed that in the javadoc, org.apache.hadoop.mapred is labeled Deprecated)? Thanks! -- Michael
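A minimal driver sketch of those two options, assuming the codec class name given above is on the classpath (depending on the Hadoop/LZO packaging it may instead live in the hadoop-gpl-compression jar); the job class is a placeholder.

import org.apache.hadoop.io.compress.LzoCodec;
import org.apache.hadoop.mapred.JobConf;

public class LzoMapOutputExample {
    public static void main(String[] args) {
        JobConf jobConf = new JobConf(LzoMapOutputExample.class);
        // Turn on compression of intermediate map output
        // (equivalent to mapred.compress.map.output = true)...
        jobConf.setCompressMapOutput(true);
        // ...and pick LZO as the codec (equivalent to setting
        // mapred.map.output.compression.codec).
        jobConf.setMapOutputCompressorClass(LzoCodec.class);
        // ... configure and submit the job as usual ...
    }
}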
Re: MiniDFSCluster accessed via hdfs:// URL
Philip, Thanks... I examined your patch; however, I don't see the difference between it and what I've got currently, which is:

Configuration conf = new Configuration();
MiniDFSCluster dfs = new MiniDFSCluster(conf, 1, true, null);
URI uri = dfs.getFileSystem().getUri();
System.out.println("uri: " + uri);

What could be the difference? Jason

On Tue, Feb 16, 2010 at 5:42 PM, Philip Zeyliger phi...@cloudera.com wrote: It is, though you have to ask it what port it's running on. See the patch in https://issues.apache.org/jira/browse/MAPREDUCE-987 for some code that does that. -- Philip On Tue, Feb 16, 2010 at 5:30 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Is it possible to access a MiniDFSCluster via an hdfs:// URL? I ask because it seems to not work...
Re: MiniDFSCluster accessed via hdfs:// URL
Ok, I got this working... Thanks Philip! On Wed, Feb 17, 2010 at 4:01 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Philip, Thanks... I examined your patch, however I don't see the difference between it and what I've got currently which is: Configuration conf = new Configuration(); MiniDFSCluster dfs = new MiniDFSCluster(conf, 1, true, null); URI uri = dfs.getFileSystem().getUri(); System.out.println(uri: + uri); What could be the difference? Jason On Tue, Feb 16, 2010 at 5:42 PM, Philip Zeyliger phi...@cloudera.com wrote: It is, though you have to ask it what port it's running. See the patch in https://issues.apache.org/jira/browse/MAPREDUCE-987 for some code that does that. -- Philip On Tue, Feb 16, 2010 at 5:30 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Is it possible to access a MiniDFSCluster via an hdfs:// URL? I ask because it seems to not work...
Re: MiniDFSCluster accessed via hdfs:// URL
Out of curiosity, what was the crux of the problem? -- Philip On Wed, Feb 17, 2010 at 4:17 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Ok, I got this working... Thanks Philip! On Wed, Feb 17, 2010 at 4:01 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Philip, Thanks... I examined your patch, however I don't see the difference between it and what I've got currently which is: Configuration conf = new Configuration(); MiniDFSCluster dfs = new MiniDFSCluster(conf, 1, true, null); URI uri = dfs.getFileSystem().getUri(); System.out.println(uri: + uri); What could be the difference? Jason On Tue, Feb 16, 2010 at 5:42 PM, Philip Zeyliger phi...@cloudera.com wrote: It is, though you have to ask it what port it's running. See the patch in https://issues.apache.org/jira/browse/MAPREDUCE-987 for some code that does that. -- Philip On Tue, Feb 16, 2010 at 5:30 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Is it possible to access a MiniDFSCluster via an hdfs:// URL? I ask because it seems to not work...
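For anyone hitting the same thing, a minimal sketch of one way to build an hdfs:// URL from a running MiniDFSCluster using its getNameNodePort() accessor; the thread doesn't spell out Jason's exact root cause, so this is only an assumption-based illustration.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class MiniDfsUrlExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
        try {
            // The mini-cluster picks a free port at startup; ask it which one.
            int port = cluster.getNameNodePort();
            URI hdfsUri = URI.create("hdfs://localhost:" + port + "/");
            System.out.println("uri: " + hdfsUri);

            // Any client configured with this URI can now talk to the cluster.
            FileSystem fs = FileSystem.get(hdfsUri, conf);
            System.out.println("root exists: " + fs.exists(new Path("/")));
        } finally {
            cluster.shutdown();
        }
    }
}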
Re: Pass the TaskId from map to Reduce
Hi Ankit, For your problem, you can use getJobId() in reduce(); then you will have the unique name and you can process the file in the map reduce. ANKITBHATNAGAR wrote: Hi, I was working on a scenario wherein I am generating a file in the close() function of my Map implementation. Since map executions run concurrently, this file is overwritten. I was wondering how to name this file uniquely on a per-map-execution basis and then read it in the configure() function of reduce. I could give a task id as the name of the file, but I don't know how I would read the same file in configure(), as the task id would have changed. Ankit -- View this message in context: http://old.nabble.com/Pass-the-TaskId-from-map-to-Reduce-tp27575531p27633914.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
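One common way to get a unique per-task name (a sketch of an alternative to the getJobId() suggestion above, assuming the 0.20 mapred API; the side-file path is a placeholder) is to read the task attempt id from the JobConf in configure():

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SideFileMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private JobConf conf;
    private String taskId;

    @Override
    public void configure(JobConf conf) {
        this.conf = conf;
        // "mapred.task.id" holds the task attempt id (e.g.
        // attempt_201002170001_0001_m_000003_0), unique per map attempt.
        this.taskId = conf.get("mapred.task.id");
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        out.collect(value, new Text(taskId));
    }

    @Override
    public void close() throws IOException {
        // Name the side file after the attempt id so concurrent map tasks
        // never overwrite each other; the directory here is only a placeholder.
        Path side = new Path("/tmp/sidefiles/" + taskId);
        FileSystem fs = FileSystem.get(conf);
        fs.create(side).close();
    }
}

The reducer can then list the side-file directory in its own configure() and pick up every per-task file, rather than having to know a single task id in advance.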
Re: Pass the TaskId from map to Reduce
Hi Don, Thanks for your reply. I already tried this approach; however, the issue that I am facing is that I was expecting all the maps to finish before any reduce starts. This is not happening for me. It looks like as soon as one map finishes, a reduce starts. That's why I called close(). Could you tell me when the close() function is called: after every map or after all the maps? Am I doing something wrong? Thanks, Ankit -- View this message in context: http://old.nabble.com/Pass-the-TaskId-from-map-to-Reduce-tp27575531p27634001.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Hadoop Streaming File-not-found error on Cloudera's training VM
Hi, I've tried posting this to Cloudera's community support site, but the community website getsatisfaction.com returns various server errors at the moment. I believe the following is an issue related to my environment within Cloudera's Training virtual machine. Despite having success running Hadoop streaming on other Hadoop clusters and on Cloudera's Training VM in local mode, I'm currently getting an error when attempting to run a simple Hadoop streaming job in the normal queue-based mode on the Training VM. I'm thinking the error described below is an issue related to the worker node not recognizing the python reference in the script's top shebang line. The hadoop command I am executing is:

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input test_input/* -output output

Where the test_input directory contains 3 UNIX-formatted, single-line files:

training-vm: 3$ hadoop dfs -ls /user/training/test_input/
Found 3 items
-rw-r--r-- 1 training supergroup 11 2010-02-17 10:48 /user/training/test_input/file1
-rw-r--r-- 1 training supergroup 11 2010-02-17 10:48 /user/training/test_input/file2
-rw-r--r-- 1 training supergroup 11 2010-02-17 10:48 /user/training/test_input/file3
training-vm: 3$ hadoop dfs -cat /user/training/test_input/*
test_line1
test_line2
test_line3

And where blah.py looks like (UNIX formatted):

#!/usr/bin/python
import sys
for line in sys.stdin:
    print line

The resulting Hadoop streaming error is:

java.io.IOException: Cannot run program blah.py: java.io.IOException: error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
    at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
    ...

I get the same error when placing the python script on HDFS and then using this in the hadoop command:

... -mapper hdfs:///user/training/blah.py ...

One suggestion found online, which may not be relevant to Cloudera's distribution, mentions that the first line of the hadoop-streaming python script (the shebang line) may not describe an applicable path for the system. The solution mentioned is to use:

... -mapper "python blah.py" ...

in the Hadoop streaming command. This doesn't seem to work correctly for me, since I find that the lines from the input data files are also parsed by the Python interpreter. But this does reveal that python is available on the worker node when using this technique. I have also tried without success the '-mapper blah.py' technique using the shebang line #!/usr/bin/env python, although on the training VM Python is installed under /usr/bin/python. Maybe the issue is something else. Any suggestions or insights will be helpful.
Re: Hadoop Streaming File-not-found error on Cloudera's training VM
Are you passing the python script to the cluster using the -file option? eg -mapper foo.py -file foo.py Thanks -Todd On Wed, Feb 17, 2010 at 7:45 PM, Dan Starr dsta...@gmail.com wrote: Hi, I've tried posting this to Cloudera's community support site, but the community website getsatisfaction.com returns various server errors at the moment. I believe the following is an issue related to my environment within Cloudera's Training virtual machine. Despite having success running Hadoop streaming on other Hadoop clusters and on Cloudera's Training VM in local mode, I'm currently getting an error when attempting to run a simple Hadoop streaming job in the normal queue based mode on the Training VM. I'm thinking the error described below is an issue related to the worker node not recognizing the python reference in the script's top shebang line. The hadoop command I am executing is: hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input test_input/* -output output Where the test_input directory contains 3 UNIX formatted, single line files: training-vm: 3$ hadoop dfs -ls /user/training/test_input/ Found 3 items -rw-r--r-- 1 training supergroup 11 2010-02-17 10:48 /user/training/test_input/file1 -rw-r--r-- 1 training supergroup 11 2010-02-17 10:48 /user/training/test_input/file2 -rw-r--r-- 1 training supergroup 11 2010-02-17 10:48 /user/training/test_input/file3 training-vm: 3$ hadoop dfs -cat /user/training/test_input/* test_line1 test_line2 test_line3 And where blah.py looks like (UNIX formatted): #!/usr/bin/python import sys for line in sys.stdin: print line The resulting Hadoop-Streaming error is: java.io.IOException: Cannot run program blah.py: java.io.IOException: error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214) ... I get the same error when placing the python script on the HDFS, and then using this in the hadoop command: ... -mapper hdfs:///user/training/blah.py ... One suggestion found online, which may not be relevant to Cloudera's distribution, mentions that the first line of the hadoop-streaming python script (the shebang line) may not describe an applicable path for the system. The solution mentioned is to use: ... -mapper python blah.py ... in the Hadoop streaming command. This doesn't seem to work correctly for me, since I find that the lines from the input data files are also parsed by the Python interpreter. But this does reveal that python is available on the worker node when using this technique. I have also tried without success the '-mapper blah.py' technique using shebang lines: #!/usr/bin/env python, although on the training VM Python is installed under /usr/bin/python. Maybe the issue is something else. Any suggestions or insights will be helpful.
Re: Hadoop Streaming File-not-found error on Cloudera's training VM
Yes, I have tried that when passing the script. Just now I tried: hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input test_input/* -output output -file blah.py And got this error for a map task: java.io.IOException: Cannot run program blah.py: java.io.IOException: error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214) at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66) ... -Dan On Wed, Feb 17, 2010 at 7:47 PM, Todd Lipcon t...@cloudera.com wrote: Are you passing the python script to the cluster using the -file option? eg -mapper foo.py -file foo.py Thanks -Todd On Wed, Feb 17, 2010 at 7:45 PM, Dan Starr dsta...@gmail.com wrote: Hi, I've tried posting this to Cloudera's community support site, but the community website getsatisfaction.com returns various server errors at the moment. I believe the following is an issue related to my environment within Cloudera's Training virtual machine. Despite having success running Hadoop streaming on other Hadoop clusters and on Cloudera's Training VM in local mode, I'm currently getting an error when attempting to run a simple Hadoop streaming job in the normal queue based mode on the Training VM. I'm thinking the error described below is an issue related to the worker node not recognizing the python reference in the script's top shebang line. The hadoop command I am executing is: hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input test_input/* -output output Where the test_input directory contains 3 UNIX formatted, single line files: training-vm: 3$ hadoop dfs -ls /user/training/test_input/ Found 3 items -rw-r--r-- 1 training supergroup 11 2010-02-17 10:48 /user/training/test_input/file1 -rw-r--r-- 1 training supergroup 11 2010-02-17 10:48 /user/training/test_input/file2 -rw-r--r-- 1 training supergroup 11 2010-02-17 10:48 /user/training/test_input/file3 training-vm: 3$ hadoop dfs -cat /user/training/test_input/* test_line1 test_line2 test_line3 And where blah.py looks like (UNIX formatted): #!/usr/bin/python import sys for line in sys.stdin: print line The resulting Hadoop-Streaming error is: java.io.IOException: Cannot run program blah.py: java.io.IOException: error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214) ... I get the same error when placing the python script on the HDFS, and then using this in the hadoop command: ... -mapper hdfs:///user/training/blah.py ... One suggestion found online, which may not be relevant to Cloudera's distribution, mentions that the first line of the hadoop-streaming python script (the shebang line) may not describe an applicable path for the system. The solution mentioned is to use: ... -mapper python blah.py ... in the Hadoop streaming command. This doesn't seem to work correctly for me, since I find that the lines from the input data files are also parsed by the Python interpreter. But this does reveal that python is available on the worker node when using this technique. I have also tried without success the '-mapper blah.py' technique using shebang lines: #!/usr/bin/env python, although on the training VM Python is installed under /usr/bin/python. 
Maybe the issue is something else. Any suggestions or insights will be helpful.
Re: Hadoop Streaming File-not-found error on Cloudera's training VM
Todd, Thanks! This solved it. -Dan On Wed, Feb 17, 2010 at 8:00 PM, Todd Lipcon t...@cloudera.com wrote: Hi Dan, This is actually a bug in the release you're using. Please run: $ sudo apt-get update $ sudo apt-get install hadoop-0.20 Then restart the daemons (or the entire VM) and give it another go. Thanks -Todd On Wed, Feb 17, 2010 at 7:56 PM, Dan Starr dsta...@gmail.com wrote: Yes, I have tried that when passing the script. Just now I tried: hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input test_input/* -output output -file blah.py And got this error for a map task: java.io.IOException: Cannot run program blah.py: java.io.IOException: error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214) at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66) ... -Dan On Wed, Feb 17, 2010 at 7:47 PM, Todd Lipcon t...@cloudera.com wrote: Are you passing the python script to the cluster using the -file option? eg -mapper foo.py -file foo.py Thanks -Todd On Wed, Feb 17, 2010 at 7:45 PM, Dan Starr dsta...@gmail.com wrote: Hi, I've tried posting this to Cloudera's community support site, but the community website getsatisfaction.com returns various server errors at the moment. I believe the following is an issue related to my environment within Cloudera's Training virtual machine. Despite having success running Hadoop streaming on other Hadoop clusters and on Cloudera's Training VM in local mode, I'm currently getting an error when attempting to run a simple Hadoop streaming job in the normal queue based mode on the Training VM. I'm thinking the error described below is an issue related to the worker node not recognizing the python reference in the script's top shebang line. The hadoop command I am executing is: hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input test_input/* -output output Where the test_input directory contains 3 UNIX formatted, single line files: training-vm: 3$ hadoop dfs -ls /user/training/test_input/ Found 3 items -rw-r--r-- 1 training supergroup 11 2010-02-17 10:48 /user/training/test_input/file1 -rw-r--r-- 1 training supergroup 11 2010-02-17 10:48 /user/training/test_input/file2 -rw-r--r-- 1 training supergroup 11 2010-02-17 10:48 /user/training/test_input/file3 training-vm: 3$ hadoop dfs -cat /user/training/test_input/* test_line1 test_line2 test_line3 And where blah.py looks like (UNIX formatted): #!/usr/bin/python import sys for line in sys.stdin: print line The resulting Hadoop-Streaming error is: java.io.IOException: Cannot run program blah.py: java.io.IOException: error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214) ... I get the same error when placing the python script on the HDFS, and then using this in the hadoop command: ... -mapper hdfs:///user/training/blah.py ... One suggestion found online, which may not be relevant to Cloudera's distribution, mentions that the first line of the hadoop-streaming python script (the shebang line) may not describe an applicable path for the system. The solution mentioned is to use: ... -mapper python blah.py ... in the Hadoop streaming command. 
This doesn't seem to work correctly for me, since I find that the lines from the input data files are also parsed by the Python interpreter. But this does reveal that python is available on the worker node when using this technique. I have also tried without success the '-mapper blah.py' technique using shebang lines: #!/usr/bin/env python, although on the training VM Python is installed under /usr/bin/python. Maybe the issue is something else. Any suggestions or insights will be helpful.
Developing cross-component patches post-split
-- View this message in context: http://old.nabble.com/Developing-cross-component-patches-post-split-tp27634796p27634796.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
JavaDocs for DistCp (or similar)
Hi Folks, Currently we use distcp to transfer files between two Hadoop clusters. I have a Perl script which calls the system command "hadoop distcp" to achieve this. Is there a Java API to do distcp, so that we can avoid system calls from our Java code? Thanks, Balu
Re: JavaDocs for DistCp (or similar)
Oops, DistCp.main(..) calls System.exit(..) at the end, so it would also terminate your Java program. That is probably not desirable. You may still use code similar to DistCp.main(..), as shown below; however, these are not stable APIs.

//DistCp.main
public static void main(String[] args) throws Exception {
  JobConf job = new JobConf(DistCp.class);
  DistCp distcp = new DistCp(job);
  int res = ToolRunner.run(distcp, args);
  System.exit(res);
}

Nicholas

- Original Message From: Tsz Wo (Nicholas), Sze s29752-hadoopu...@yahoo.com To: common-user@hadoop.apache.org Sent: Wed, February 17, 2010 10:58:58 PM Subject: Re: JavaDocs for DistCp (or similar) Hi Balu, Unfortunately, DistCp does not have a public Java API. One simple way is to invoke DistCp.main(args) in your Java program, where args is an array of the string arguments you would pass on the command line. Hope this helps. Nicholas Sze

- Original Message From: Balu Vellanki To: common-user@hadoop.apache.org Sent: Wed, February 17, 2010 5:43:11 PM Subject: JavaDocs for DistCp (or similar) Hi Folks, Currently we use distcp to transfer files between two Hadoop clusters. I have a Perl script which calls the system command "hadoop distcp" to achieve this. Is there a Java API to do distcp, so that we can avoid system calls from our Java code? Thanks, Balu
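Along those lines, a minimal sketch of invoking DistCp in-process without the System.exit(..) call. The source and destination paths are placeholders, and as noted above this relies on internals that are not a stable API.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.util.ToolRunner;

public class InProcessDistCp {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(DistCp.class);
        DistCp distcp = new DistCp(job);
        // Same arguments you would pass on the command line; placeholders here.
        String[] copyArgs = {
            "hdfs://source-namenode:8020/data/input",
            "hdfs://dest-namenode:8020/data/input"
        };
        int res = ToolRunner.run(distcp, copyArgs);
        if (res != 0) {
            System.err.println("distcp failed with exit code " + res);
        }
    }
}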