Re: java.io.IOException: Task process exit with nonzero status of 1
You might be running out of disk space. Check for that on your cluster nodes. -Prashant On Fri, May 11, 2012 at 12:21 AM, JunYong Li lij...@gmail.com wrote: Are there errors in the task output file? On jobtracker.jsp, click the Jobid link -> tasks link -> Taskid link -> Task logs link. 2012/5/11 Mohit Kundra mohit@gmail.com Hi, I am a new user to Hadoop. I have installed hadoop-0.19.1 on a single Windows machine. Its http://localhost:50030/jobtracker.jsp and http://localhost:50070/dfshealth.jsp pages are working fine, but when I execute bin/hadoop jar hadoop-0.19.1-examples.jar pi 5 100 it shows the following: $ bin/hadoop jar hadoop-0.19.1-examples.jar pi 5 100 cygpath: cannot create short name of D:\hadoop-0.19.1\logs Number of Maps = 5 Samples per Map = 100 Wrote input for Map #0 Wrote input for Map #1 Wrote input for Map #2 Wrote input for Map #3 Wrote input for Map #4 Starting Job 12/05/11 12:07:26 INFO mapred.JobClient: Running job: job_20120513_0002 12/05/11 12:07:27 INFO mapred.JobClient: map 0% reduce 0% 12/05/11 12:07:35 INFO mapred.JobClient: Task Id : attempt_20120513_0002_m_06_0, Status : FAILED java.io.IOException: Task process exit with nonzero status of 1. at org.apache.hadoop.mapred.TaskRunner.run (TaskRunner.java:425) Please tell me what the root cause is. regards, Mohit -- Regards Junyong
Re: java.io.IOException: Task process exit with nonzero status of 1
Mohit, Why are you using Hadoop 0.19, a version released many years ago? Please download the latest stable release, available at http://hadoop.apache.org/common/releases.html#Download, instead. On Fri, May 11, 2012 at 12:26 PM, Mohit Kundra mohit@gmail.com wrote: [...] -- Harsh J
Re: Monitoring Hadoop Cluster
Zabbix does monitoring, archiving, graphing, and alerting. It has a JMX bean monitoring system, so if Hadoop exposes the beans you need, or you can add them easily, you have a great monitor. Also, check out 'Starfish'. It's a little old, but I got it running and it was really cool. On Thu, May 10, 2012 at 11:24 PM, Manu S manupk...@gmail.com wrote: Thanks a lot Junyong On Fri, May 11, 2012 at 11:15 AM, JunYong Li lij...@gmail.com wrote: Each has its own merits. http://developer.yahoo.com/hadoop/tutorial/module7.html#monitoring 2012/5/11 Manu S manupk...@gmail.com Hi All, Which is the best monitoring tool for Hadoop cluster monitoring? Ganglia or Nagios? Thanks, Manu S -- Regards Junyong -- Lance Norskog goks...@gmail.com
Re: Monitoring Hadoop Cluster
I've helped out linking Hadoop to Munin using JMX querying in the past; there's a writeup at: http://www.cs.huji.ac.il/wikis/MediaWiki/lawa/index.php/Munin_for_Hadoop Stu On Fri, May 11, 2012 at 02:15:16AM -0700, Lance Norskog wrote: [...] -- From the prompt of Stu Teasdale Happiness is a hard disk.
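For anyone following the JMX route above, a minimal sketch of exposing the daemons' MBeans for remote polling, assuming a 1.x-era hadoop-env.sh; the port numbers are arbitrary examples, and a real setup would want authentication and SSL:

# hadoop-env.sh -- expose each daemon's MBeans over remote JMX so tools
# like Zabbix or Munin can poll them (ports below are arbitrary examples;
# enable auth/SSL outside of a trusted network).
JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
export HADOOP_NAMENODE_OPTS="$JMX_OPTS -Dcom.sun.management.jmxremote.port=8004 $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="$JMX_OPTS -Dcom.sun.management.jmxremote.port=8005 $HADOOP_DATANODE_OPTS"
export HADOOP_JOBTRACKER_OPTS="$JMX_OPTS -Dcom.sun.management.jmxremote.port=8006 $HADOOP_JOBTRACKER_OPTS"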
Re: High load on datanode startup
On Thu, May 10, 2012 at 5:58 PM, Raj Vishwanathan rajv...@yahoo.com wrote: Darrell Are the new dn, nn and mapred directories on the same physical disk? Nothing on NFS, correct? Yes, that's correct. Could you be having some hardware issue? Any clue in /var/log/messages or dmesg? Hardware is good, all logs are clean. A non-responsive system indicates a CPU that is really busy either doing something or waiting for something, and the fact that it happens only on some nodes indicates a local problem. Yes, it was a very strange problem, which I seem to have solved (for now). So, yesterday I upgraded the cluster to cdh4, and I found some of the nodes started to display similar behaviour, but I was able to catch them early enough to do something about it. The solution was to remove the hadoop-env.sh that I had copied over from the cdh3 install; the only thing I had added to this file was the following, which I did to get pig/hbase talking: export HADOOP_CLASSPATH=`/usr/bin/hbase classpath`:$HADOOP_CLASSPATH What I saw on the machine was thousands of recursive processes in ps of the form 'bash /usr/bin/hbase classpath...'. Stopping everything didn't clean the processes up, so I had to kill them manually with some grep/xargs foo. Once this was all cleaned up and the hadoop-env.sh file removed, the nodes seem to be happy again. Darrell. Raj From: Darrell Taylor darrell.tay...@gmail.com To: common-user@hadoop.apache.org Cc: Raj Vishwanathan rajv...@yahoo.com Sent: Thursday, May 10, 2012 3:57 AM Subject: Re: High load on datanode startup On Thu, May 10, 2012 at 9:33 AM, Todd Lipcon t...@cloudera.com wrote: That's real weird.. If you can reproduce this after a reboot, I'd recommend letting the DN run for a minute, and then capturing a jstack <pid of dn> as well as the output of top -H -p <pid of dn> -b -n 5 and send it to the list. What I did after the reboot this morning was to move my dn, nn, and mapred directories out of the way, create new ones, format them, and restart the node; it's now happy. I'll try moving the directories back later and do the jstack as you suggest. What JVM/JDK are you using? What OS version? root@pl446:/# dpkg --get-selections | grep java java-common install libjaxp1.3-java install libjaxp1.3-java-gcj install libmysql-java install libxerces2-java install libxerces2-java-gcj install sun-java6-bin install sun-java6-javadb install sun-java6-jdk install sun-java6-jre install root@pl446:/# java -version java version "1.6.0_26" Java(TM) SE Runtime Environment (build 1.6.0_26-b03) Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode) root@pl446:/# cat /etc/issue Debian GNU/Linux 6.0 \n \l -Todd On Wed, May 9, 2012 at 11:57 PM, Darrell Taylor darrell.tay...@gmail.com wrote: On Wed, May 9, 2012 at 10:52 PM, Raj Vishwanathan rajv...@yahoo.com wrote: The picture is either too small or too pixelated for my eyes :-) There should be a zoom option in the top right of the page that allows you to view it full size. Can you login to the box and send the output of top? If the system is unresponsive, it has to be something more than an unbalanced hdfs cluster, methinks. Sorry, I'm unable to login to the box, it's completely unresponsive. Raj From: Darrell Taylor darrell.tay...@gmail.com To: common-user@hadoop.apache.org; Raj Vishwanathan rajv...@yahoo.com Sent: Wednesday, May 9, 2012 2:40 PM Subject: Re: High load on datanode startup On Wed, May 9, 2012 at 10:23 PM, Raj Vishwanathan rajv...@yahoo.com wrote: When you say 'load', what do you mean? CPU load or something else? I mean in the unix sense of load average, i.e. top would show a load of (currently) 376. Looking at Ganglia stats for the box it's not CPU load as such; the graphs show actual CPU usage as 30%, but the number of running processes is simply growing in a linear manner - screenshot of the Ganglia page here: https://picasaweb.google.com/lh/photo/Q0uFSzyLiriDuDnvyRUikXVR0iWwMibMfH0upnTwi28?feat=directlink Raj From: Darrell Taylor darrell.tay...@gmail.com To: common-user@hadoop.apache.org Sent: Wednesday, May 9, 2012 9:52 AM Subject: High load on datanode startup Hi, I wonder if someone could give some pointers with a problem I'm having? I have a 7 machine cluster setup for
Re: High load on datanode startup
On Fri, May 11, 2012 at 2:29 AM, Darrell Taylor darrell.tay...@gmail.com wrote: What I saw on the machine was thousands of recursive processes in ps of the form 'bash /usr/bin/hbase classpath...'. Stopping everything didn't clean the processes up, so I had to kill them manually with some grep/xargs foo. Once this was all cleaned up and the hadoop-env.sh file removed, the nodes seem to be happy again. Ah -- maybe the issue is that... my guess is that hbase classpath is now trying to include the Hadoop dependencies using hadoop classpath. But hadoop classpath was recursing right back because of that setting in hadoop-env. Basically you made a fork bomb - that explains the shape of the graph in Ganglia perfectly. -Todd [...] From: Darrell Taylor darrell.tay...@gmail.com To: common-user@hadoop.apache.org Sent: Wednesday, May 9, 2012 9:52 AM Subject: High load on datanode startup Hi, I wonder if someone could give some pointers with a problem I'm having? I have a 7 machine cluster setup for testing and we have been pouring data into it for a week without issue, have learnt several things along the way and solved all the problems up to now by searching online, but now I'm stuck. One of the data nodes decided to have a load of 70+ this morning; stopping the datanode and tasktracker brought it back to normal, but every time I start the datanode again the load shoots through the roof, and all I get in the logs is: STARTUP_MSG: Starting DataNode STARTUP_MSG: host = pl464/10.20.16.64 STARTUP_MSG: args = []
Re: High load on datanode startup
Doesn't look like the $HBASE_HOME/bin/hbase script runs $HADOOP_HOME/bin/hadoop classpath directly. Its classpath builder seems to add $HADOOP_HOME items manually via listing, etc. Perhaps if hbase-env.sh has a HBASE_CLASSPATH that imports `hadoop classpath`, and hadoop-env.sh has an `hbase classpath`, this issue could happen. I do know that `hbase classpath` may take very long and/or hang over network calls if there's a target/build directory inside of $HBASE_HOME, which causes it to use Maven to generate a classpath instead of using a cached file/local gen. Generally doing mvn clean solves that for me, whenever it happens over my installs. On Fri, May 11, 2012 at 3:02 PM, Todd Lipcon t...@cloudera.com wrote: [...]
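To illustrate the loop being described (a sketch, not the exact vendor scripts): if hadoop-env.sh shells out to `hbase classpath`, and the hbase script in turn shells out to `hadoop classpath`, every invocation spawns two more, which is exactly the fork bomb Todd describes. A hedged workaround is to resolve the HBase classpath once, out-of-band, and have hadoop-env.sh only read the cached result:

# Run once by an operator, NOT from hadoop-env.sh, so nothing recurses
# at daemon start (cache path is an example):
#   /usr/bin/hbase classpath > /etc/hadoop/conf/hbase-classpath.cache
#
# hadoop-env.sh then only reads the cached file:
if [ -f /etc/hadoop/conf/hbase-classpath.cache ]; then
  export HADOOP_CLASSPATH="$(cat /etc/hadoop/conf/hbase-classpath.cache):$HADOOP_CLASSPATH"
fi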
Re: freeze a mapreduce job
I do not know about the per-host slot control (that is most likely not supported, or not yet anyway - and perhaps feels wrong to do), but the rest of what you need is doable if you use schedulers and queues/pools. If you use FairScheduler (FS), ensure that this job always goes to a special pool, and when you want to freeze the pool simply set the pool's maxMaps and maxReduces to 0. Likewise, control max simultaneous tasks as you wish, to constrict instead of freeze. When you make changes to the FairScheduler configs, you do not need to restart the JT; you may simply wait a few seconds for FairScheduler to refresh its own configs. More on FS at http://hadoop.apache.org/common/docs/current/fair_scheduler.html If you use CapacityScheduler (CS), then I believe you can do this by again making sure the job goes to a specific queue, and when you need to freeze it, simply set the queue's maximum-capacity to 0 (percent), or, to constrict it, choose a lower, positive percentage value as you need. You can also refresh CS to pick up config changes by refreshing queues via mradmin. More on CS at http://hadoop.apache.org/common/docs/current/capacity_scheduler.html Either approach will not freeze/constrict the job immediately, but should certainly prevent it from progressing: tasks already running when the scheduler config changes will run to completion, but further task scheduling from those jobs will see the effect of the changes. P.s. A better solution would be to make your job not take as many days, somehow? :-) On Fri, May 11, 2012 at 4:13 PM, Rita rmorgan...@gmail.com wrote: I have a rather large map reduce job which takes a few days. I was wondering if it's possible for me to freeze the job or make the job less intensive. Is it possible to reduce the number of slots per host and then increase them overnight? tia -- --- Get your facts first, then you can distort them as you please.-- -- Harsh J
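As a concrete sketch of the FairScheduler approach (MR1-era; the pool name is an example, and the file location is whatever mapred.fairscheduler.allocation.file points at in your setup):

<?xml version="1.0"?>
<!-- fair-scheduler allocations file: a pool frozen by allowing it zero
     running tasks. The scheduler re-reads this file periodically, so no
     JobTracker restart is needed; raise the limits overnight to resume. -->
<allocations>
  <pool name="longjob">
    <maxMaps>0</maxMaps>
    <maxReduces>0</maxReduces>
  </pool>
</allocations>

Submitting the job with -Dmapred.fairscheduler.pool=longjob (or whatever pool-name property your FS config uses) is one way to route it to that pool.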
Re: freeze a mapreduce job
Just a quick note... If your task is currently occupying a slot, the only way to release the slot is to kill the specific task. If you are using FS, you can move the task to another queue and/or you can lower the job's priority, which will cause new tasks to spawn more slowly than those of other jobs, so you will eventually free up the cluster. There isn't a way to 'freeze' or stop a job mid state. Is the issue that the job has a large number of slots, or is it an issue of the individual tasks taking a long time to complete? If it's the latter, you will probably want to go to the capacity scheduler over the fair scheduler. HTH -Mike On May 11, 2012, at 6:08 AM, Harsh J wrote: [...]
Re: freeze a mapreduce job
thanks. I think I will investigate the capacity scheduler. On Fri, May 11, 2012 at 7:26 AM, Michael Segel michael_se...@hotmail.com wrote: [...] -- --- Get your facts first, then you can distort them as you please.--
How to maintain record boundaries
Hi When we store data into HDFS, it gets broken into small pieces and distributed across the cluster based on the block size for the file. While processing the data using an MR program, I want a particular record as a whole, without it being split across nodes, but the data has already been split and stored in HDFS when I loaded it. How would I make sure that my record doesn't get split, and how would my InputFormat make a difference now? Regards Shreya This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient(s), please reply to the sender and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email, and/or any action taken in reliance on the contents of this e-mail is strictly prohibited and may be unlawful.
Re: How to maintain record boundaries
Shreya, This has been asked several times before, and the way it is handled by TextInputFormat (for one example) is explained at http://wiki.apache.org/hadoop/HadoopMapReduce in the Map section. If you are writing a custom reader, feel free to follow the same steps - you basically need to seek over to the next blocks for an end-of-record marker and not limit yourself to just one-block reads. All input formats provided in MR handle this already for you, and you needn't worry about it unless you're implementing a whole new reader from scratch. On Fri, May 11, 2012 at 5:45 PM, shreya@cognizant.com wrote: [...] -- Harsh J
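To make that concrete, a simplified sketch of the convention such readers follow (pseudo-Java with hypothetical helper names, not the actual Hadoop source): skip a leading partial record unless the split starts at offset 0, and read past the split's end to finish the last record; together the two rules hand every record to exactly one mapper.

// Sketch of the split-boundary convention; helpers are hypothetical.
void initialize(long splitStart, long splitEnd) throws IOException {
    pos = splitStart;
    if (splitStart != 0) {
        // Not the first split: the record straddling our start belongs
        // to the previous reader, so skip to the next record boundary.
        pos = skipToNextRecordStart(pos);
    }
}

boolean nextRecord() throws IOException {
    // Only start records inside our split; the final record may extend
    // past splitEnd, and the stream simply keeps reading into the next
    // block (HDFS fetches those bytes remotely if necessary).
    if (pos >= splitEnd) {
        return false;
    }
    currentRecord = readOneRecord(pos); // reads across the boundary if needed
    pos += currentRecord.lengthInBytes();
    return true;
}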
Re: java.io.IOException: Task process exit with nonzero status of 1
Hi Mohit, 1) Hadoop is more portable with Linux, Ubuntu, or any non-DOS file system; you are running Hadoop on Windows, which could be the problem, because Hadoop generates some partial output files for temporary use. 2) Another thing is that you are running Hadoop version 0.19; I think upgrading the version will solve your problem, because the example you are using has had some problems with file reads and writes on Windows. 3) Check your input file data, because I can see your mapper is also at 0%. 4) If everything else looks right, could you please share your logs from under hadoopversion/logs; from there we can trace it very clearly. Thanks SAMIR On Fri, May 11, 2012 at 12:26 PM, Mohit Kundra mohit@gmail.com wrote: [...]
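For digging into the failed attempt as suggested above, a hedged sketch of where task logs typically live on a 0.19/0.20-era install (the attempt id below is a placeholder; substitute the one shown on the JobTracker page):

# Check free space first -- "exit with nonzero status of 1" is often a
# full disk or an unwritable log/tmp directory.
df -h

# Each task attempt writes stdout/stderr/syslog under userlogs/.
ls  $HADOOP_HOME/logs/userlogs/
cat $HADOOP_HOME/logs/userlogs/attempt_201205110001_0002_m_000006_0/stderr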
transferring between HDFS which reside in different subnet
Hi, I have a question for the Hadoop experts: I have two HDFS clusters, in different subnets. HDFS1: 192.168.*.* HDFS2: 10.10.*.* The namenode of HDFS2 has two NICs, one connected to 192.168.*.* and another to 10.10.*.*. So, is it possible to transfer data from HDFS1 to HDFS2 and vice versa? Regards, Arindam
Re: transferring between HDFS which reside in different subnet
If you can cross-access HDFS from both name nodes, then it should be transferable using the distcp command. Shi On 5/11/2012 8:45 AM, Arindam Choudhury wrote: [...]
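A minimal distcp sketch for reference, assuming example namenode hostnames and the default 8020 RPC port:

# Copy /data/src from HDFS1 into /data/dst on HDFS2; hostnames are examples.
hadoop distcp hdfs://nn-hdfs1:8020/data/src hdfs://nn-hdfs2:8020/data/dst

# If the two clusters run different Hadoop versions, the usual advice on
# these 0.20-era releases is to read the source via read-only HFTP and run
# the job on the destination cluster:
hadoop distcp hftp://nn-hdfs1:50070/data/src hdfs://nn-hdfs2:8020/data/dst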
Re: transferring between HDFS which reside in different subnet
I cannot cross-access HDFS. Though HDFS2's namenode has two NICs, HDFS itself is running on the other subnet. On Fri, May 11, 2012 at 3:57 PM, Shi Yu sh...@uchicago.edu wrote: [...]
Re: freeze a mapreduce job
Is there any risk in suppressing a job for too long in FS? I guess there are some parameters to control the waiting time of a job (such as a timeout, etc.); for example, if a job is kept idle for more than 24 hours, is there a configuration deciding whether to kill or keep that job? Shi On 5/11/2012 6:52 AM, Rita wrote: [...]
Re: How to maintain record boundaries
Here is some quick code for you (based on Tom's book). You could override the TextInputFormat isSplitable method to avoid splitting, which is pretty important and useful when processing sequence data.

// Old API (org.apache.hadoop.mapred)
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // each file becomes exactly one split
    }
}

// New API (org.apache.hadoop.mapreduce)
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormatNewAPI extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // each file becomes exactly one split
    }
}

On 5/11/2012 7:19 AM, Harsh J wrote: [...]
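And a hedged usage sketch for wiring the new-API class into a job (standard new-API calls; the input path is an example):

// Configure a job to read each input file as a single, unsplit stream.
Configuration conf = new Configuration();
Job job = new Job(conf, "whole-file-example");
job.setInputFormatClass(NonSplittableTextInputFormatNewAPI.class);
FileInputFormat.addInputPath(job, new Path("/data/in")); // example path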
Re: transferring between HDFS which reside in different subnet
It seems that in your case HDFS2 can access HDFS1, so you should be able to transfer HDFS1 data to HDFS2. If you want to cross-transfer, you don't need to run distcp on cluster nodes; if any client node (not necessarily a namenode, datanode, secondary namenode, etc.) can access both HDFS clusters, then run the transfer command on that client node. On 5/11/2012 9:03 AM, Arindam Choudhury wrote: [...]
Re: transferring between HDFS which reside in different subnet
Looks like both are private subnets, so you've got to route via a public default gateway. Try adding a route using the route command if you're on Linux (Windows, I have no idea). Just a thought; I haven't tried it though. Thanks, Rajesh Typed from mobile, please bear with typos. On May 11, 2012 10:03 AM, Arindam Choudhury arindamchoudhu...@gmail.com wrote: [...]
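For reference, a hedged example of that route suggestion on Linux; the gateway address is a placeholder for whatever host actually bridges the two subnets:

# On an HDFS1-side (192.168.*.*) node, add a route to the 10.10.0.0/16
# subnet through a bridging gateway (gateway address is a placeholder).
route add -net 10.10.0.0 netmask 255.255.0.0 gw 192.168.0.1

# Equivalent with the newer iproute2 tooling:
ip route add 10.10.0.0/16 via 192.168.0.1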
Re: transferring between HDFS which reside in different subnet
So, hadoop dfs -cp hdfs://... hdfs://... - will this work? On Fri, May 11, 2012 at 4:14 PM, Rajesh Sai T tsairaj...@gmail.com wrote: [...]
Re: freeze a mapreduce job
I haven't seen any. Haven't really had to test that... On May 11, 2012, at 9:03 AM, Shi Yu wrote: [...]
Re: freeze a mapreduce job
I am not aware of a job-level timeout or idle monitor. On Fri, May 11, 2012 at 7:33 PM, Shi Yu sh...@uchicago.edu wrote: [...] -- Harsh J
Re: freeze a mapreduce job
There is an idle timeout for map/reduce tasks. If a task makes no progress for 10 min (default), the AM will kill it on 2.0 and the JT will kill it on 1.0. But I don't know of anything associated with a Job, other than, in 0.23, if the AM does not heartbeat back in for too long, I believe the RM may kill it and retry, but I don't know for sure. --Bobby Evans On 5/11/12 10:53 AM, Harsh J ha...@cloudera.com wrote: [...]
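The per-task timeout Bobby mentions is configurable; a hedged mapred-site.xml sketch for the 1.0-era property (the 2.x name is mapreduce.task.timeout):

<!-- mapred-site.xml: kill tasks that report no progress for 10 minutes.
     Value is in milliseconds; 600000 ms is the usual default. -->
<property>
  <name>mapred.task.timeout</name>
  <value>600000</value>
</property>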
Question on MapReduce
Hi, I am a newbie to Hadoop and have a quick question on optimal compute vs. storage resources for MapReduce. If I have a multiprocessor node with 4 processors, will Hadoop schedule a higher number of Map or Reduce tasks on that system than on a uni-processor system? In other words, does Hadoop detect denser systems and schedule denser tasks on multiprocessor systems? If yes, does that imply that it makes sense to attach higher-capacity storage to store a larger number of blocks on systems with dense compute? Any insights will be very useful. Thanks, Satheesh
Re: DatanodeRegistration, socketTImeOutException
I have set dfs.datanode.max.xcievers=4096 and have swapping turned off. Regionserver Heap = 24 GB Datanode Heap = 1 GB On Fri, May 11, 2012 at 9:55 AM, sulabh choudhury sula...@gmail.com wrote: I have spent a lot of time trying to find a solution to this issue, but have had no luck. I think this is because of HBase's read/write pattern, but I do not see any related errors in the HBase logs. It does not look like it is because of a GC pause, but seeing several 480000 ms timeouts certainly suggests something is really slowing down the *writes* (I do see this only on the write channel). In my datanode logs I see tonnes of 2012-05-11 09:34:30,953 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration( 10.10.2.102:50010, storageID=DS-1494937024-10.10.2.102-50010-1305755343443, infoPort=50075, ipcPort=50020):Got exception while serving blk_-5331817573170456741_12784653 to /10.10.2.102: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for *write*. ch : java.nio.channels.SocketChannel[connected local=/10.10.2.102:50010 remote=/10.10.2.102:46752] at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246) at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159) at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:267) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:163) 2012-05-11 09:34:30,953 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration( 10.10.2.102:50010, storageID=DS-1494937024-10.10.2.102-50010-1305755343443, infoPort=50075, ipcPort=50020):DataXceiver java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.10.2.102:50010 remote=/10.10.2.102:46752] at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246) at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159) at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:267) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:163) This block is mapped to an HBase region; from the NN logs: 2012-05-10 15:46:35,117 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/table1/5a84f3844b7fd049c73a78b78ba6c2cf/.tmp/1639371300072460962. blk_4283960240517860151_12781124 2012-05-10 15:47:18,000 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.10.2.103:50010 is added to blk_4283960240517860151_12781124 size 134217728 2012-05-10 15:47:18,000 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.10.2.102:50010 is added to blk_4283960240517860151_12781124 size 134217728 I am running hbase-0.90.4-cdh3u3 on hadoop-0.20.2-cdh3u3 -- -- Thanks and Regards, Sulabh Choudhury
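If the writes really are just slow rather than stuck, one hedged knob on this CDH3-era HDFS is the datanode's socket write timeout; 480000 ms (8 minutes) is the default that appears in the exception text. An hdfs-site.xml sketch:

<!-- hdfs-site.xml: raise the datanode's write-side socket timeout
     (milliseconds). 480000 is the default; setting 0 disables the timeout,
     which hides the symptom rather than fixing the slow consumer. -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>960000</value>
</property>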
RE: Question on MapReduce
Nope, you must tune the config on that specific super node to have more M/R slots (this is for 1.0.x). This does not mean the JobTracker will be eager to stuff that super node with all the M/R jobs at hand. It still goes through the scheduler; the Capacity Scheduler is most likely what you have (check your config). IMO, if the data locality is not going to be there, your cluster is going to suffer from network I/O. -----Original Message----- From: Satheesh Kumar [mailto:nks...@gmail.com] Sent: Friday, May 11, 2012 9:51 AM To: common-user@hadoop.apache.org Subject: Question on MapReduce [...]
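A hedged sketch of the per-node tuning Leo describes, placed in mapred-site.xml on the 4-processor node only (slot counts are illustrative; a common 1.x rule of thumb was roughly one to two slots per core, split between maps and reduces):

<!-- mapred-site.xml on the beefier node: give its TaskTracker more
     concurrent slots than the single-processor nodes get. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value> <!-- example: one map slot per core -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>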
Re: Question on MapReduce
Thanks, Leo. What is the config of a typical data node in a Hadoop cluster - cores, storage capacity, and connectivity (SATA?)? How many tasktracker slots are scheduled per core in general? Is there a best practices guide somewhere? Thanks, Satheesh On Fri, May 11, 2012 at 10:48 AM, Leo Leung lle...@ddn.com wrote: [...]
RE: Question on MapReduce
This may be dated material. Cloudera and HDP folks, please correct with updates :) http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/ http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/ http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/ Hope this helps. -----Original Message----- From: Satheesh Kumar [mailto:nks...@gmail.com] Sent: Friday, May 11, 2012 12:48 PM To: common-user@hadoop.apache.org Subject: Re: Question on MapReduce [...]
Re: How to maintain record boundaries
Record reader implementations are typically written to honor record boundaries. This means that while reading a split's data, they will continue reading past the end of the split if it has been reached BUT the end of the current record has not yet been encountered. -@nkur On 5/11/12 5:15 AM, shreya@cognizant.com wrote: [...]
Resource underutilization / final reduce tasks only uses half of cluster ( tasktracker map/reduce slots )
I see mapred.tasktracker.reduce.tasks.maximum and mapred.tasktracker.map.tasks.maximum, but I'm wondering if there isn't another tuning parameter I need to look at. I can tune the tasktracker so that when I have many jobs running, with many simultaneous maps and reduces, I utilize 95% of CPU and memory. Inevitably, though, I end up with a huge final reduce phase that only uses half of my cluster, because I have reserved the other half for mapping. Is there a way around this problem? Seems like there should also be a maximum number of reducers conditional on no map tasks running. -JD
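For context, a sketch of the static split being described (values illustrative). On MR1 these counts are fixed per TaskTracker and cannot shift conditionally between phases, which is exactly the limitation above; YARN later replaced static slots with fungible containers.

<!-- mapred-site.xml: MR1 slots are statically partitioned per TaskTracker,
     so map slots sit idle during a long final reduce phase. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value> <!-- raising this helps only if memory/CPU allow -->
</property>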
Moving files from JBoss server to HDFS
Hello, We have a large number of custom-generated files (not just web logs) that we need to move from our JBoss servers to HDFS. Our first implementation ran a cron job every 5 minutes to move the files from the output directory to HDFS. Is this recommended? We are being told by our IT team that our JBoss servers should not have access to HDFS for security reasons. The files must be pulled to HDFS by other servers that do not accept traffic from the outside. In essence, they are asking for a layer of indirection. Instead of: {JBoss server} -> {HDFS} it's being requested that it look like: {Separate server} -> {JBoss server} and then {Separate server} -> {HDFS} While I understand in principle what is being said, the security of having processes on JBoss servers writing files to HDFS doesn't seem any worse than having Tomcat servers access a central database, which they do. Can anyone comment on what a recommended approach would be? Should our JBoss servers push their data to HDFS, or should the data be pulled by another server and then placed into HDFS? Thank you! FT
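A hedged sketch of the pull model being requested, as a cron job on the intermediate server; all hostnames and paths below are placeholders, and heavier-weight tools like Flume were the usual alternative for continuous ingestion at the time:

#!/bin/sh
# Runs on the intermediate server: pull finished files from the JBoss box,
# then push them into HDFS. Hosts and paths are placeholders.
STAGING=/var/staging/jboss-logs
mkdir -p "$STAGING"

# Pull only completed files; --remove-source-files keeps the JBoss side clean.
rsync -a --remove-source-files jboss01:/opt/jboss/output/ "$STAGING/"

# Push into HDFS and drop the local copy only once the upload succeeds.
for f in "$STAGING"/*; do
  [ -f "$f" ] || continue
  hadoop fs -put "$f" /ingest/jboss/ && rm -f "$f"
done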