RE: Automate Hadoop installation
Hi Praveenesh,

I had created VM images with the OS and Hadoop pre-configured on each node, which I would start as required. But if you plan to do this at the hardware level, Linux provides Kickstart-style provisioning, which performs OS and package installation automatically (network configuration is done through DHCP). This requires a TFTP server, a DHCP server, and hardware that supports network (PXE) boot. Something like Puppet or shell scripts can also be set up, as you mentioned; I have used those, but not for Hadoop.

Thanks,
Sagar

-----Original Message-----
From: praveenesh kumar [mailto:praveen...@gmail.com]
Sent: Monday, December 05, 2011 4:02 PM
To: common-user@hadoop.apache.org
Subject: Automate Hadoop installation

Hi all,
Can anyone guide me on how to automate the Hadoop installation/configuration process? I want to install Hadoop on 10-20 nodes, which may even grow to 50-100 nodes. I know we can use configuration tools like Puppet or shell scripts. Has anyone done it? How can we do Hadoop installations on so many machines in parallel? What are the best practices for this?
Thanks,
Praveenesh
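For illustration, the shell-script route mentioned above might look like this minimal sketch. All names are hypothetical (nodes.txt with one hostname per line, the tarball name, the target paths, the hadoop user), and it assumes passwordless SSH from an admin machine:

    #!/bin/sh
    # Push a Hadoop tarball plus a shared conf/ dir to every node, in parallel.
    TARBALL=hadoop-0.20.205.0.tar.gz
    for host in $(cat nodes.txt); do
      (
        scp "$TARBALL" "hadoop@$host:/opt/" &&
        ssh "hadoop@$host" "tar -xzf /opt/$TARBALL -C /opt" &&
        scp -r conf/* "hadoop@$host:/opt/hadoop-0.20.205.0/conf/"
      ) &                 # each node installs in the background
    done
    wait                  # block until every node has finished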
Re: Help with Hadoop Eclipse Plugin on Mac OS X Lion
I am running the 64-bit version. Have you set up SSH properly?

On Dec 3, 2011, at 2:30 AM, Will L wrote:
I am using 64-bit Eclipse 3.7.1 Cocoa with Hadoop 0.20.205.0. I get the following error message: "An internal error occurred during: Connecting to DFS localhost. org/apache/commons/configuration/Configuration"

From: seventeen_reas...@hotmail.com
To: common-user@hadoop.apache.org
Subject: RE: Help with Hadoop Eclipse Plugin on Mac OS X Lion
Date: Fri, 2 Dec 2011 20:51:02 -0800
What version of Hadoop are you running on OS X Lion, and are you running the 32-bit or 64-bit version of Eclipse?

Subject: Re: Help with Hadoop Eclipse Plugin on Mac OS X Lion
From: jign...@websoft.com
Date: Fri, 2 Dec 2011 14:37:28 -0500
To: common-user@hadoop.apache.org
I am running the Eclipse plugin on Lion OS X with Eclipse 3.7. Take the plugin from the contrib folder in the dump and put it in your Eclipse plugin library. If that doesn't work, remove Eclipse and reinstall a fresh version. -Jignesh

On Dec 2, 2011, at 11:59 AM, Prashant Sharma wrote:
Nice to know, Will. As I said, you have the same luxury as long as you are running in stand-alone mode, which is ideal for development.

On Fri, Dec 2, 2011 at 10:02 PM, Will L wrote:
I got the setup working on my laptop running OS X Snow Leopard without any problems, and I would like to use my new laptop running OS X Lion. The plugin is helpful in that I can see Hadoop output being dumped to the Eclipse console, and it used to integrate well with the Eclipse IDE, making my development life a little easier. Thank you for your time and help. Sincerely, Will Lieu

Date: Fri, 2 Dec 2011 21:44:36 +0530
Subject: Re: Help with Hadoop Eclipse Plugin on Mac OS X Lion
From: prashant.ii...@gmail.com
To: common-user@hadoop.apache.org
Why do you need a plugin at all? You can do away with it by having a Maven project, i.e. a pom.xml with Hadoop as one of the dependencies, and then using regular Maven commands to build, etc. For example, mvn eclipse:eclipse would be an interesting command.

On Fri, Dec 2, 2011 at 1:59 PM, Will L wrote:
Oops, guess the formatting went away. I have tried the following combinations:
* Hadoop 0.20.203, Eclipse 3.6.2 (32-bit), hadoop-eclipse-plugin-0.20.203.0.jar
* Hadoop 0.20.203, Eclipse 3.6.2 (32-bit), hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar (from JIRA)
* Hadoop 0.20.203, Eclipse 3.7.1 (32-bit), hadoop-eclipse-plugin-0.20.203.0.jar
* Hadoop 0.20.203, Eclipse 3.7.1 (32-bit), hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar (from JIRA)
* Hadoop 0.20.205, Eclipse 3.7.1 (32-bit), hadoop-eclipse-plugin-0.20.205.0.jar

From: seventeen_reas...@hotmail.com
To: common-user@hadoop.apache.org
Subject: Help with Hadoop Eclipse Plugin on Mac OS X Lion
Date: Fri, 2 Dec 2011 00:26:28 -0800
Hello, I am having problems getting the Hadoop Eclipse plugin to work on Mac OS X Lion. I have tried the combinations listed above. Has anyone gotten the Hadoop Eclipse plugin to work on Mac OS X Lion? Thank you for your time and help; I greatly appreciate it! Sincerely, Will
Multiple Mappers for Multiple Tables
I would like to join some DB tables, possibly from different databases, in an MR job. I would essentially like to use MultipleInputs, but that seems file-oriented. I need a different mapper for each DB table. Suggestions? Thanks! Justin Vincent
Re: Automate Hadoop installation
There's this great project called Bigtop (in the Apache Incubator) which provides for building the Hadoop stack. Part of what it provides is a set of Puppet recipes which will let you do exactly what you're looking for, perhaps with some minor corrections. Seriously, look at Puppet - otherwise it will be a living nightmare of configuration mismanagement.

Cos
Hadoop Profiling
I turned on profiling in Hadoop, and the MapReduce tutorial at http://hadoop.apache.org/common/docs/current/mapred_tutorial.html says that the profile files should go to the user log directory. However, they're currently going to the working directory that I start the Hadoop job from. I've set $HADOOP_LOG_DIR, but that hasn't made a difference. What do I need to change or set for the profile files to go to the correct log directory? Thanks.
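For reference, task profiling in this generation of Hadoop is typically switched on through job configuration properties along these lines. This is a sketch only; the hprof string shown is the stock default, whose file=%s placeholder names each task's profile output file:

    Configuration conf = new Configuration();
    conf.setBoolean("mapred.task.profile", true);      // turn profiling on
    conf.set("mapred.task.profile.maps", "0-2");       // profile the first three map tasks
    conf.set("mapred.task.profile.reduces", "0-2");    // profile the first three reduce tasks
    conf.set("mapred.task.profile.params",
        "-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s");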
Re: Multiple Mappers for Multiple Tables
Justin, if I get your requirement right, you need to pull in data from multiple RDBMS sources, join it, and perhaps run some more custom operations on top of that. For this you don't need to write custom MapReduce code unless it is really required. You can achieve it in two easy steps (a sketch follows below):
- Import the data from the RDBMS into Hive using Sqoop (import)
- Use Hive to do the join and processing on this data
Hope it helps!
Regards
Bejoy.K.S
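A minimal sketch of those two steps - the connection strings, credentials, and table names below are all hypothetical:

    # Step 1: import each source table into Hive with Sqoop
    sqoop import --connect jdbc:mysql://db1.example.com/sales \
      --username etl -P --table orders --hive-import --hive-table orders
    sqoop import --connect jdbc:mysql://db2.example.com/crm \
      --username etl -P --table customers --hive-import --hive-table customers

    # Step 2: do the join (and any further processing) in Hive
    hive -e "SELECT o.id, o.total, c.name
             FROM orders o JOIN customers c ON (o.customer_id = c.id)"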
Pig Output
Using PigStorage(), my Pig script output gets put into part files on the Hadoop file system. When I use the copyToLocal function from Hadoop, it creates a local directory with all the part files. Is there a way to copy the part files from Hadoop into a single local file? Thanks
Re: Multiple Mappers for Multiple Tables
Hi Justin,
Just to add on to my response: if you need to fetch data from an RDBMS in your mapper using custom MapReduce code, you can use DBInputFormat for your mapper class with MultipleInputs. You have to be careful with the number of mappers in your application, as databases are constrained by a limit on maximum simultaneous connections. You also need to ensure that the same query is not executed n times in n mappers, all fetching the same data; that would just be a waste of network bandwidth. Sqoop + Hive would be my recommendation, and a good combination for such use cases. If you have Pig competency, you can also look into Pig instead of Hive.
Hope it helps!
Regards
Bejoy.K.S
Re: Pig Output
Hi Aaron,
Instead of copyToLocal, use getmerge; it will do the job. The CLI syntax is:

hadoop fs -getmerge <source dir in HDFS (your Pig output dir)> <local destination file, e.g. /local/dest/xyz.txt>

Hope it helps!
Regards
Bejoy.K.S
MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)
Hi,
Using: Version 0.20.2-cdh3u0, r81256ad0f2e4ab2bd34b04f53d25a6c23686dd14, 8-node cluster, 64-bit CentOS.

We are occasionally seeing MAX_FETCH_RETRIES_PER_MAP errors on reducer jobs. When we investigate, it looks like the TaskTracker on the node being fetched from is not running. In the logs we see what looks like a self-initiated shutdown:

2011-12-05 14:10:48,632 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201112050908_0222_r_1100711673 exited with exit code 0. Number of tasks it ran: 0
2011-12-05 14:10:48,632 ERROR org.apache.hadoop.mapred.JvmManager: Caught Throwable in JVMRunner. Aborting TaskTracker.
java.lang.NullPointerException
    at org.apache.hadoop.mapred.DefaultTaskController.logShExecStatus(DefaultTaskController.java:145)
    at org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:129)
    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:472)
    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:446)
2011-12-05 14:10:48,634 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down TaskTracker at had11.atlis1/10.120.41.118
************************************************************/

Then the reducers have the following:

2011-12-05 14:12:00,962 WARN org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
    at java.net.Socket.connect(Socket.java:529)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
    at sun.net.www.http.HttpClient.New(HttpClient.java:306)
    at sun.net.www.http.HttpClient.New(HttpClient.java:323)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1525)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1482)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1390)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)
2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201112050908_0169_r_05_0: Failed fetch #2 from attempt_201112050908_0169_m_02_0
2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output from attempt_201112050908_0169_m_02_0 even after MAX_FETCH_RETRIES_PER_MAP retries... or it is a read error, reporting to the JobTracker
2011-12-05 14:12:00,962 FATAL org.apache.hadoop.mapred.ReduceTask: Shuffle failed with too many fetch failures and insufficient progress! Killing task attempt_201112050908_0169_r_05_0.
2011-12-05 14:12:00,966 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201112050908_0169_r_05_0 adding host had11.atlis1 to penalty box, next contact in 8 seconds
2011-12-05 14:12:00,966 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201112050908_0169_r_05_0: Got 1 map-outputs from previous failures

The job then fails. Several questions:
1. What is causing the TaskTracker to fail/exit? This is after running hundreds to thousands of jobs, so it's not just at start-up.
2. Why isn't Hadoop detecting that the reducers need something from a dead mapper and restarting the map task, even if it means aborting the reducers?
3. Why isn't the DataNode being used to fetch the blocks? It is still up and running when this happens, so shouldn't it know where the files are in HDFS?
Thanks,
Chris
Running a job continuously
Hi everyone, I want to run an MR job continuously, because I have streaming data and I try to analyze it all the time with my own algorithm. For example, say you want to solve the wordcount problem - it's the simplest one :) If you have multiple files and new files keep arriving, how do you handle it? You could execute an MR job per file, but you would have to do it repeatedly. So what do you think? Thanks. Best regards...
--
BURAK ISIKLI | http://burakisikli.wordpress.com
Re: MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)
Hi Chris,
I'd suggest updating to a newer version of your Hadoop distro - you're hitting some bugs that were fixed last summer. In particular, you're missing the amendment patch from MAPREDUCE-2373 as well as some patches to MR which make the fetch retry behavior more aggressive.
-Todd

--
Todd Lipcon
Software Engineer, Cloudera
Re: MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)
Hi Chris,
From the stack trace, it looks like a JVM corruption issue. It is a known issue and has been fixed in CDH3u2; I believe an upgrade would solve your problems. https://issues.apache.org/jira/browse/MAPREDUCE-3184
Regarding your queries, I'll try to help you out a bit. In MapReduce, the data transfer between map and reduce happens over HTTP. If Jetty is down, that transfer won't happen, which means map output in one location won't be accessible to a reducer in another location. The map outputs are on the local file system, not on HDFS, so even if the DataNode on the machine is up, we can't get the data under the above circumstances.
Hope it helps!
Regards
Bejoy.K.S
Re: Running a job continuously
Burak,
If you have a continuous inflow of data, you can use Flume to aggregate the files into larger sequence files (or similar) if they are small, and push the data onto HDFS once you have a substantial chunk (roughly equal to the HDFS block size). Based on your SLAs, you then schedule your jobs using Oozie or a simple shell script. In very simple terms (a sketch follows below):
- push input data (e.g. from a Flume collector) into a staging HDFS dir
- before triggering the job (hadoop jar), move the input from staging to the main input dir
- execute the job
- archive the input and output into archive dirs (or any other dirs)
- the output archive dir becomes the source of the output data
- delete the output dir and empty the input dir
Hope it helps!
Regards
Bejoy.K.S
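A bare-bones shell sketch of that cycle - every directory name and the job jar are hypothetical, and scheduling is left to cron or Oozie:

    #!/bin/sh
    STAGING=/data/staging                      # collector (e.g. Flume) writes here
    INPUT=/data/input                          # the job reads from here
    OUTPUT=/data/output                        # the job writes here
    ARCHIVE=/data/archive/$(date +%Y%m%d%H%M)  # per-run archive dir

    hadoop fs -mv "$STAGING"/* "$INPUT"        # promote staged files
    hadoop jar myjob.jar MyDriver "$INPUT" "$OUTPUT"
    hadoop fs -mkdir "$ARCHIVE"
    hadoop fs -mv "$INPUT"/* "$OUTPUT" "$ARCHIVE"   # archive input and output
    # the next run starts with an empty input dir and no output dir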
Re: Running a job continuously
Burak, Before we can really answer your question, you need to give us some more information on the processing you want to do. Do you want output that is continuous or batched (if so, how)? How should the output at a given time be related to the input up to then and the previous outputs? Regards, Mike
Re: Pig Output
hadoop dfs -cat /my/path/* > single_file

Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com
Re: Running a job continuously
You might also want to take a look at Storm, as that's what it's designed to do: https://github.com/nathanmarz/storm/wiki

--
Thanks,
John C
Re: Multiple Mappers for Multiple Tables
Thanks Bejoy. I was looking at DBInputFormat with MultipleInputs. MultipleInputs takes a Path parameter - are those paths just ignored here?
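For reference, a single-table DBInputFormat job is usually wired up roughly as below (new API, one job per table, so MultipleInputs and its Path argument are sidestepped entirely). OrderRecord and OrderMapper are hypothetical classes - OrderRecord would implement DBWritable - and the connection details are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class OrdersJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // JDBC driver, URL, user, password (hypothetical)
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
            "jdbc:mysql://db1.example.com/sales", "etl", "secret");
        Job job = new Job(conf, "orders-from-db");
        job.setJarByClass(OrdersJob.class);
        job.setInputFormatClass(DBInputFormat.class);
        DBInputFormat.setInput(job, OrderRecord.class,
            "orders",                    // table name
            null,                        // WHERE conditions (none)
            "order_id",                  // ORDER BY column, used to split the query
            "order_id", "customer_id");  // columns to read
        job.setMapperClass(OrderMapper.class);  // hypothetical mapper
        job.setNumReduceTasks(0);               // map-only extract
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        job.waitForCompletion(true);
      }
    }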
Re: Running a job continuously
Hi Burak,
The model of Hadoop is very different: it is job-based, in simpler words a kind of batch model, where a MapReduce job is executed on a batch of data that is already present. For your requirement, the wordcount example doesn't make sense if the file is being written to continuously; however, word count per hour or per minute does make sense as a MapReduce program. I second what Bejoy mentioned: use Flume, aggregate the data, and then run MapReduce. Hadoop can give you near real time by combining Flume with MapReduce, running a MapReduce job on the Flume-dumped data every few minutes. A second option is to see whether your problem can be solved by a Flume decorator itself, for a real-time experience.
Regards,
Abhishek

On Mon, Dec 5, 2011 at 2:33 PM, burakkk wrote:
Athanasios Papaoikonomou, a cron job isn't useful for me, because I want to execute the MR job with the same algorithm while different files arrive with different velocities. Both Storm and Facebook's Hadoop are designed for that, but I want to use the Apache distribution. Bejoy Ks, I have a continuous inflow of data, but I think I need a near-real-time system. Mike Spreitzer, both output and input are continuous. The output isn't relevant to the input; all I want is for all incoming files to be processed by the same job and the same algorithm. For example, think about the wordcount problem. When you want to run wordcount, you implement this: http://wiki.apache.org/hadoop/WordCount But when the program reaches job.waitForCompletion(true); the job eventually ends. If you want to make it continuous, what do you do in Hadoop without other tools? One more thing: assume the input files are named filename_timestamp (e.g. filename_20111206_0030).

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}

--
BURAK ISIKLI | http://burakisikli.wordpress.com
RE: Running a job continuously
Hi Burak,

> Bejoy Ks, I have a continuous inflow of data, but I think I need a near-real-time system.

Just to add to Bejoy's point: with Oozie, you can specify a data dependency for running your job. When the specified amount of data is in, you can configure Oozie to run your job. I think this will satisfy your requirement.

Regards,
Ravi Teja
Re: Availability of Job traces or logs
Arun,

> Amar, I am attempting to write a new scheduler for Hadoop and test it using Mumak.
> 1. I want to test its behaviour under different sizes of job traces (meaning number of jobs, say 5, 10, 25, 50, 100) and different numbers of nodes. Till now I was using only the test/data given by Mumak, which has 19 jobs and a 1529-node topology. I don't have many nodes with me to run some programs, collect logs, and use Rumen to generate traces.

For the varying-jobs part, you can run sleep jobs with varying numbers of map/reduce tasks and sleep times. For varying the cluster size, you can run multiple TaskTrackers on the same node; you can start with 5 trackers per node. Since you will be running sleep jobs, this should be OK. Make sure Hadoop security is turned off and the default task controller is used. Design your topology script intelligently so that it clubs all the trackers on the same node under one rack.

> 2. I want to control the split placements, so I need to modify the preferred locations for task attempts in the trace, but the trace for even 19 jobs is huge. So I was thinking whether I can get small, medium, and large job traces with corresponding topology traces so that modifying will be easier.

For this, you need to understand how Rumen handles job logs. I have created MAPREDUCE-3508 for adding filtering capabilities to Rumen. You can make use of this feature to modify Rumen's output and play around with splits. You can also use it to select a few jobs (say 10, 50, etc.) from the input trace.

Amar

On Sat, Dec 3, 2011 at 1:15 PM, Amar Kamat wrote:
Arun, you can very well run synthetic workloads like large-scale sort, wordcount, etc., or more realistic workloads like PigMix (https://cwiki.apache.org/confluence/display/PIG/PigMix). On a decent enough cluster, these workloads work pretty well. Is there a specific reason why you want traces of varied sizes from various organizations?
> How can I make sure that Rumen generates only, say, 25 or 50 jobs?
Do you want to get 25/50 jobs based on some filtering criterion? I recently faced a similar situation where I wanted to extract jobs from a Rumen trace based on job ids. I will be happy to share these filtering tools.
Amar

On Dec 1, 2011, ArunKumar wrote:
Hi guys! Apart from generating job traces with Rumen, can I get logs or job traces of varied sizes from some organizations? How can I make sure that Rumen generates only, say, 25 or 50 jobs? Thanks, Arun
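For reference, the sleep-job approach above can be driven from the shell - the test jar name varies by release, so adjust the glob to your installation:

    # 50 maps / 10 reduces, each task sleeping for 1000 ms
    hadoop jar $HADOOP_HOME/hadoop-*-test*.jar sleep -m 50 -r 10 -mt 1000 -rt 1000
    # a larger trace for the same simulated cluster
    hadoop jar $HADOOP_HOME/hadoop-*-test*.jar sleep -m 500 -r 50 -mt 1000 -rt 1000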
Re: Automate Hadoop installation
Hi, to deploy software I suggest Pulp: https://fedorahosted.org/pulp/wiki/HowTo
For a package-based distro (Debian, Red Hat, CentOS) you can build Apache's Hadoop, package it, and deploy it. Configs, as Cos says, go via Puppet. If you use Red Hat / CentOS, take a look at Spacewalk.
best,
Alex

--
Alexander Lorenz
http://mapredit.blogspot.com