Cannot start yarn daemons
Hi, I am trying to install Hadoop 0.23.1-SNAPSHOT. While starting the YARN daemons it shows the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
 at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: org.apache.hadoop.yarn.server.resourcemanager.ResourceManager. Program will exit.
Re: Container Launching
Any solutions? On Fri, Jan 6, 2012 at 5:15 PM, raghavendhra rahul raghavendhrara...@gmail.com wrote: Hi all, I am trying to write an application master. Is there a way to specify something like node1: 10 containers, node2: 10 containers? Can we specify this kind of list using the application master? Also I set request.setHostName(client), where client is the hostname of a node. I checked the log and found the following error:

java.io.FileNotFoundException: File /local1/yarn/.yarn/local/usercache/rahul_2011/appcache/application_1325760852770_0001 does not exist
 at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:431)
 at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:815)
 at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
 at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
 at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:700)
 at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:697)
 at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
 at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:697)
 at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:122)
 at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:237)
 at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:67)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:636)

That is, containers are only launched on the host where the application master runs, while the other nodes always remain free. Any ideas?
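For reference, a minimal sketch of how per-host container requests are typically built with the 0.23-era AM/RM records API follows. The host names, container counts, memory size and priority value are illustrative only, and the exact setters should be verified against the YARN version in use; this is not the poster's code.

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.util.Records;

public class HostRequests {
    // Build one ResourceRequest per host we want containers on.
    public static ResourceRequest requestOnHost(String host, int numContainers, int memoryMb) {
        ResourceRequest request = Records.newRecord(ResourceRequest.class);
        request.setHostName(host);               // "node1", "node2", or "*" for any node
        request.setNumContainers(numContainers); // e.g. 10 containers on this host

        Priority priority = Records.newRecord(Priority.class);
        priority.setPriority(0);
        request.setPriority(priority);

        Resource capability = Records.newRecord(Resource.class);
        capability.setMemory(memoryMb);          // memory asked for per container
        request.setCapability(capability);
        return request;
    }
}

The list handed to the AM's allocate call would then contain, for example, requestOnHost("node1", 10, 1024) and requestOnHost("node2", 10, 1024); whether the scheduler honours the placement strictly depends on the scheduler configuration.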
how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
Hi all, Our Hadoop cluster has 22 nodes, including one namenode, one jobtracker and 20 datanodes. Each node has 2 * 12 cores with 32G RAM. Could anyone tell me how to configure the following parameters: mapred.tasktracker.map.tasks.maximum mapred.tasktracker.reduce.tasks.maximum regards! 2012-01-09 hao.wang
Using Java Remote Method Invocation to make a UI for Hadoop
Hi, We are trying to make a UI for our HBase + Hadoop applications. Basically, what we want is a web front end that can be hosted on a different server and present data from HBase, as well as allow us to launch MapReduce jobs from the web browser. Our current approach is as follows. We have a 6-node development cluster, with each node running Scientific Linux in a VM, in the following configuration:

Node - Hostname - Daemons
Node 1 - namenode - namenode, secondarynamenode, regionserver, hbase-master, zookeeper
Node 2 - jobtracker - jobtracker, zookeeper
Node 3 - slave0 - zookeeper, datanode, tasktracker
Node 4 - slave1 - datanode, tasktracker
Node 5 - slave2 - datanode, tasktracker
Node 6 - slave3 - datanode, tasktracker

We have a 7th Scientific Linux VM running a Grails web application called BillyWeb. We have created several MapReduce applications that run on the namenode to process HBase data and populate a results table. Currently, these MR apps are run from the command line on namenode. We use the HBase REST interface to query data from the results table and present it in an AJAX-enabled web page. That is the reading part of the problem solved :) Now we need an HTML page with a button that, when clicked, will execute an MR app on the namenode. This forms the writing part of the problem. Currently, we are attempting to do this using Java's Remote Method Invocation (RMI) feature. I have successfully created a Java client-server application where the server program runs on namenode and the client runs on BillyWeb. The client program calls a remote object method to trigger the MR job on namenode and gets the result. We are now in the process of integrating it into the Grails webapp, by adding the source code and calling it using Groovy taglibs. The only remaining issue we are wrestling with is the need to modify the security policy of the web app so that the client portion of the code can reach the remote server. Eventually we hope to have an internal web site that you can visit in any web browser to view custom-made visualisations of the data and execute parameterised MR jobs to process data stored in HBase (amongst other places). We would like to know your thoughts on this approach. In particular, this feels like a potentially convoluted approach to building a front-end that may feature several redundant steps (are we reinventing too many wheels?). Is there an alternative approach that you can think of that might be more sensible? Can you think of any problems with the approach we are currently taking? Thanks, Tom
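As a rough illustration of the RMI piece described above (the interface, class and method names are hypothetical, not the poster's actual code), the remote interface and the server exported on the namenode might look like this minimal sketch using the standard java.rmi API:

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Remote interface shared between the Grails client and the server running on the namenode.
public interface JobLauncher extends Remote {
    // Kicks off the named MR job with the given parameters and returns an identifier for it.
    String launchJob(String jobName, String[] args) throws RemoteException;
}

// Server-side implementation, started on the namenode.
class JobLauncherServer implements JobLauncher {
    public String launchJob(String jobName, String[] args) throws RemoteException {
        // Here the real implementation would build a Configuration/Job and call
        // job.submit() (or ToolRunner.run(...)), then return the job id.
        return "job-id-placeholder";
    }

    public static void main(String[] unused) throws Exception {
        JobLauncher stub = (JobLauncher) UnicastRemoteObject.exportObject(new JobLauncherServer(), 0);
        Registry registry = LocateRegistry.createRegistry(1099); // default RMI registry port
        registry.rebind("JobLauncher", stub);                    // name the client looks up
    }
}

The Grails side would then call LocateRegistry.getRegistry("namenode"), look up "JobLauncher" and invoke launchJob(...), which matches the client/server split described in the message.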
has bzip2 compression been deprecated?
Hi, I'm trying to work out which compression algorithm I should be using in my MapReduce jobs. It seems to me that the best solution is a compromise between speed, efficiency and splittability. The only compression algorithm to handle file splits (according to Hadoop: The Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of compression speed. However, I see from the documentation at http://hadoop.apache.org/common/docs/current/native_libraries.html that the bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, see http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html - however the bzip2 Codec is still in the API at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html. Has bzip2 support been removed from Hadoop, or will it be removed soon? Thanks, Tony
Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
Hi, Please read http://hadoop.apache.org/common/docs/current/single_node_setup.html to learn how to configure Hadoop using the various *-site.xml configuration files, and then follow http://hadoop.apache.org/common/docs/current/cluster_setup.html to achieve optimal configs for your cluster.
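As a concrete illustration of the kind of setting those guides describe, the two properties from the question go into mapred-site.xml on each tasktracker node. The values below are only an example of how one might size slots for a 24-core / 32 GB node, not a recommendation from this thread:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>16</value>
  <description>Map slots per tasktracker; sized against cores and memory per node.</description>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
  <description>Reduce slots per tasktracker.</description>
</property>

These are daemon-side settings, so the tasktrackers need a restart for them to take effect.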
Re: has bzip2 compression been deprecated?
Bzip2 is pretty slow. You probably do not want to use it, even if it does support file splits (a feature not available in the stable 0.20.x/1.x line, but available in 0.22+). To answer your question though, bzip2 was removed from that document because it isn't a native library (it's pure Java). I think bzip2 was added there earlier due to an oversight, as even 0.20 did not have a native bzip2 library. This change in the docs does not mean that bzip2 is deprecated -- it is still fully supported and available in trunk as well. See https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes that led to this. The best way would be to use either:
(a) Hadoop sequence files with any compression codec of choice (best would be lzo, gz, maybe even snappy). This file format is built for HDFS and MR and is splittable. Another choice would be Avro DataFiles from the Apache Avro project.
(b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo (and hadoop-lzo-packager for packages). This requires you to run an indexing operation before the .lzo can be made splittable, but works great with this extra step added.
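To make option (a) concrete, here is a minimal sketch of a job that writes block-compressed SequenceFile output with the new (org.apache.hadoop.mapreduce) API. The codec shown is gzip purely for illustration; an LZO or Snappy codec class would be substituted if installed, and the identity mapper/reducer defaults are kept to keep the sketch short.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressedSeqFileJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "compressed-seqfile-output");
        job.setJarByClass(CompressedSeqFileJob.class);
        // Identity mapper/reducer over text input: keys are offsets, values are lines.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        // SequenceFiles stay splittable regardless of which codec is used inside them.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); // or an LZO/Snappy codec
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}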
RE: has bzip2 compression been deprecated?
Thanks for the quick reply and the clarification about the documentation. Regarding sequence files: am I right in thinking that they're a good choice for intermediate steps in chained MR jobs, or for file transfer between the Map and Reduce phases of a job, but that they shouldn't be used for human-readable files at the end of one or more MapReduce jobs? How about if the only use of a job's output is analysis via Hive - can Hive create tables from sequence files? Tony
increase number of map tasks
Hello, In HDFS we have set the block size to 40 bytes. The input data set is as below, terminated with line feeds:

data1 (5*8=40 bytes)
data2
..
...
data10

But still we see only 2 map tasks spawned; there should have been at least 10 map tasks. Each mapper performs a complex mathematical computation. Not sure how this works internally - splitting on line feeds does not work. Even with the settings below the number of map tasks never goes beyond 2. Is there any way to make this spawn 10 tasks? Basically it should behave like a compute grid - computation in parallel.

<property>
  <name>io.bytes.per.checksum</name>
  <value>30</value>
  <description>The number of bytes per checksum. Must not be larger than io.file.buffer.size.</description>
</property>
<property>
  <name>dfs.block.size</name>
  <value>30</value>
  <description>The default block size for new files.</description>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>10</value>
  <description>The maximum number of map tasks that will be run simultaneously by a task tracker.</description>
</property>

This is a single node with a high configuration - 8 CPUs and 8 GB memory - hence the example of 10 data items with line feeds. We want to utilize the full power of the machine, so we want at least 10 map tasks; each task needs to perform a highly complex mathematical simulation. At present it looks like split size (in bytes) is the only way to specify the number of map tasks, but I would prefer some criteria like line feeds or whatever. How do we get 10 map tasks from the above configuration - please help. thanks -- View this message in context: http://old.nabble.com/increase-number-of-map-tasks-tp33107775p33107775.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
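For the "split on line feeds" behaviour asked for above, one option not mentioned in the original post (so treat it as a suggestion to verify rather than the thread's answer) is NLineInputFormat, which hands each map task a fixed number of input lines regardless of block or split size. A minimal sketch with the old (org.apache.hadoop.mapred) API, where the mapper class is a placeholder for the class doing the heavy computation:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class OneLinePerMapper {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(OneLinePerMapper.class);
        conf.setJobName("one-line-per-map");
        // Each map task receives exactly one input line, so a 10-line file yields 10 map tasks.
        conf.setInputFormat(NLineInputFormat.class);
        conf.setInt("mapred.line.input.format.linespermap", 1);
        conf.setMapperClass(IdentityMapper.class); // replace with the simulation mapper
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}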
TaskTracker: exception thrown when changing perms fails - hadoop on Windows
To Whom It May Concern: I am running hadoop version 0.20.203.0 (downloaded from http://hadoop.apache.org/common/), Java version 1.7.0_02, Cygwin and ssh on a Windows XP OS. Overall, the problems I have seen in the log files involve unexpected directory permission settings of rwxrwxrwxt when rwxr-xr-x was expected. In all but the TaskTracker, I was able to solve this problem manually simply by executing chmod -R 755. In the case of the TaskTracker, although I manually changed the permissions, when I run start-mapred.sh it apparently recreates this part of the directory tree with the rwxrwxrwxt perms, then programmatically attempts to change the permissions to 755. An exception is thrown (please see the log file contents below) when this attempt is made. I'm at a loss currently on how to rectify this. Note, I'm just beginning with hadoop and do not really have any background. I'm working my way through Tom White's "Hadoop: The Definitive Guide". Thanks, Mike Freeman P.S. I should mention that my other log files are now error-free (after my manually changing permissions) and I am able to run hadoop commands from the cygwin command line.

2012-01-07 14:42:15,046 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2012-01-07 14:42:15,203 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2012-01-07 14:42:15,218 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2012-01-07 14:42:15,218 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: TaskTracker metrics system started
2012-01-07 14:42:15,656 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2012-01-07 14:42:15,656 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2012-01-07 14:42:15,859 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2012-01-07 14:42:15,984 INFO org.apache.hadoop.http.HttpServer: Added global filter safety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
2012-01-07 14:42:16,031 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-01-07 14:42:16,031 INFO org.apache.hadoop.mapred.TaskTracker: Starting tasktracker with owner as SYSTEM
2012-01-07 14:42:16,046 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.io.IOException: Failed to set permissions of path: /tmp/hadoop-SYSTEM/mapred/local/taskTracker to 0755
 at org.apache.hadoop.fs.RawLocalFileSystem.checkReturnValue(RawLocalFileSystem.java:525)
 at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:507)
 at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:318)
 at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:183)
 at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:630)
 at org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:1328)
 at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3430)
2012-01-07 14:42:16,046 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG: Shutting down TaskTracker
Re: has bzip2 compression been deprecated?
Tony,
* Yeah, SequenceFiles aren't human-readable, but fs -text can read them out (instead of a plain fs -cat). But if you are gonna export your files into a system you do not have much control over, it's probably best for the resultant files not to be in SequenceFile/Avro-DataFile format.
* Intermediate (M-to-R) files use a custom IFile format these days, which is built purely for that purpose.
* Hive can use SequenceFiles very well. There is also documented info on this in Hive's wiki pages (check the DDL pages, IIRC).
Re: dual power for hadoop in datacenter?
Be aware that if half of your cluster goes down, depending on the version and configuration of Hadoop, there may be a replication storm as Hadoop tries to bring everything back up to the proper replication factor. Your cluster may still be unusable in this case. --Bobby Evans

On 1/7/12 2:55 PM, Alexander Lorenz wget.n...@googlemail.com wrote: NN, SN and JT must have separate power supplies; for the entire cluster, dual supplies are recommended. For HBase and ZooKeeper servers / regionservers, dual supplies with separate power lines are also recommended. - Alex sent via my mobile device

On Jan 7, 2012, at 11:23 AM, Koert Kuipers ko...@tresata.com wrote: What are the thoughts on running a hadoop cluster in a datacenter with respect to power? Should all the boxes have redundant power supplies and be on dual power? Or just dual power for the namenode, secondary namenode, and hbase master, and then perhaps switch the power source per rack for the slaves to provide resilience to a power failure? Or even just run everything on single power and accept the risk that everything can go down at once?
Re: has bzip2 compression been deprecated?
Tony, snappy is also available: http://code.google.com/p/hadoop-snappy/ best, Alex -- Alexander Lorenz http://mapredit.blogspot.com
RE: Using Java Remote Method Invocation to make a UI for Hadoop
Thanks Harsh J, I've had a crack at setting up HUE and I now remember why we didn't go for it. It appears that HUE wants to work with Hadoop 0.20 and we are using Hadoop 1.0 and HBase 0.92... I don't suppose there are any other efforts like HUE out there that might be more compatible with our setup? Or can anyone see anything wrong with using Grails and RMI to talk to our namenode? Cheers, Tom

From: Harsh J [ha...@cloudera.com] Sent: 09 January 2012 14:39 To: Tom Wilcox Subject: Re: Using Java Remote Method Invocation to make a UI for Hadoop
Hey Tom, Just wondering, would http://github.com/cloudera/hue have not helped you at all?
can't run a simple mapred job
Hi, I have an HBase/Hadoop setup on my instance in AWS. I am able to run the simple wordcount map reduce example but not a custom one that I wrote. Here is the error that I get:

[ec2-user@ip-10-68-145-124 bin]$ hadoop jar HBaseTest.jar com.akanksh.information.hbasetest.HBaseSweeper
12/01/09 11:27:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/01/09 11:27:27 INFO mapred.JobClient: Cleaning up the staging area hdfs://ip-10-68-145-124.ec2.internal:9100/media/ephemeral1/hadoop/mapred/staging/ec2-user/.staging/job_201112151554_0006
Exception in thread "main" java.lang.RuntimeException: java.lang.InstantiationException
 at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:869)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:416)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
 at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:506)
 at com.akanksh.information.hbasetest.HBaseSweeper.main(HBaseSweeper.java:86)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:616)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: java.lang.InstantiationException
 at sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
 at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:113)
 ... 14 more

Does this sound like a standard problem? Here is my main method - nothing much in it:

public static void main(String args[]) throws Exception {
    Configuration config = HBaseConfiguration.create();
    Job job = new Job(config, "HBaseSweeper");
    job.setJarByClass(HBaseSweeper.class);
    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false);
    TableMapReduceUtil.initTableMapperJob(TABLE_NAME, scan, SweeperMapper.class,
        ImmutableBytesWritable.class, Delete.class, job);
    job.setOutputFormatClass(FileOutputFormat.class);
    job.setInputFormatClass(TextInputFormat.class);
    boolean b = job.waitForCompletion(true);
    if (!b) {
        throw new IOException("error with job!");
    }
}

Can someone help? I will really appreciate it. Thanks, vinod
Re: has bzip2 compression been deprecated?
Hi Tony, Adding on to Harsh's comments: if you want the generated sequence files to be utilized by a Hive table, define your Hive table as

CREATE EXTERNAL TABLE tableName(col1 INT, col2 STRING)
...
...
STORED AS SEQUENCEFILE;

Regards Bejoy.K.S
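A slightly fuller version of that DDL, for illustration only (the column list, the delimiter and the HDFS path are assumed, not taken from this thread), points the external table directly at the directory the MapReduce job wrote its sequence files to:

CREATE EXTERNAL TABLE my_results (col1 INT, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION '/user/hadoop/output/my_job';

Because the table is EXTERNAL, Hive reads the files in place; the sequence file keys are ignored and each value is parsed according to the row format, so the row format must match what the job actually wrote.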
Re: increase number of map tasks
Hi Satish, What is your value for mapred.max.split.size? Try setting these values as well:

mapred.min.split.size=0 (this is the default value)
mapred.max.split.size=40

Try executing your job once you apply these changes on top of the others you made. Regards Bejoy.K.S
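If it is easier to set these per job rather than in the XML, the same two values can be applied on the job's Configuration before submission. A minimal sketch follows; the property names are the ones read by the new-API FileInputFormat in this Hadoop generation, so verify them against the version in use, and the job name is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("mapred.min.split.size", 0L);   // the default
        conf.setLong("mapred.max.split.size", 40L);  // cap each split at 40 bytes, one data item per map
        Job job = new Job(conf, "small-split-job");
        // ... the rest of the job setup (mapper, input/output paths) goes here as usual.
    }
}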
RE: has bzip2 compression been deprecated?
Thanks Bejoy - I'm fairly new to Hive so I may be wrong here, but I was under the impression that the STORED AS part of a CREATE TABLE in Hive refers to how the data in the table will be stored once the table is created, rather than the compression format of the data used to populate the table. Can you clarify which is the correct interpretation? If it's the latter, how would I read a sequence file into a Hive table? Thanks, Tony
connection between slaves and master
Hello guys, I'm requesting a number of machines from a PBS scheduler to run Hadoop, and even though all Hadoop daemons start normally on the master and slaves, the slaves don't have worker tasks in them. Digging into that, there seems to be some blocking between nodes (?) - I don't know how to describe it except that on a slave, if I telnet to the master node it should be able to connect, but I get this error:

[mark@node67 ~]$ telnet node77
Trying 192.168.1.77...
telnet: connect to address 192.168.1.77: Connection refused
telnet: Unable to connect to remote host: Connection refused

The log at the slave nodes shows the same thing, even though the datanode and tasktracker were started from the master (?):

2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s).
2012-01-09 10:04:04,439 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 1 time(s).
2012-01-09 10:04:05,442 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 2 time(s).
2012-01-09 10:04:06,444 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 3 time(s).
2012-01-09 10:04:07,446 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 4 time(s).
2012-01-09 10:04:08,448 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 5 time(s).
2012-01-09 10:04:09,450 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 6 time(s).
2012-01-09 10:04:10,452 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 7 time(s).
2012-01-09 10:04:11,454 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 8 time(s).
2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 9 time(s).
2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.RPC: Server at localhost/127.0.0.1:12123 not available yet, Z...

Any suggestions of what I can do? Thanks, Mark
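The retries above go to localhost/127.0.0.1:12123, which suggests the slave-side configuration still points at localhost rather than at the master. As an illustration only (the hostname is taken from the telnet attempt above, the ports are assumed, and which of the two services was meant to listen on 12123 is not clear from the log), the addresses on every node would normally name the master host:

<property>
  <name>fs.default.name</name>
  <value>hdfs://node77:12123</value>   <!-- master host, not localhost; port as actually configured -->
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>node77:12124</value>          <!-- jobtracker address; port here is a placeholder -->
</property>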
Netstat Shows Port 8020 Doesn't Seem to Listen
Hi, I've been googling, but haven't been able to find an answer. I'm currently trying to set up Hadoop in pseudo-distributed mode as a first step. I'm using the Cloudera distro and installed everything through YUM on CentOS 5.7. I can run everything just fine from my one node itself (hadoop fs -ls /, test map-red jobs, etc...), but can't get a remote client to connect to it. I'm pretty sure the cause of that is that ports 8020 and 8021 do not seem to be listening (when I do a netstat -a they don't show up, while all the other Hadoop-related ports like 50030 and 50070 do show up). I verified that the firewall allows connections over 8020 and 8021 for TCP, and I can connect through my web browser to 50030 and 50070. Looking at the namenode log, I see the following error which looks suspicious and related to me:

2012-01-09 12:03:38,000 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8020: starting
2012-01-09 12:03:38,009 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020: starting
2012-01-09 12:03:39,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8020: starting
2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020: starting
2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8020: starting
2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020: starting
2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020: starting
2012-01-09 12:03:39,189 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020: starting
2012-01-09 12:03:39,189 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: starting
2012-01-09 12:03:39,189 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 8020: starting
2012-01-09 12:03:39,246 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020: starting
2012-01-09 12:03:39,258 WARN org.apache.hadoop.util.PluginDispatcher: Unable to load dfs.namenode.plugins plugins
2012-01-09 12:03:40,254 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 8020, call addBlock(/var/lib/hadoop-0.20/cache/mapred/mapred/system/jobtracker.info, DFSClient_-1779116177, null) from 127.0.0.1:39785: error: java.io.IOException: File /var/lib/hadoop-0.20/cache/mapred/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1

Anyone have any idea what my problem might be? Cheers, Eli
Adding a soft-linked archive file to the distributed cache doesn't work as advertised
I am trying to add a zip file to the distributed cache and have it unzipped on the task nodes with a softlink to the unzipped directory placed in the working directory of my mapper process. I think I'm doing everything the way the documentation tells me to, but it's not working. On the client in the run() function while I'm creating the job I first call: fs.copyFromLocalFile(gate-app.zip, /tmp/gate-app.zip); As expected, this copies the archive file gate-app.zip to the HDFS directory /tmp. Then I call DistributedCache.addCacheArchive(/tmp/gate-app.zip#gate-app, configuration); I expect this to add /tmp/gate-app.zip to the distributed cache and put a softlink to it called gate-app in the working directory of each task. However, when I call job.waitForCompletion(), I see the following error: Exception in thread main java.io.FileNotFoundException: File does not exist: /tmp/gate-app.zip#gate-app. It appears that the distributed cache mechanism is interpreting the entire URI as the literal name of the file, instead of treating the fragment as the name of the softlink. As far as I can tell, I'm doing this correctly according to the API documentation: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html . The full project in which I'm doing this is up on github: https://github.com/wpm/Hadoop-GATE. Can someone tell me what I'm doing wrong?
Re: Netstat Shows Port 8020 Doesn't Seem to Listen
More info: In the DataNode log, I'm also seeing: 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 9 time(s). Why would things just not load on port 8020? I feel like all the errors I'm seeing are caused by this, but I can't see any errors about why this occurred in the first place. On 1/9/12 1:14 PM, Eli Finkelshteyn wrote: Hi, I've been googling, but haven't been able to find an answer. I'm currently trying to setup Hadoop in pseudo-distributed mode as a first step. I'm using the Cloudera distro and installed everything through YUM on CentOS 5.7. I can run everything just fine from my one node itself (hadoop fs -ls /, test map-red jobs, etc...), but can't get a remote client to be able to connect to it. I'm pretty sure the cause of that has to do with the fact that port 8020 and port 8021 do not seem to be listening (when I do a netstat -a, they don't show up-- all the other Hadoop related ports like 50030 and 50070 do show up). I verified that the firewall allows connections over 8020 and 8021 for tcp, and can connect through my web browser to 50030 and 50070. Looking at the namenode log, I see the following error which looks suspicious and related to me: 2012-01-09 12:03:38,000 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8020: starting 2012-01-09 12:03:38,009 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020: starting 2012-01-09 12:03:39,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8020: starting 2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020: starting 2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8020: starting 2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020: starting 2012-01-09 12:03:39,188 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020: starting 2012-01-09 12:03:39,189 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020: starting 2012-01-09 12:03:39,189 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: starting 2012-01-09 12:03:39,189 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 8020: starting 2012-01-09 12:03:39,246 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020: starting 2012-01-09 12:03:39,258 WARN org.apache.hadoop.util.PluginDispatcher: Unable to load dfs.namenode.plugins plugins 2012-01-09 12:03:40,254 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 8020, call addBlock(/var/lib/hadoop-0.20/cache/mapred/mapred/system/jobtracker.info, DFSClient_-1779116177, null) from 127.0.0.1:39785: error: java.io.IOException: File /var/lib/hadoop-0.20/cache/mapred/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1 Anyone have any idea what my problem might be? Cheers, Eli
Re: has bzip2 compression been deprecated?
Hi Tony As I understand your requirement, your mapreduce job produces a Sequence File as output and you need to use this file as an input to a hive table. When you CREATE an EXTERNAL TABLE in hive you specify a location where your data is stored and also what the format of that data is (like the field delimiter, row delimiter, file type etc of your data). You are not actually loading data anywhere when you create a hive external table (issue DDL), just specifying where the data lies in the file system; in fact there is not even any validation performed at that time to check the data quality. When you Query/Retrieve your data through Hive QLs, the parameters specified along with CREATE TABLE such as ROW FORMAT, FIELDS TERMINATED, STORED AS etc are used to execute the right MAP REDUCE job(s). In short, STORED AS refers to the type of files that a table's data directory holds. For details https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable Hope it helps!.. Regards Bejoy.K.S On Mon, Jan 9, 2012 at 11:32 PM, Tony Burton tbur...@sportingindex.com wrote: Thanks Bejoy - I'm fairly new to Hive so may be wrong here, but I was under the impression that the STORED AS part of a CREATE TABLE in Hive refers to how the data in the table will be stored once the table is created, rather than the compression format of the data used to populate the table. Can you clarify which is the correct interpretation? If it's the latter, how would I read a sequence file into a Hive table? Thanks, Tony -Original Message- From: Bejoy Ks [mailto:bejoy.had...@gmail.com] Sent: 09 January 2012 17:33 To: common-user@hadoop.apache.org Subject: Re: has bzip2 compression been deprecated? Hi Tony Adding on to Harsh's comments. If you want the generated sequence files to be utilized by a hive table. Define your hive table as CREATE EXTERNAL TABLE tableNAme(col1 INT, c0l2 STRING) ... ... STORED AS SEQUENCEFILE; Regards Bejoy.K.S On Mon, Jan 9, 2012 at 10:32 PM, alo.alt wget.n...@googlemail.com wrote: Tony, snappy is also available: http://code.google.com/p/hadoop-snappy/ best, Alex -- Alexander Lorenz http://mapredit.blogspot.com On Jan 9, 2012, at 8:49 AM, Harsh J wrote: Tony, * Yeah, SequenceFiles aren't human-readable, but fs -text can read it out (instead of a plain fs -cat). But if you are gonna export your files into a system you do not have much control over, probably best to have the resultant files not be in SequenceFile/Avro-DataFile format. * Intermediate (M-to-R) files use a custom IFile format these days, which is built purely for that purpose. * Hive can use SequenceFiles very well. There is also documented info on this in the Hive's wiki pages (Check the DDL pages, IIRC). On 09-Jan-2012, at 9:44 PM, Tony Burton wrote: Thanks for the quick reply and the clarification about the documentation. Regarding sequence files: am I right in thinking that they're a good choice for intermediate steps in chained MR jobs, or for file transfer between the Map and the Reduce phases of a job; but they shouldn't be used for human-readable files at the end of one or more MapReduce jobs? How about if the only use a job's output is analysis via Hive - can Hive create tables from sequence files? Tony -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: 09 January 2012 15:34 To: common-user@hadoop.apache.org Subject: Re: has bzip2 compression been deprecated? Bzip2 is pretty slow.
You probably do not want to use it, even if it does file splits (a feature not available in the stable line of 0.20.x/1.x, but available in 0.22+). To answer your question though, bzip2 was removed from that document cause it isn't a native library (its pure Java). I think bzip2 was added earlier due to an oversight, as even 0.20 did not have a native bzip2 library. This change in docs does not mean that BZip2 is deprecated -- it is still fully supported and available in the trunk as well. See https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes that led to this. The best way would be to use either: (a) Hadoop sequence files with any compression codec of choice (best would be lzo, gz, maybe even snappy). This file format is built for HDFS and MR and is splittable. Another choice would be Avro DataFiles from the Apache Avro project. (b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo(and hadoop-lzo-packager for packages). This requires you to run indexing operations before the .lzo can be made splittable, but works great with this extra step added. On 09-Jan-2012, at 7:17 PM, Tony Burton wrote: Hi, I'm trying to work out which compression algorithm I should be using in my MapReduce jobs. It seems to me that the best
Re: Adding a soft-linked archive file to the distributed cache doesn't work as advertised
Bill, In addition you must call DistributedCache.createSymlink(configuration), that should do. Thxs. Alejandro On Mon, Jan 9, 2012 at 10:30 AM, W.P. McNeill bill...@gmail.com wrote: I am trying to add a zip file to the distributed cache and have it unzipped on the task nodes with a softlink to the unzipped directory placed in the working directory of my mapper process. I think I'm doing everything the way the documentation tells me to, but it's not working. On the client in the run() function while I'm creating the job I first call: fs.copyFromLocalFile(gate-app.zip, /tmp/gate-app.zip); As expected, this copies the archive file gate-app.zip to the HDFS directory /tmp. Then I call DistributedCache.addCacheArchive(/tmp/gate-app.zip#gate-app, configuration); I expect this to add /tmp/gate-app.zip to the distributed cache and put a softlink to it called gate-app in the working directory of each task. However, when I call job.waitForCompletion(), I see the following error: Exception in thread main java.io.FileNotFoundException: File does not exist: /tmp/gate-app.zip#gate-app. It appears that the distributed cache mechanism is interpreting the entire URI as the literal name of the file, instead of treating the fragment as the name of the softlink. As far as I can tell, I'm doing this correctly according to the API documentation: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html . The full project in which I'm doing this is up on github: https://github.com/wpm/Hadoop-GATE. Can someone tell me what I'm doing wrong?
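For reference, a minimal client-side sketch of the sequence Alejandro describes, assuming the 0.20-era mapreduce API; GateAppDriver is a made-up class name and the paths are just the ones from Bill's example, so treat this as an illustration rather than the project's actual code:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class GateAppDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // ship the local archive to HDFS first
    fs.copyFromLocalFile(new Path("gate-app.zip"), new Path("/tmp/gate-app.zip"));
    // register the archive; the URI fragment is the name of the symlink that
    // should appear in each task's working directory once symlinking is enabled
    DistributedCache.addCacheArchive(new URI("/tmp/gate-app.zip#gate-app"), conf);
    DistributedCache.createSymlink(conf);
    Job job = new Job(conf, "gate-app example");
    // mapper, reducer, input and output paths would be set here
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that the Job is created after the cache calls, since the Job constructor copies the Configuration it is given; cache settings made on conf afterwards would not be picked up.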
Re: Netstat Shows Port 8020 Doesn't Seem to Listen
Positive. Like I said before, netstat -a | grep 8020 gives me nothing. Even if the firewall was the problem, that should still give me output that the port is listening, but I'd just be unable to hit it from an outside box (I tested this by blocking port 50070, at which point it still showed up in netstat -a, but was inaccessible through http from a remote machine). This problem is something else. On 1/9/12 2:31 PM, zGreenfelder wrote: On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyniefin...@gmail.com wrote: More info: In the DataNode log, I'm also seeing: 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 9 time(s). Why would things just not load on port 8020? I feel like all the errors I'm seeing are caused by this, but I can't see any errors about why this occurred in the first place. are you sure there isn't a firewall in place blocking port 8020? e.g. iptables on the local machines? if you do telnet localhost 8020 do you make a connection? if you use lsof and/or netstat can you see the port open? if you have root access you can try turning off the firewall with iptables -F to see if things work without firewall rules.
Re: Adding a soft-linked archive file to the distributed cache doesn't work as advertised
I added a DistributedCache.createSymlink(configuration) call right after the addCacheArchive() call, but see the same error. On Mon, Jan 9, 2012 at 11:05 AM, Alejandro Abdelnur t...@cloudera.com wrote: Bill, In addition you must call DistributedCache.createSymlink(configuration), that should do. Thxs. Alejandro On Mon, Jan 9, 2012 at 10:30 AM, W.P. McNeill bill...@gmail.com wrote: I am trying to add a zip file to the distributed cache and have it unzipped on the task nodes with a softlink to the unzipped directory placed in the working directory of my mapper process. I think I'm doing everything the way the documentation tells me to, but it's not working. On the client in the run() function while I'm creating the job I first call: fs.copyFromLocalFile(gate-app.zip, /tmp/gate-app.zip); As expected, this copies the archive file gate-app.zip to the HDFS directory /tmp. Then I call DistributedCache.addCacheArchive(/tmp/gate-app.zip#gate-app, configuration); I expect this to add /tmp/gate-app.zip to the distributed cache and put a softlink to it called gate-app in the working directory of each task. However, when I call job.waitForCompletion(), I see the following error: Exception in thread main java.io.FileNotFoundException: File does not exist: /tmp/gate-app.zip#gate-app. It appears that the distributed cache mechanism is interpreting the entire URI as the literal name of the file, instead of treating the fragment as the name of the softlink. As far as I can tell, I'm doing this correctly according to the API documentation: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html . The full project in which I'm doing this is up on github: https://github.com/wpm/Hadoop-GATE. Can someone tell me what I'm doing wrong?
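If and when the symlink does get created, the task side can treat gate-app as a plain directory in the current working directory. A rough sketch of that, with made-up Mapper type parameters and a hypothetical resources.xml file inside the archive:

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GateAppMapper extends Mapper<LongWritable, Text, Text, Text> {
  private File gateAppDir;

  @Override
  protected void setup(Context context) throws IOException {
    // the fragment from the cache archive URI becomes a symlink in the
    // task's current working directory
    gateAppDir = new File("gate-app");
    if (!gateAppDir.isDirectory()) {
      throw new IOException("gate-app symlink missing in " + new File(".").getAbsolutePath());
    }
    // e.g. read a file shipped inside the archive (hypothetical name)
    File appConfig = new File(gateAppDir, "resources.xml");
    // ... initialize the application from appConfig here ...
  }
}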
Re: Netstat Shows Port 8020 Doesn't Seem to Listen
Eli, What is your fs.default.name set to, in core-site.xml? On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyn iefin...@gmail.com wrote: Positive. Like I said before, netstat -a | grep 8020 gives me nothing. Even if the firewall was the problem, that should still give me output that the port is listening, but I'd just be unable to hit it from an outside box (I tested this by blocking port 50070, at which point it still showed up in netstat -a, but was inaccessible through http from a remote machine). This problem is something else. On 1/9/12 2:31 PM, zGreenfelder wrote: On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyniefin...@gmail.com wrote: More info: In the DataNode log, I'm also seeing: 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 9 time(s). Why would things just not load on port 8020? I feel like all the errors I'm seeing are caused by this, but I can't see any errors about why this occurred in the first place. are you sure there isn't a firewall in place blocking port 8020? e.g. iptables on the local machines? if you do telnet localhost 8020 do you make a connection? if you use lsof and/or netstat can you see the port open? if you have root access you can try turning off the firewall with iptables -F to see if things work without firewall rules. -- Harsh J
Re: Netstat Shows Port 8020 Doesn't Seem to Listen
Hi, Looks like a problem in starting DFS and MR. Can you run 'jps' and see if NN, DN, SNN, JT and TT are running? Also make sure that for pseudo-distributed mode the following entries are present:
1. In core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:8020</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>SOME TMP dir with Read/Write access, not the system temp</value>
</property>
2. In hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
<property>
  <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
  <name>dfs.name.dir</name>
  <value>Local dir with Read/Write access</value>
</property>
3. In mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:8021</value>
</property>
Thanks, -Idris On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyn iefin...@gmail.com wrote: Positive. Like I said before, netstat -a | grep 8020 gives me nothing. Even if the firewall was the problem, that should still give me output that the port is listening, but I'd just be unable to hit it from an outside box (I tested this by blocking port 50070, at which point it still showed up in netstat -a, but was inaccessible through http from a remote machine). This problem is something else. On 1/9/12 2:31 PM, zGreenfelder wrote: On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyn iefin...@gmail.com wrote: More info: In the DataNode log, I'm also seeing: 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 9 time(s). Why would things just not load on port 8020? I feel like all the errors I'm seeing are caused by this, but I can't see any errors about why this occurred in the first place. are you sure there isn't a firewall in place blocking port 8020? e.g. iptables on the local machines? if you do telnet localhost 8020 do you make a connection? if you use lsof and/or netstat can you see the port open? if you have root access you can try turning off the firewall with iptables -F to see if things work without firewall rules.
RE: has bzip2 compression been deprecated?
Out of curiousity, when hive records are compressed, how large is a typical compressed record? Do you have issues where the block size is too small to be compressed efficiently? More generally, I wonder what the smallest desirable compressed record size is in the hadoop universe. - Tim. From: Tony Burton [tbur...@sportingindex.com] Sent: Monday, January 09, 2012 10:02 AM To: common-user@hadoop.apache.org Subject: RE: has bzip2 compression been deprecated? Thanks Bejoy - I'm fairly new to Hive so may be wrong here, but I was under the impression that the STORED AS part of a CREATE TABLE in Hive refers to how the data in the table will be stored once the table is created, rather than the compression format of the data used to populate the table. Can you clarify which is the correct interpretation? If it's the latter, how would I read a sequence file into a Hive table? Thanks, Tony -Original Message- From: Bejoy Ks [mailto:bejoy.had...@gmail.com] Sent: 09 January 2012 17:33 To: common-user@hadoop.apache.org Subject: Re: has bzip2 compression been deprecated? Hi Tony Adding on to Harsh's comments. If you want the generated sequence files to be utilized by a hive table. Define your hive table as CREATE EXTERNAL TABLE tableNAme(col1 INT, c0l2 STRING) ... ... STORED AS SEQUENCEFILE; Regards Bejoy.K.S On Mon, Jan 9, 2012 at 10:32 PM, alo.alt wget.n...@googlemail.com wrote: Tony, snappy is also available: http://code.google.com/p/hadoop-snappy/ best, Alex -- Alexander Lorenz http://mapredit.blogspot.com On Jan 9, 2012, at 8:49 AM, Harsh J wrote: Tony, * Yeah, SequenceFiles aren't human-readable, but fs -text can read it out (instead of a plain fs -cat). But if you are gonna export your files into a system you do not have much control over, probably best to have the resultant files not be in SequenceFile/Avro-DataFile format. * Intermediate (M-to-R) files use a custom IFile format these days, which is built purely for that purpose. * Hive can use SequenceFiles very well. There is also documented info on this in the Hive's wiki pages (Check the DDL pages, IIRC). On 09-Jan-2012, at 9:44 PM, Tony Burton wrote: Thanks for the quick reply and the clarification about the documentation. Regarding sequence files: am I right in thinking that they're a good choice for intermediate steps in chained MR jobs, or for file transfer between the Map and the Reduce phases of a job; but they shouldn't be used for human-readable files at the end of one or more MapReduce jobs? How about if the only use a job's output is analysis via Hive - can Hive create tables from sequence files? Tony -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: 09 January 2012 15:34 To: common-user@hadoop.apache.org Subject: Re: has bzip2 compression been deprecated? Bzip2 is pretty slow. You probably do not want to use it, even if it does file splits (a feature not available in the stable line of 0.20.x/1.x, but available in 0.22+). To answer your question though, bzip2 was removed from that document cause it isn't a native library (its pure Java). I think bzip2 was added earlier due to an oversight, as even 0.20 did not have a native bzip2 library. This change in docs does not mean that BZip2 is deprecated -- it is still fully supported and available in the trunk as well. See https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes that led to this. The best way would be to use either: (a) Hadoop sequence files with any compression codec of choice (best would be lzo, gz, maybe even snappy). 
This file format is built for HDFS and MR and is splittable. Another choice would be Avro DataFiles from the Apache Avro project. (b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo(and hadoop-lzo-packager for packages). This requires you to run indexing operations before the .lzo can be made splittable, but works great with this extra step added. On 09-Jan-2012, at 7:17 PM, Tony Burton wrote: Hi, I'm trying to work out which compression algorithm I should be using in my MapReduce jobs. It seems to me that the best solution is a compromise between speed, efficiency and splittability. The only compression algorithm to handle file splits (according to Hadoop: The Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of compression speed. However, I see from the documentation at http://hadoop.apache.org/common/docs/current/native_libraries.html that the bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, see http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html - however the bzip2 Codec is still in the API at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html . Has bzip2 support been removed from
Re: Netstat Shows Port 8020 Doesn't Seem to Listen
Thanks for the help, Idris. I checked all the confs you mentioned, and all is as it should be. jps gives me: 24226 Jps 24073 TaskTracker 23854 JobTracker 23780 DataNode 23921 NameNode 23995 SecondaryNameNode So that looks good. A majority of this stuff is default as set by Cloudera. Any other ideas? Eli On 1/9/12 3:22 PM, Idris Ali wrote: Hi, Looks like problem in starting DFS and MR, can you run 'jps' and see if NN, DN, SNN, JT and TT are running, also make sure for pseudo-distributed mode, the following entries are present: 1. In core-site.xml property namefs.default.name/name valuehdfs://localhost:8020/value /property property namehadoop.tmp.dir/name valueSOME TMP dir with Read/Write acces not system temp/value /property property 2. In hdfs-site.xml property namedfs.replication/name value1/value /property property namedfs.permissions/name valuefalse/value /property property !-- specify this so that running 'hadoop namenode -format' formats the right dir -- namedfs.name.dir/name valueLocal dir with Read/Write access/value /property 3. In mapred-stie.xml property namemapred.job.tracker/name valuelocalhost:8021/value /property Thanks, -Idris On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyniefin...@gmail.comwrote: Positive. Like I said before, netstat -a | grep 8020 gives me nothing. Even if the firewall was the problem, that should still give me output that the port is listening, but I'd just be unable to hit it from an outside box (I tested this by blocking port 50070, at which point it still showed up in netstat -a, but was inaccessible through http from a remote machine). This problem is something else. On 1/9/12 2:31 PM, zGreenfelder wrote: On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyniefinkel@gmail.**comiefin...@gmail.com wrote: More info: In the DataNode log, I'm also seeing: 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 9 time(s). Why would things just not load on port 8020? I feel like all the errors I'm seeing are caused by this, but I can't see any errors about why this occurred in the first place. are you sure there isn't a firewall in place blocking port 8020? e.g. iptables on the local machines? if you do telnet localhost 8020 do you make a connection? if you use lsof and/or netstat can you see the port open? if you have root access you can try turning off the firewall with iptables -F to see if things work without firewall rules.
RE: has bzip2 compression been deprecated?
I thought it was optional whether hive stored blocks (up to 1MB?) or records. If records, it's not storing individual records? Am I misunderstanding? Maybe I should get off my lazy butt and just check the source code... ;^) - Tim. From: bejoy.had...@gmail.com [bejoy.had...@gmail.com] Sent: Monday, January 09, 2012 1:22 PM To: common-user@hadoop.apache.org Subject: Re: has bzip2 compression been deprecated? Hi Tim When you say in hive a table data is compressed by using LZO or so. It means the file/blocks that contains the records/data are compressed using LZO. The size would be same as the size of file/blocks in hdfs. It is not like records are stored as individual blocks in hive. Hive is just a query parser that parse SQL like queries into MR jobs and run the same on data that lies in HDFS. When you a have larger chained jobs generated with multiple QLs you may end up in more number of small files. There you may go in for enabling merge in hive to get sufficiently larger files by merging thE smaller files as the final output from your queries. This would be better for subsequent MR jobs that operate on the output as well as optimal storage. Hope it helps!.. Regards Bejoy K S -Original Message- From: Tim Broberg tim.brob...@exar.com Date: Mon, 9 Jan 2012 12:27:47 To: common-user@hadoop.apache.orgcommon-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: RE: has bzip2 compression been deprecated? Out of curiousity, when hive records are compressed, how large is a typical compressed record? Do you have issues where the block size is too small to be compressed efficiently? More generally, I wonder what the smallest desirable compressed record size is in the hadoop universe. - Tim. From: Tony Burton [tbur...@sportingindex.com] Sent: Monday, January 09, 2012 10:02 AM To: common-user@hadoop.apache.org Subject: RE: has bzip2 compression been deprecated? Thanks Bejoy - I'm fairly new to Hive so may be wrong here, but I was under the impression that the STORED AS part of a CREATE TABLE in Hive refers to how the data in the table will be stored once the table is created, rather than the compression format of the data used to populate the table. Can you clarify which is the correct interpretation? If it's the latter, how would I read a sequence file into a Hive table? Thanks, Tony -Original Message- From: Bejoy Ks [mailto:bejoy.had...@gmail.com] Sent: 09 January 2012 17:33 To: common-user@hadoop.apache.org Subject: Re: has bzip2 compression been deprecated? Hi Tony Adding on to Harsh's comments. If you want the generated sequence files to be utilized by a hive table. Define your hive table as CREATE EXTERNAL TABLE tableNAme(col1 INT, c0l2 STRING) ... ... STORED AS SEQUENCEFILE; Regards Bejoy.K.S On Mon, Jan 9, 2012 at 10:32 PM, alo.alt wget.n...@googlemail.com wrote: Tony, snappy is also available: http://code.google.com/p/hadoop-snappy/ best, Alex -- Alexander Lorenz http://mapredit.blogspot.com On Jan 9, 2012, at 8:49 AM, Harsh J wrote: Tony, * Yeah, SequenceFiles aren't human-readable, but fs -text can read it out (instead of a plain fs -cat). But if you are gonna export your files into a system you do not have much control over, probably best to have the resultant files not be in SequenceFile/Avro-DataFile format. * Intermediate (M-to-R) files use a custom IFile format these days, which is built purely for that purpose. * Hive can use SequenceFiles very well. There is also documented info on this in the Hive's wiki pages (Check the DDL pages, IIRC). 
On 09-Jan-2012, at 9:44 PM, Tony Burton wrote: Thanks for the quick reply and the clarification about the documentation. Regarding sequence files: am I right in thinking that they're a good choice for intermediate steps in chained MR jobs, or for file transfer between the Map and the Reduce phases of a job; but they shouldn't be used for human-readable files at the end of one or more MapReduce jobs? How about if the only use a job's output is analysis via Hive - can Hive create tables from sequence files? Tony -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: 09 January 2012 15:34 To: common-user@hadoop.apache.org Subject: Re: has bzip2 compression been deprecated? Bzip2 is pretty slow. You probably do not want to use it, even if it does file splits (a feature not available in the stable line of 0.20.x/1.x, but available in 0.22+). To answer your question though, bzip2 was removed from that document cause it isn't a native library (its pure Java). I think bzip2 was added earlier due to an oversight, as even 0.20 did not have a native bzip2 library. This change in docs does not mean that BZip2 is deprecated -- it is still fully supported and available in the trunk as well.
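To make the sequence-file route discussed above concrete, here is a rough sketch, assuming the new-API SequenceFileOutputFormat, Text/LongWritable outputs and gzip as the codec (the class and method names below are illustrative, not something from the thread), of asking a job for block-compressed SequenceFile output:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SeqFileOutputSketch {
  // configures only the output side of a job; mapper, reducer and input
  // setup would happen elsewhere
  public static void configureCompressedOutput(Job job, Path out) {
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    // BLOCK compresses batches of records together, which usually compresses
    // better than compressing each record on its own
    SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
    FileOutputFormat.setOutputPath(job, out);
  }
}

Files written this way stay splittable, and a Hive table declared STORED AS SEQUENCEFILE can sit on top of the output directory, per Bejoy's earlier reply.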
Re: Netstat Shows Port 8020 Doesn't Seem to Listen
A bit more info: When I start up only the namenode by itself, I'm not seeing any errors, but what I am seeing that's really odd is: 2012-01-09 16:48:45,530 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020 2012-01-09 16:48:45,531 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2012-01-09 16:48:45,532 INFO org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2012-01-09 16:48:45,541 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: localhost.localdomain/127.0.0.1:8020 That's despite the fact that doing netstat -a | grep 8020 still returns nothing. To me, that makes absolutely no sense. I feel like I should be getting an error telling me Namenode did not in fact go up on 8020, but I'm not getting that at all. Eli On 1/9/12 3:22 PM, Idris Ali wrote: Hi, Looks like problem in starting DFS and MR, can you run 'jps' and see if NN, DN, SNN, JT and TT are running, also make sure for pseudo-distributed mode, the following entries are present: 1. In core-site.xml property namefs.default.name/name valuehdfs://localhost:8020/value /property property namehadoop.tmp.dir/name valueSOME TMP dir with Read/Write acces not system temp/value /property property 2. In hdfs-site.xml property namedfs.replication/name value1/value /property property namedfs.permissions/name valuefalse/value /property property !-- specify this so that running 'hadoop namenode -format' formats the right dir -- namedfs.name.dir/name valueLocal dir with Read/Write access/value /property 3. In mapred-stie.xml property namemapred.job.tracker/name valuelocalhost:8021/value /property Thanks, -Idris On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyniefin...@gmail.comwrote: Positive. Like I said before, netstat -a | grep 8020 gives me nothing. Even if the firewall was the problem, that should still give me output that the port is listening, but I'd just be unable to hit it from an outside box (I tested this by blocking port 50070, at which point it still showed up in netstat -a, but was inaccessible through http from a remote machine). This problem is something else. On 1/9/12 2:31 PM, zGreenfelder wrote: On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyniefinkel@gmail.**comiefin...@gmail.com wrote: More info: In the DataNode log, I'm also seeing: 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 9 time(s). Why would things just not load on port 8020? I feel like all the errors I'm seeing are caused by this, but I can't see any errors about why this occurred in the first place. are you sure there isn't a firewall in place blocking port 8020? e.g. iptables on the local machines? if you do telnet localhost 8020 do you make a connection? if you use lsof and/or netstat can you see the port open? if you have root access you can try turning off the firewall with iptables -F to see if things work without firewall rules.
Re: Netstat Shows Port 8020 Doesn't Seem to Listen
What happen when you try a telnet localhost 8020? netstat -anl would also useful. best, Alex -- Alexander Lorenz http://mapredit.blogspot.com On Jan 9, 2012, at 2:02 PM, Eli Finkelshteyn wrote: A bit more info: When I start up only the namenode by itself, I'm not seeing any errors, but what I am seeing that's really odd is: 2012-01-09 16:48:45,530 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020 2012-01-09 16:48:45,531 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2012-01-09 16:48:45,532 INFO org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2012-01-09 16:48:45,541 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: localhost.localdomain/127.0.0.1:8020 That's despite the fact that doing netstat -a | grep 8020 still returns nothing. To me, that makes absolutely no sense. I feel like I should be getting an error telling me Namenode did not in fact go up on 8020, but I'm not getting that at all. Eli On 1/9/12 3:22 PM, Idris Ali wrote: Hi, Looks like problem in starting DFS and MR, can you run 'jps' and see if NN, DN, SNN, JT and TT are running, also make sure for pseudo-distributed mode, the following entries are present: 1. In core-site.xml property namefs.default.name/name valuehdfs://localhost:8020/value /property property namehadoop.tmp.dir/name valueSOME TMP dir with Read/Write acces not system temp/value /property property 2. In hdfs-site.xml property namedfs.replication/name value1/value /property property namedfs.permissions/name valuefalse/value /property property !-- specify this so that running 'hadoop namenode -format' formats the right dir -- namedfs.name.dir/name valueLocal dir with Read/Write access/value /property 3. In mapred-stie.xml property namemapred.job.tracker/name valuelocalhost:8021/value /property Thanks, -Idris On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyniefin...@gmail.comwrote: Positive. Like I said before, netstat -a | grep 8020 gives me nothing. Even if the firewall was the problem, that should still give me output that the port is listening, but I'd just be unable to hit it from an outside box (I tested this by blocking port 50070, at which point it still showed up in netstat -a, but was inaccessible through http from a remote machine). This problem is something else. On 1/9/12 2:31 PM, zGreenfelder wrote: On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyniefinkel@gmail.**comiefin...@gmail.com wrote: More info: In the DataNode log, I'm also seeing: 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 9 time(s). Why would things just not load on port 8020? I feel like all the errors I'm seeing are caused by this, but I can't see any errors about why this occurred in the first place. are you sure there isn't a firewall in place blocking port 8020? e.g. iptables on the local machines? if you do telnet localhost 8020 do you make a connection? if you use lsof and/or netstat can you see the port open? if you have root access you can try turning off the firewall with iptables -F to see if things work without firewall rules.
Re: datanode failing to start
gave up and installed version 1. it installed correctly and worked, thought the instructions for setup and the location of scripts and configs are now out of date. D On 1/5/2012 10:25 AM, Dave Kelsey wrote: java version 1.6.0_29 hadoop: 0.20.203.0 I'm attempting to setup the pseudo-distributed config on a mac 10.6.8. I followed the steps from the QuickStart (http://wiki.apache.org./hadoop/QuickStart) and succeeded with Stage 1: Standalone Operation. I followed the steps for Stage 2: Pseudo-distributed Configuration. I set the JAVA_HOME variable in conf/hadoop-env.sh and I changed tools.jar to the location of classes.jar (a mac version of tools.jar) I've modified the three .xml files as described in the QuickStart. ssh'ing to localhost has been configured and works with passwordless authentication. I formatted the namenode with bin/hadoop namenode -format as the instructions say This is what I see when I run bin/start-all.sh root# bin/start-all.sh starting namenode, logging to /Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-namenode-Hoot-2.local.out localhost: starting datanode, logging to /Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-datanode-Hoot-2.local.out localhost: Exception in thread main java.lang.NoClassDefFoundError: server localhost: Caused by: java.lang.ClassNotFoundException: server localhost: at java.net.URLClassLoader$1.run(URLClassLoader.java:202) localhost: at java.security.AccessController.doPrivileged(Native Method) localhost: at java.net.URLClassLoader.findClass(URLClassLoader.java:190) localhost: at java.lang.ClassLoader.loadClass(ClassLoader.java:306) localhost: at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) localhost: at java.lang.ClassLoader.loadClass(ClassLoader.java:247) localhost: starting secondarynamenode, logging to /Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-secondarynamenode-Hoot-2.local.out starting jobtracker, logging to /Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-jobtracker-Hoot-2.local.out localhost: starting tasktracker, logging to /Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-tasktracker-Hoot-2.local.out There are 4 processes running: ps -fax | grep hadoop | grep -v grep | wc -l 4 They are: SecondaryNameNode TaskTracker NameNode JobTracker I've searched to see if anyone else has encountered this and not found anything d p.s. I've also posted this to core-u...@hadoop.apache.org which I've yet to find how to subscribe to.
Re: Netstat Shows Port 8020 Doesn't Seem to Listen
Good call! netstat -anl gives me: tcp0 0 :::127.0.0.1:8020 :::*LISTEN Now it just looks like nothing is running on 8021. And now I'm really confused about why I get no communication over 8020 from the datanode. Just to reiterate, this definitely is not the firewall, running iptables -nvL gives: ... 0 0 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:50070 164 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:50030 0 0 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:8021 164 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:8020 ... On 1/9/12 5:08 PM, alo.alt wrote: What happen when you try a telnet localhost 8020? netstat -anl would also useful. best, Alex -- Alexander Lorenz http://mapredit.blogspot.com On Jan 9, 2012, at 2:02 PM, Eli Finkelshteyn wrote: A bit more info: When I start up only the namenode by itself, I'm not seeing any errors, but what I am seeing that's really odd is: 2012-01-09 16:48:45,530 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020 2012-01-09 16:48:45,531 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2012-01-09 16:48:45,532 INFO org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2012-01-09 16:48:45,541 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: localhost.localdomain/127.0.0.1:8020 That's despite the fact that doing netstat -a | grep 8020 still returns nothing. To me, that makes absolutely no sense. I feel like I should be getting an error telling me Namenode did not in fact go up on 8020, but I'm not getting that at all. Eli On 1/9/12 3:22 PM, Idris Ali wrote: Hi, Looks like problem in starting DFS and MR, can you run 'jps' and see if NN, DN, SNN, JT and TT are running, also make sure for pseudo-distributed mode, the following entries are present: 1. In core-site.xml property namefs.default.name/name valuehdfs://localhost:8020/value /property property namehadoop.tmp.dir/name valueSOME TMP dir with Read/Write acces not system temp/value /property property 2. In hdfs-site.xml property namedfs.replication/name value1/value /property property namedfs.permissions/name valuefalse/value /property property !-- specify this so that running 'hadoop namenode -format' formats the right dir -- namedfs.name.dir/name valueLocal dir with Read/Write access/value /property 3. In mapred-stie.xml property namemapred.job.tracker/name valuelocalhost:8021/value /property Thanks, -Idris On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyniefin...@gmail.comwrote: Positive. Like I said before, netstat -a | grep 8020 gives me nothing. Even if the firewall was the problem, that should still give me output that the port is listening, but I'd just be unable to hit it from an outside box (I tested this by blocking port 50070, at which point it still showed up in netstat -a, but was inaccessible through http from a remote machine). This problem is something else. On 1/9/12 2:31 PM, zGreenfelder wrote: On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyniefinkel@gmail.**comiefin...@gmail.com wrote: More info: In the DataNode log, I'm also seeing: 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 9 time(s). Why would things just not load on port 8020? I feel like all the errors I'm seeing are caused by this, but I can't see any errors about why this occurred in the first place. 
are you sure there isn't a firewall in place blocking port 8020? e.g. iptables on the local machines? if you do telnet localhost 8020 do you make a connection? if you use lsof and/or netstat can you see the port open? if you have root access you can try turning off the firewall with iptables -F to see if things work without firewall rules.
RE: has bzip2 compression been deprecated?
Based on this, it seems like the best approach is just to pick block compression rather than record compression, presumeably for this very reason. https://ccp.cloudera.com/display/CDHDOC/Snappy+Installation Perhaps record compression is the default to prioritize speed... - Tim. From: Tim Broberg [tim.brob...@exar.com] Sent: Monday, January 09, 2012 1:42 PM To: common-user@hadoop.apache.org; bejoy.had...@gmail.com Subject: RE: has bzip2 compression been deprecated? I thought it was optional whether hive stored blocks (up to 1MB?) or records. If records, it's not storing individual records? Am I misunderstanding? Maybe I should get off my lazy butt and just check the source code... ;^) - Tim. From: bejoy.had...@gmail.com [bejoy.had...@gmail.com] Sent: Monday, January 09, 2012 1:22 PM To: common-user@hadoop.apache.org Subject: Re: has bzip2 compression been deprecated? Hi Tim When you say in hive a table data is compressed by using LZO or so. It means the file/blocks that contains the records/data are compressed using LZO. The size would be same as the size of file/blocks in hdfs. It is not like records are stored as individual blocks in hive. Hive is just a query parser that parse SQL like queries into MR jobs and run the same on data that lies in HDFS. When you a have larger chained jobs generated with multiple QLs you may end up in more number of small files. There you may go in for enabling merge in hive to get sufficiently larger files by merging thE smaller files as the final output from your queries. This would be better for subsequent MR jobs that operate on the output as well as optimal storage. Hope it helps!.. Regards Bejoy K S -Original Message- From: Tim Broberg tim.brob...@exar.com Date: Mon, 9 Jan 2012 12:27:47 To: common-user@hadoop.apache.orgcommon-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: RE: has bzip2 compression been deprecated? Out of curiousity, when hive records are compressed, how large is a typical compressed record? Do you have issues where the block size is too small to be compressed efficiently? More generally, I wonder what the smallest desirable compressed record size is in the hadoop universe. - Tim. From: Tony Burton [tbur...@sportingindex.com] Sent: Monday, January 09, 2012 10:02 AM To: common-user@hadoop.apache.org Subject: RE: has bzip2 compression been deprecated? Thanks Bejoy - I'm fairly new to Hive so may be wrong here, but I was under the impression that the STORED AS part of a CREATE TABLE in Hive refers to how the data in the table will be stored once the table is created, rather than the compression format of the data used to populate the table. Can you clarify which is the correct interpretation? If it's the latter, how would I read a sequence file into a Hive table? Thanks, Tony -Original Message- From: Bejoy Ks [mailto:bejoy.had...@gmail.com] Sent: 09 January 2012 17:33 To: common-user@hadoop.apache.org Subject: Re: has bzip2 compression been deprecated? Hi Tony Adding on to Harsh's comments. If you want the generated sequence files to be utilized by a hive table. Define your hive table as CREATE EXTERNAL TABLE tableNAme(col1 INT, c0l2 STRING) ... ... 
STORED AS SEQUENCEFILE; Regards Bejoy.K.S On Mon, Jan 9, 2012 at 10:32 PM, alo.alt wget.n...@googlemail.com wrote: Tony, snappy is also available: http://code.google.com/p/hadoop-snappy/ best, Alex -- Alexander Lorenz http://mapredit.blogspot.com On Jan 9, 2012, at 8:49 AM, Harsh J wrote: Tony, * Yeah, SequenceFiles aren't human-readable, but fs -text can read it out (instead of a plain fs -cat). But if you are gonna export your files into a system you do not have much control over, probably best to have the resultant files not be in SequenceFile/Avro-DataFile format. * Intermediate (M-to-R) files use a custom IFile format these days, which is built purely for that purpose. * Hive can use SequenceFiles very well. There is also documented info on this in the Hive's wiki pages (Check the DDL pages, IIRC). On 09-Jan-2012, at 9:44 PM, Tony Burton wrote: Thanks for the quick reply and the clarification about the documentation. Regarding sequence files: am I right in thinking that they're a good choice for intermediate steps in chained MR jobs, or for file transfer between the Map and the Reduce phases of a job; but they shouldn't be used for human-readable files at the end of one or more MapReduce jobs? How about if the only use a job's output is analysis via Hive - can Hive create tables from sequence files? Tony -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: 09 January 2012 15:34 To: common-user@hadoop.apache.org Subject: Re: has bzip2 compression been deprecated? Bzip2 is
Re: Netstat Shows Port 8020 Doesn't Seem to Listen
Firewall online? and be sure that in /etc/hosts ONLY 127.0.0.1 is linked to localhost. Nothing like YOURHOSTNAME.YOURDOMAIN (Redhat kudzu bug) - Alex -- Alexander Lorenz http://mapredit.blogspot.com On Jan 9, 2012, at 2:39 PM, Eli Finkelshteyn wrote: Good call! netstat -anl gives me: tcp0 0 :::127.0.0.1:8020 :::* LISTEN Now it just looks like nothing is running on 8021. And now I'm really confused about why I get no communication over 8020 from the datanode. Just to reiterate, this definitely is not the firewall, running iptables -nvL gives: ... 0 0 ACCEPT tcp -- * * 0.0.0.0/00.0.0.0/0 state NEW tcp dpt:50070 164 ACCEPT tcp -- * * 0.0.0.0/00.0.0.0/0 state NEW tcp dpt:50030 0 0 ACCEPT tcp -- * * 0.0.0.0/00.0.0.0/0 state NEW tcp dpt:8021 164 ACCEPT tcp -- * * 0.0.0.0/00.0.0.0/0 state NEW tcp dpt:8020 ... On 1/9/12 5:08 PM, alo.alt wrote: What happen when you try a telnet localhost 8020? netstat -anl would also useful. best, Alex -- Alexander Lorenz http://mapredit.blogspot.com On Jan 9, 2012, at 2:02 PM, Eli Finkelshteyn wrote: A bit more info: When I start up only the namenode by itself, I'm not seeing any errors, but what I am seeing that's really odd is: 2012-01-09 16:48:45,530 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020 2012-01-09 16:48:45,531 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2012-01-09 16:48:45,532 INFO org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2012-01-09 16:48:45,541 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: localhost.localdomain/127.0.0.1:8020 That's despite the fact that doing netstat -a | grep 8020 still returns nothing. To me, that makes absolutely no sense. I feel like I should be getting an error telling me Namenode did not in fact go up on 8020, but I'm not getting that at all. Eli On 1/9/12 3:22 PM, Idris Ali wrote: Hi, Looks like problem in starting DFS and MR, can you run 'jps' and see if NN, DN, SNN, JT and TT are running, also make sure for pseudo-distributed mode, the following entries are present: 1. In core-site.xml property namefs.default.name/name valuehdfs://localhost:8020/value /property property namehadoop.tmp.dir/name valueSOME TMP dir with Read/Write acces not system temp/value /property property 2. In hdfs-site.xml property namedfs.replication/name value1/value /property property namedfs.permissions/name valuefalse/value /property property !-- specify this so that running 'hadoop namenode -format' formats the right dir -- namedfs.name.dir/name valueLocal dir with Read/Write access/value /property 3. In mapred-stie.xml property namemapred.job.tracker/name valuelocalhost:8021/value /property Thanks, -Idris On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyniefin...@gmail.comwrote: Positive. Like I said before, netstat -a | grep 8020 gives me nothing. Even if the firewall was the problem, that should still give me output that the port is listening, but I'd just be unable to hit it from an outside box (I tested this by blocking port 50070, at which point it still showed up in netstat -a, but was inaccessible through http from a remote machine). This problem is something else. On 1/9/12 2:31 PM, zGreenfelder wrote: On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyniefinkel@gmail.**comiefin...@gmail.com wrote: More info: In the DataNode log, I'm also seeing: 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. 
Already tried 9 time(s). Why would things just not load on port 8020? I feel like all the errors I'm seeing are caused by this, but I can't see any errors about why this occurred in the first place. are you sure there isn't a firewall in place blocking port 8020? e.g. iptables on the local machines? if you do telnet localhost 8020 do you make a connection? if you use lsof and/or netstat can you see the port open? if you have root access you can try turning off the firewall with iptables -F to see if things work without firewall rules.
Re: Netstat Shows Port 8020 Doesn't Seem to Listen
OK, not sure what I did (restarting the firewall, perhaps?), but I now have ports 8020 and 8021 listening and no more errors in my logs. Wooo! Only problem is I still can't get any hadoop stuff to work from a remote client: hadoop fs -ls / 2012-01-09 17:53:53.559 java[13396:1903] Unable to load realm info from SCDynamicStore 12/01/09 17:53:55 INFO ipc.Client: Retrying connect to server: *my_server/my_ip*:8020. Already tried 0 time(s). 12/01/09 17:53:56 INFO ipc.Client: Retrying connect to server: *my_server/my_ip*:8020. Already tried 1 time(s). 12/01/09 17:53:57 INFO ipc.Client: Retrying connect to server: *my_server/my_ip*:8020. Already tried 2 time(s). ... I feel like I'm almost there. Might this have to do with the fact that core-site.xml and mapred-site.xml specify localhost for ports 8020 and 8021 (thus not listening to any attempted outside connections?) Thanks for all the help so far, everyone! Eli On 1/9/12 5:43 PM, alo.alt wrote: Firewall online? and be sure that in /etc/hosts ONLY 127.0.0.1 is linked to localhost. Nothing like YOURHOSTNAME.YOURDOMAIN (Redhat kudzu bug) - Alex -- Alexander Lorenz http://mapredit.blogspot.com On Jan 9, 2012, at 2:39 PM, Eli Finkelshteyn wrote: Good call! netstat -anl gives me: tcp0 0 :::127.0.0.1:8020 :::* LISTEN Now it just looks like nothing is running on 8021. And now I'm really confused about why I get no communication over 8020 from the datanode. Just to reiterate, this definitely is not the firewall, running iptables -nvL gives: ... 0 0 ACCEPT tcp -- * * 0.0.0.0/00.0.0.0/0 state NEW tcp dpt:50070 164 ACCEPT tcp -- * * 0.0.0.0/00.0.0.0/0 state NEW tcp dpt:50030 0 0 ACCEPT tcp -- * * 0.0.0.0/00.0.0.0/0 state NEW tcp dpt:8021 164 ACCEPT tcp -- * * 0.0.0.0/00.0.0.0/0 state NEW tcp dpt:8020 ... On 1/9/12 5:08 PM, alo.alt wrote: What happen when you try a telnet localhost 8020? netstat -anl would also useful. best, Alex -- Alexander Lorenz http://mapredit.blogspot.com On Jan 9, 2012, at 2:02 PM, Eli Finkelshteyn wrote: A bit more info: When I start up only the namenode by itself, I'm not seeing any errors, but what I am seeing that's really odd is: 2012-01-09 16:48:45,530 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020 2012-01-09 16:48:45,531 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2012-01-09 16:48:45,532 INFO org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2012-01-09 16:48:45,541 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: localhost.localdomain/127.0.0.1:8020 That's despite the fact that doing netstat -a | grep 8020 still returns nothing. To me, that makes absolutely no sense. I feel like I should be getting an error telling me Namenode did not in fact go up on 8020, but I'm not getting that at all. Eli On 1/9/12 3:22 PM, Idris Ali wrote: Hi, Looks like problem in starting DFS and MR, can you run 'jps' and see if NN, DN, SNN, JT and TT are running, also make sure for pseudo-distributed mode, the following entries are present: 1. In core-site.xml property namefs.default.name/name valuehdfs://localhost:8020/value /property property namehadoop.tmp.dir/name valueSOME TMP dir with Read/Write acces not system temp/value /property property 2. 
In hdfs-site.xml property namedfs.replication/name value1/value /property property namedfs.permissions/name valuefalse/value /property property !-- specify this so that running 'hadoop namenode -format' formats the right dir -- namedfs.name.dir/name valueLocal dir with Read/Write access/value /property 3. In mapred-stie.xml property namemapred.job.tracker/name valuelocalhost:8021/value /property Thanks, -Idris On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyniefin...@gmail.comwrote: Positive. Like I said before, netstat -a | grep 8020 gives me nothing. Even if the firewall was the problem, that should still give me output that the port is listening, but I'd just be unable to hit it from an outside box (I tested this by blocking port 50070, at which point it still showed up in netstat -a, but was inaccessible through http from a remote machine). This problem is something else. On 1/9/12 2:31 PM, zGreenfelder wrote: On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyniefinkel@gmail.**comiefin...@gmail.com wrote: More info: In the DataNode log, I'm also seeing: 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 9 time(s). Why would things just
RE: Netstat Shows Port 8020 Doesn't Seem to Listen
One of the option to troubleshoot is to run open your url in the lynx, on the command line it will be not be affected by any firewall and will be a local access... Regards, Vivek -Original Message- From: Eli Finkelshteyn [mailto:iefin...@gmail.com] Sent: Monday, January 09, 2012 2:39 PM To: common-user@hadoop.apache.org Subject: Re: Netstat Shows Port 8020 Doesn't Seem to Listen Good call! netstat -anl gives me: tcp0 0 :::127.0.0.1:8020 :::*LISTEN Now it just looks like nothing is running on 8021. And now I'm really confused about why I get no communication over 8020 from the datanode. Just to reiterate, this definitely is not the firewall, running iptables -nvL gives: ... 0 0 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:50070 164 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:50030 0 0 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:8021 164 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:8020 ... On 1/9/12 5:08 PM, alo.alt wrote: What happen when you try a telnet localhost 8020? netstat -anl would also useful. best, Alex -- Alexander Lorenz http://mapredit.blogspot.com On Jan 9, 2012, at 2:02 PM, Eli Finkelshteyn wrote: A bit more info: When I start up only the namenode by itself, I'm not seeing any errors, but what I am seeing that's really odd is: 2012-01-09 16:48:45,530 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020 2012-01-09 16:48:45,531 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2012-01-09 16:48:45,532 INFO org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2012-01-09 16:48:45,541 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: localhost.localdomain/127.0.0.1:8020 That's despite the fact that doing netstat -a | grep 8020 still returns nothing. To me, that makes absolutely no sense. I feel like I should be getting an error telling me Namenode did not in fact go up on 8020, but I'm not getting that at all. Eli On 1/9/12 3:22 PM, Idris Ali wrote: Hi, Looks like problem in starting DFS and MR, can you run 'jps' and see if NN, DN, SNN, JT and TT are running, also make sure for pseudo-distributed mode, the following entries are present: 1. In core-site.xml property namefs.default.name/name valuehdfs://localhost:8020/value /property property namehadoop.tmp.dir/name valueSOME TMP dir with Read/Write acces not system temp/value /property property 2. In hdfs-site.xml property namedfs.replication/name value1/value /property property namedfs.permissions/name valuefalse/value /property property !-- specify this so that running 'hadoop namenode -format' formats the right dir -- namedfs.name.dir/name valueLocal dir with Read/Write access/value /property 3. In mapred-stie.xml property namemapred.job.tracker/name valuelocalhost:8021/value /property Thanks, -Idris On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyniefin...@gmail.comwrote: Positive. Like I said before, netstat -a | grep 8020 gives me nothing. Even if the firewall was the problem, that should still give me output that the port is listening, but I'd just be unable to hit it from an outside box (I tested this by blocking port 50070, at which point it still showed up in netstat -a, but was inaccessible through http from a remote machine). This problem is something else. 
On 1/9/12 2:31 PM, zGreenfelder wrote: On Mon, Jan 9, 2012 at 1:58 PM, Eli Finkelshteyn iefin...@gmail.com wrote: More info: In the DataNode log, I'm also seeing: 2012-01-09 13:06:27,751 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 9 time(s). Why would things just not load on port 8020? I feel like all the errors I'm seeing are caused by this, but I can't see any errors about why this occurred in the first place.

Are you sure there isn't a firewall in place blocking port 8020, e.g. iptables on the local machines? If you do telnet localhost 8020, do you make a connection? If you use lsof and/or netstat, can you see the port open? If you have root access, you can try turning off the firewall with iptables -F to see if things work without firewall rules.
Re: Netstat Shows Port 8020 Doesn't Seem to Listen
OK, switched over to my site's DNS name and I'm golden. Everything works both locally and remotely. My suspicion is that the problem all along was the firewall not having been restarted, and that I then made things worse while trying to fix it by corrupting the HDFS file system. Anyway, all works now. Thanks for the help, everyone! Eli

On 1/9/12 6:11 PM, Eli Finkelshteyn wrote: OK, not sure what I did (restarting the firewall, perhaps?), but I now have ports 8020 and 8021 listening and no more errors in my logs. Wooo! Only problem is I still can't get any hadoop stuff to work from a remote client:

hadoop fs -ls /
2012-01-09 17:53:53.559 java[13396:1903] Unable to load realm info from SCDynamicStore
12/01/09 17:53:55 INFO ipc.Client: Retrying connect to server: *my_server/my_ip*:8020. Already tried 0 time(s).
12/01/09 17:53:56 INFO ipc.Client: Retrying connect to server: *my_server/my_ip*:8020. Already tried 1 time(s).
12/01/09 17:53:57 INFO ipc.Client: Retrying connect to server: *my_server/my_ip*:8020. Already tried 2 time(s).
...

I feel like I'm almost there. Might this have to do with the fact that core-site.xml and mapred-site.xml specify localhost for ports 8020 and 8021 (and thus aren't listening for any outside connections)? Thanks for all the help so far, everyone! Eli

On 1/9/12 5:43 PM, alo.alt wrote: Is the firewall online? And be sure that in /etc/hosts ONLY 127.0.0.1 is linked to localhost, nothing like YOURHOSTNAME.YOURDOMAIN (Redhat kudzu bug). - Alex -- Alexander Lorenz http://mapredit.blogspot.com

On Jan 9, 2012, at 2:39 PM, Eli Finkelshteyn wrote: Good call! netstat -anl gives me:

tcp 0 0 :::127.0.0.1:8020 :::* LISTEN

Now it just looks like nothing is running on 8021. And now I'm really confused about why I get no communication over 8020 from the datanode. Just to reiterate, this definitely is not the firewall; running iptables -nvL gives:

... 0 0 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:50070 164 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:50030 0 0 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:8021 164 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:8020 ...

On 1/9/12 5:08 PM, alo.alt wrote: What happens when you try telnet localhost 8020? netstat -anl would also be useful. best, Alex -- Alexander Lorenz http://mapredit.blogspot.com

On Jan 9, 2012, at 2:02 PM, Eli Finkelshteyn wrote: A bit more info: When I start up only the namenode by itself, I'm not seeing any errors, but what I am seeing that's really odd is:

2012-01-09 16:48:45,530 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020
2012-01-09 16:48:45,531 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020
2012-01-09 16:48:45,532 INFO org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020
2012-01-09 16:48:45,541 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: localhost.localdomain/127.0.0.1:8020

That's despite the fact that doing netstat -a | grep 8020 still returns nothing. To me, that makes absolutely no sense. I feel like I should be getting an error telling me the Namenode did not in fact come up on 8020, but I'm not getting that at all. Eli

On 1/9/12 3:22 PM, Idris Ali wrote: Hi, Looks like a problem starting DFS and MR. Can you run 'jps' and see if NN, DN, SNN, JT and TT are running? Also make sure that for pseudo-distributed mode the following entries are present:
1. In core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:8020</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>SOME TMP dir with Read/Write access, not the system temp</value>
</property>

2. In hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
<property>
  <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
  <name>dfs.name.dir</name>
  <value>Local dir with Read/Write access</value>
</property>

3. In mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:8021</value>
</property>

Thanks, -Idris

On Tue, Jan 10, 2012 at 1:07 AM, Eli Finkelshteyn iefin...@gmail.com wrote: Positive. Like I said before, netstat -a | grep 8020 gives me nothing. Even if the firewall were the problem, that should still show the port as listening; I'd just be unable to hit it from an outside box (I tested this by blocking port 50070, at which point it still showed up in netstat -a but was inaccessible over HTTP from a remote machine). This problem is something else.
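For reference, the change Eli describes at the top of this thread (switching from localhost to the site's DNS name so remote clients can reach the namenode and jobtracker) would look roughly like the sketch below; namenode.example.com is a placeholder for the actual host name, and the ports are the defaults used elsewhere in the thread. With localhost in these values the daemons bind only to 127.0.0.1, which matches the ":::127.0.0.1:8020 LISTEN" netstat output above and would explain why remote clients could not connect.

In core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>

In mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>namenode.example.com:8021</value>
</property>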
Re: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
Hi, Thanks for your reply! I had already read those pages before. Can you give me some more specific suggestions about how to choose the values of mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum according to our cluster configuration, if possible? regards! 2012-01-10 hao.wang

From: Harsh J Sent: 2012-01-09 23:19:21 To: common-user Cc: Subject: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Hi, Please read http://hadoop.apache.org/common/docs/current/single_node_setup.html to learn how to configure Hadoop using the various *-site.xml configuration files, and then follow http://hadoop.apache.org/common/docs/current/cluster_setup.html to achieve optimal configs for your cluster.

On 09-Jan-2012, at 5:50 PM, hao.wang wrote: Hi, all. Our hadoop cluster has 22 nodes: one namenode, one jobtracker and 20 datanodes. Each node has 2 * 12 cores with 32G RAM. Does anyone know how to configure the following parameters: mapred.tasktracker.map.tasks.maximum mapred.tasktracker.reduce.tasks.maximum regards! 2012-01-09 hao.wang
Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
Hello again, Try a 4:3 ratio between maps and reduces, against the total # of available CPUs per node (minus one or two for the DN and HBase, if you run those). Then tweak it as you go (more map-only loads or more map+reduce loads depends on your usage, and you can adjust the ratio accordingly over time -- changing those props does not need a JobTracker restart, just a TaskTracker restart).

On 10-Jan-2012, at 8:17 AM, hao.wang wrote: Hi, Thanks for your reply! I had already read those pages before. Can you give me some more specific suggestions about how to choose the values of mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum according to our cluster configuration, if possible? regards! 2012-01-10 hao.wang

From: Harsh J Sent: 2012-01-09 23:19:21 To: common-user Cc: Subject: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Hi, Please read http://hadoop.apache.org/common/docs/current/single_node_setup.html to learn how to configure Hadoop using the various *-site.xml configuration files, and then follow http://hadoop.apache.org/common/docs/current/cluster_setup.html to achieve optimal configs for your cluster.

On 09-Jan-2012, at 5:50 PM, hao.wang wrote: Hi, all. Our hadoop cluster has 22 nodes: one namenode, one jobtracker and 20 datanodes. Each node has 2 * 12 cores with 32G RAM. Does anyone know how to configure the following parameters: mapred.tasktracker.map.tasks.maximum mapred.tasktracker.reduce.tasks.maximum regards! 2012-01-09 hao.wang
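To make the arithmetic concrete for hardware like that described in the quoted message (2 * 12 = 24 cores per node): one plausible reading of the 4:3 advice is to reserve a core or two for the DataNode and TaskTracker daemons, then split the remaining ~22 slots 4:3 between maps and reduces, for example 12 map slots and 9 reduce slots. The exact numbers below are illustrative, not a recommendation from the thread; the properties go in mapred-site.xml on each tasktracker node.

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>12</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>9</value>
</property>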
Re: datanode failing to start
Can you please send your notes on what info is out of date, or better still create a JIRA, so that it can be addressed.

On Fri, Jan 6, 2012 at 3:11 PM, Dave Kelsey da...@gamehouse.com wrote: Gave up and installed version 1. It installed correctly and worked, though the instructions for setup and the location of scripts and configs are now out of date. D

On 1/5/2012 10:25 AM, Dave Kelsey wrote: java version 1.6.0_29, hadoop: 0.20.203.0. I'm attempting to set up the pseudo-distributed config on a mac 10.6.8. I followed the steps from the QuickStart (http://wiki.apache.org/hadoop/QuickStart) and succeeded with Stage 1: Standalone Operation. I followed the steps for Stage 2: Pseudo-distributed Configuration. I set the JAVA_HOME variable in conf/hadoop-env.sh and I changed tools.jar to the location of classes.jar (a mac version of tools.jar). I've modified the three .xml files as described in the QuickStart. ssh'ing to localhost has been configured and works with passwordless authentication. I formatted the namenode with bin/hadoop namenode -format as the instructions say.

This is what I see when I run bin/start-all.sh:

root# bin/start-all.sh
starting namenode, logging to /Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-namenode-Hoot-2.local.out
localhost: starting datanode, logging to /Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-datanode-Hoot-2.local.out
localhost: Exception in thread main java.lang.NoClassDefFoundError: server
localhost: Caused by: java.lang.ClassNotFoundException: server
localhost: at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
localhost: at java.security.AccessController.doPrivileged(Native Method)
localhost: at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
localhost: at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
localhost: at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
localhost: at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
localhost: starting secondarynamenode, logging to /Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-secondarynamenode-Hoot-2.local.out
starting jobtracker, logging to /Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-jobtracker-Hoot-2.local.out
localhost: starting tasktracker, logging to /Users/admin/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-root-tasktracker-Hoot-2.local.out

There are 4 processes running (ps -fax | grep hadoop | grep -v grep | wc -l gives 4). They are: SecondaryNameNode, TaskTracker, NameNode, JobTracker. I've searched to see if anyone else has encountered this and not found anything. d

p.s. I've also posted this to core-u...@hadoop.apache.org, which I've yet to find how to subscribe to.
Container launch from appmaster
Hi all, I am trying to write an application master. Is there a way to specify:

node1: 10 containers
node2: 10 containers

Can we specify this kind of list using the application master?
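Nothing in the thread spells out the code, but on the 0.23 AM-to-RM interface the usual way to express per-node requests is to build one ResourceRequest per host (plus an "any node" request) and send them in an AllocateRequest. The sketch below is a rough, unverified example under those assumptions: resourceManager (an AMRMProtocol proxy) and appAttemptId are placeholders for state the AM would already hold after registering, and the memory/priority values are arbitrary. Host names are only locality preferences, so where containers actually land still depends on the scheduler and on free capacity on those nodes.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.yarn.api.AMRMProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateRequest;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.util.Records;

public class PerHostRequests {
  private final AMRMProtocol resourceManager;      // AM<->RM proxy created during registration
  private final ApplicationAttemptId appAttemptId; // from the AM's launch environment
  private int responseId = 0;

  public PerHostRequests(AMRMProtocol rm, ApplicationAttemptId attemptId) {
    this.resourceManager = rm;
    this.appAttemptId = attemptId;
  }

  // One ResourceRequest for 'num' containers on a specific host ("*" means any host).
  private static ResourceRequest hostRequest(String host, int num, int memoryMb, int prio) {
    ResourceRequest req = Records.newRecord(ResourceRequest.class);
    req.setHostName(host);
    req.setNumContainers(num);
    Priority priority = Records.newRecord(Priority.class);
    priority.setPriority(prio);
    req.setPriority(priority);
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(memoryMb);
    req.setCapability(capability);
    return req;
  }

  // Called from the AM's allocate/heartbeat loop.
  public AllocateResponse requestContainers() throws Exception {
    List<ResourceRequest> asks = new ArrayList<ResourceRequest>();
    asks.add(hostRequest("node1", 10, 1024, 0));
    asks.add(hostRequest("node2", 10, 1024, 0));
    // Host-level asks act as locality hints; a matching "any node" ask at the
    // same priority is normally sent alongside them.
    asks.add(hostRequest("*", 20, 1024, 0));

    AllocateRequest alloc = Records.newRecord(AllocateRequest.class);
    alloc.setApplicationAttemptId(appAttemptId);
    alloc.setResponseId(responseId++);
    alloc.addAllAsks(asks);
    return resourceManager.allocate(alloc);
  }
}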
Re: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
Hi, Thanks for your reply! If I follow your suggestion literally, maybe I can't apply it to our hadoop cluster, because each server in our cluster has just 2 CPUs. So I think you mean the core # rather than the CPU # in each server? I am looking forward to your reply. regards! 2012-01-10 hao.wang

From: Harsh J Sent: 2012-01-10 11:33:38 To: common-user Cc: Subject: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Hello again, Try a 4:3 ratio between maps and reduces, against the total # of available CPUs per node (minus one or two for the DN and HBase, if you run those). Then tweak it as you go (more map-only loads or more map+reduce loads depends on your usage, and you can adjust the ratio accordingly over time -- changing those props does not need a JobTracker restart, just a TaskTracker restart).

On 10-Jan-2012, at 8:17 AM, hao.wang wrote: Hi, Thanks for your reply! I had already read those pages before. Can you give me some more specific suggestions about how to choose the values of mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum according to our cluster configuration, if possible? regards! 2012-01-10 hao.wang

From: Harsh J Sent: 2012-01-09 23:19:21 To: common-user Cc: Subject: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Hi, Please read http://hadoop.apache.org/common/docs/current/single_node_setup.html to learn how to configure Hadoop using the various *-site.xml configuration files, and then follow http://hadoop.apache.org/common/docs/current/cluster_setup.html to achieve optimal configs for your cluster.

On 09-Jan-2012, at 5:50 PM, hao.wang wrote: Hi, all. Our hadoop cluster has 22 nodes: one namenode, one jobtracker and 20 datanodes. Each node has 2 * 12 cores with 32G RAM. Does anyone know how to configure the following parameters: mapred.tasktracker.map.tasks.maximum mapred.tasktracker.reduce.tasks.maximum regards! 2012-01-09 hao.wang