Re: help with Hadoop custom Writable types implementation
Write: first write the length of your list, then write the items one by one. Read: read the length, initialize your list, then read the items one by one and append each to the list. I suggest you use an array instead of a list. On Thu, Oct 15, 2009 at 12:01 PM, z3r0c001 ceo.co...@gmail.com wrote: I'm trying to implement the Writable interface, but I'm not sure how to serialize/write/read data from nested objects in: public class StorageClass implements Writable { public String xStr; public String yStr; public List<SubStorage> sStor; //omitted ctors @Override public void write(DataOutput out) throws IOException { out.writeChars(xStr); out.writeChars(yStr); // WHAT SHOULD I DO FOR List<SubStorage>? } @Override public void readFields(DataInput in) throws IOException { xStr = in.readLine(); yStr = in.readLine(); // WHAT SHOULD I DO FOR List<SubStorage>? } } public class SubStorage { public String x; public String y; }
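A minimal sketch of that length-then-items pattern, assuming SubStorage is also made Writable and using writeUTF/readUTF instead of writeChars/readLine so the string lengths round-trip cleanly (names kept from the question, everything else illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Writable;

// Nested type: it must also implement Writable so the outer class can delegate to it.
class SubStorage implements Writable {
  public String x;
  public String y;

  public SubStorage() {}  // no-arg constructor, needed for deserialization

  public void write(DataOutput out) throws IOException {
    out.writeUTF(x);      // writeUTF records the string length for us
    out.writeUTF(y);
  }

  public void readFields(DataInput in) throws IOException {
    x = in.readUTF();
    y = in.readUTF();
  }
}

public class StorageClass implements Writable {
  public String xStr;
  public String yStr;
  public List<SubStorage> sStor = new ArrayList<SubStorage>();

  public StorageClass() {}

  public void write(DataOutput out) throws IOException {
    out.writeUTF(xStr);
    out.writeUTF(yStr);
    out.writeInt(sStor.size());            // 1) write the length of the list
    for (SubStorage s : sStor) {
      s.write(out);                        // 2) write each element
    }
  }

  public void readFields(DataInput in) throws IOException {
    xStr = in.readUTF();
    yStr = in.readUTF();
    int n = in.readInt();                  // 1) read the length back
    sStor = new ArrayList<SubStorage>(n);  // 2) re-initialize the list
    for (int i = 0; i < n; i++) {
      SubStorage s = new SubStorage();
      s.readFields(in);                    // 3) read each element and append it
      sStor.add(s);
    }
  }
}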
Re: How to get IP address of the machine where map task runs
InetAddress.getLocalHost() should give you that. If you are planning to make some decisions based on this, please do account for conditions arising from speculative execution (it caused me some amount of trouble when I was designing my app). Thanks, Amogh On 10/15/09 8:15 AM, Long Van Nguyen Dinh munt...@gmail.com wrote: Thanks Amogh. For my application, I want each map task to report to me where it's running. However, I have no idea how to use the Java InetAddress APIs to get that info. Could you explain more? Van On Wed, Oct 14, 2009 at 2:16 PM, Amogh Vasekar am...@yahoo-inc.com wrote: For starters look at any monitoring tool like Vaidya or the Hadoop UI (Ganglia too, though I haven't read much on it). Not sure if you need this for debugging purposes or for some other real-time app. You should be able to get info on the localhost of each of your map tasks in a pretty straightforward way using the Java InetAddress APIs (and use that info for search?) Thanks, Amogh On 10/15/09 12:11 AM, Long Van Nguyen Dinh munt...@gmail.com wrote: Hello again, Could you give me any hint to start with? I have no idea how to get that information. Many thanks, Van On Tue, Oct 13, 2009 at 9:22 PM, Long Van Nguyen Dinh munt...@gmail.com wrote: Hi all, Given a map task, I need to know the IP address of the machine where that task is running. Is there any existing method to get that information? Thank you, Van
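For illustration, a rough sketch of that suggestion using the 0.20 mapred API: the mapper resolves its own host once in configure() and tags its output with it. The class name and output layout are invented for the example.

import java.io.IOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class HostReportingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String host;

  @Override
  public void configure(JobConf job) {
    try {
      InetAddress addr = InetAddress.getLocalHost();
      host = addr.getHostName() + "/" + addr.getHostAddress();
    } catch (UnknownHostException e) {
      host = "unknown";
    }
  }

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // Emit the worker's host/IP as the key so it is visible downstream;
    // a real job would do its normal work here as well.
    out.collect(new Text(host), value);
  }
}

Keep in mind the speculative-execution caveat above: the same input split may end up being reported from two different hosts.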
Re: Need Info
Dear Shwitzu, The steps are listed below: Kindly go through the word count and multi-file word count examples for your project. Modify the program to list the files containing the keywords along with the file names. Use file names as keys. Store the files in 4 different input directories – one for each file type if needed. Else you can also keep them in a single input directory. Use the word count example with the extensions suggested to retrieve the file names containing the keywords, and store the result in an output directory or display the links. Map – parallelized reading of multiple files – input key-value pair is filename – file contents; output key-value pair is filename – keyword and count. Reduce – combines the key-value pairs output by the map function – input key-value pair is filename – keyword and count; output key-value pair is keyword – filenames containing the keyword. The answers to your questions are: 1) How should I start with the design? Identify the files to be saved in the HDFS input directory. Go through the word count example. 2) Upload all the files and create Map, Reduce and Driver code and once I run my application will it automatically go to the file system and get back the results to me? Move all the files from the local file system to HDFS / save them directly to HDFS using a suitable DFS command like copyFromLocal - go through the DFS commands. 3) How do I handle the binary data? I want to store binary format data using MTOM in my database. It can be handled in the same way as a conventional file. G Sudha Sadasivam t...@gmail.com wrote: From: shwitzu shwi...@gmail.com Subject: Need Info To: core-u...@hadoop.apache.org Date: Thursday, October 15, 2009, 7:19 AM Hello Sir! I am new to Hadoop. I have a project based on web services. I have my information in 4 databases with different files in each one of them. Say, images in one, video, documents etc. My task is to develop a web service which accepts a keyword from the client, processes the request and sends the actual requested file back to the user. Now I have to use the Hadoop distributed file system in this project. I have the following questions: 1) How should I start with the design? 2) Should I upload all the files and create Map, Reduce and Driver code, and once I run my application will it automatically go to the file system and get back the results to me? 3) How do I handle the binary data? I want to store binary format data using MTOM in my database. Please let me know how I should proceed. I don't know much about Hadoop and am searching for some help. It would be great if you could assist me. Thanks again -- View this message in context: http://www.nabble.com/Need-Info-tp25901902p25901902.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
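As a rough sketch of the map side described above (0.20 mapred API; the search.keyword job property and the class name are invented for the example), the mapper can emit (keyword, filename) pairs and let the reducer collect the distinct file names per keyword:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class KeywordFileMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String fileName;
  private String keyword;

  @Override
  public void configure(JobConf job) {
    // map.input.file is filled in by the framework for file-based input splits;
    // search.keyword is an assumed parameter set by the driver.
    fileName = job.get("map.input.file", "unknown");
    keyword = job.get("search.keyword", "");
  }

  @Override
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    if (keyword.length() > 0 && line.toString().contains(keyword)) {
      out.collect(new Text(keyword), new Text(fileName));
    }
  }
}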
Re: Error register getProtocolVersion
Thanks for the info Todd, I have also seen this error, but it's not fatal. Ok, I'll ignore it for the time being. Is this log from just the NameNode or did you tail multiple logs? It seems odd that your namenode would be trying to make an IPC client to itself (port 8020). Just the name node. After you see these messages, does your namenode shut down? It kind of just hangs really (Already tried 9 times, Already tried 10 times etc) Does jps show it running? Is the web interface available at port 50070? No, it never is available. I'll keep hunting for some config error. Cheers Tim -Todd On Wed, Oct 14, 2009 at 2:33 AM, tim robertson timrobertson...@gmail.com wrote: Hi all, Using hadoop 0.20.1 I am seeing the following on my namenode startup. 2009-10-14 11:09:54,232 INFO org.apache.hadoop.ipc.Server: Error register getProtocolVersion java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion Could someone please point me in the right direction for diagnosing where I have gone wrong? Thanks Tim / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = cluma.gbif.org/192.38.28.77 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.1 STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1 -r 810220; compiled by 'oom' on Tue Sep 1 20:55:56 UTC 2009 / 2009-10-14 11:09:53,010 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2009-10-14 11:09:53,013 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: cluma.gbif.clu/192.168.76.254:8020 2009-10-14 11:09:53,015 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 2009-10-14 11:09:53,019 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-10-14 11:09:53,056 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=root,root,bin,daemon,sys,adm,disk,wheel 2009-10-14 11:09:53,057 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup 2009-10-14 11:09:53,057 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true 2009-10-14 11:09:53,064 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-10-14 11:09:53,065 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean 2009-10-14 11:09:53,090 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 1 2009-10-14 11:09:53,094 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 0 2009-10-14 11:09:53,094 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94 loaded in 0 seconds. 2009-10-14 11:09:53,094 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /hadoop/name/current/edits of size 4 edits # 0 loaded in 0 seconds. 2009-10-14 11:09:53,098 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94 saved in 0 seconds. 
2009-10-14 11:09:53,113 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 72 msecs 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Total number of blocks = 0 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of invalid blocks = 0 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of under-replicated blocks = 0 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of over-replicated blocks = 0 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 0 secs. 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks 2009-10-14 11:09:53,198 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog 2009-10-14 11:09:53,242 INFO org.apache.hadoop.http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 50070 2009-10-14 11:09:53,243 INFO org.apache.hadoop.http.HttpServer: listener.getLocalPort() returned 50070 webServer.getConnectors()[0].getLocalPort() returned 50070 2009-10-14 11:09:53,243 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50070 2009-10-14 11:09:53,243 INFO org.mortbay.log: jetty-6.1.14 2009-10-14 11:09:53,433 INFO org.mortbay.log: Started
Re: Error register getProtocolVersion
Hi Tim, jps should show it running if it's hung. In that case, see if you can get a stack trace using jstack pid. Paste that trace and we may be able to figure out where it's hung up. -Todd On Thu, Oct 15, 2009 at 12:12 AM, tim robertson timrobertson...@gmail.com wrote: Thanks for the info Todd, I have also seen this error, but it's not fatal. Ok, I'll ignore it for the time being. Is this log from just the NameNode or did you tail multiple logs? It seems odd that your namenode would be trying to make an IPC client to itself (port 8020). Just the name node. After you see these messages, does your namenode shut down? It kind of just hangs really (Already tried 9 times, Already tried 10 times etc) Does jps show it running? Is the web interface available at port 50070? No, it never is available. I'll keep hunting for some config error. Cheers Tim -Todd On Wed, Oct 14, 2009 at 2:33 AM, tim robertson timrobertson...@gmail.com wrote: Hi all, Using hadoop 0.20.1 I am seeing the following on my namenode startup. 2009-10-14 11:09:54,232 INFO org.apache.hadoop.ipc.Server: Error register getProtocolVersion java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion Could someone please point me in the right direction for diagnosing where I have gone wrong? Thanks Tim / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = cluma.gbif.org/192.38.28.77 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.1 STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1 -r 810220; compiled by 'oom' on Tue Sep 1 20:55:56 UTC 2009 / 2009-10-14 11:09:53,010 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2009-10-14 11:09:53,013 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: cluma.gbif.clu/192.168.76.254:8020 2009-10-14 11:09:53,015 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 2009-10-14 11:09:53,019 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-10-14 11:09:53,056 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=root,root,bin,daemon,sys,adm,disk,wheel 2009-10-14 11:09:53,057 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup 2009-10-14 11:09:53,057 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true 2009-10-14 11:09:53,064 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-10-14 11:09:53,065 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean 2009-10-14 11:09:53,090 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 1 2009-10-14 11:09:53,094 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 0 2009-10-14 11:09:53,094 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94 loaded in 0 seconds. 2009-10-14 11:09:53,094 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /hadoop/name/current/edits of size 4 edits # 0 loaded in 0 seconds. 2009-10-14 11:09:53,098 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94 saved in 0 seconds. 
2009-10-14 11:09:53,113 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 72 msecs 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Total number of blocks = 0 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of invalid blocks = 0 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of under-replicated blocks = 0 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of over-replicated blocks = 0 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 0 secs. 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks 2009-10-14 11:09:53,198 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog 2009-10-14 11:09:53,242 INFO org.apache.hadoop.http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 50070 2009-10-14 11:09:53,243 INFO org.apache.hadoop.http.HttpServer: listener.getLocalPort() returned 50070
Re: Error register getProtocolVersion
2009-10-14 11:09:54,232 INFO org.apache.hadoop.ipc.Server: Error register getProtocolVersion java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion at org.apache.hadoop.metrics.util.MetricsRegistry.add(MetricsRegistry.java:53) at org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.init(MetricsTimeVaryingRate.java:89) at org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.init(MetricsTimeVaryingRate.java:99) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) I have got this error when i start my mapreduce but everything seems OK and i can still running my mapreduce job Any suggestion to deal with this error? 2009/10/15 Todd Lipcon t...@cloudera.com Hi Tim, jps should show it running if it's hung. In that case, see if you can get a stack trace using jstack pid. Paste that trace and we may be able to figure out where it's hung up. -Todd On Thu, Oct 15, 2009 at 12:12 AM, tim robertson timrobertson...@gmail.com wrote: Thanks for the info Todd, I have also seen this error, but it's not fatal. Ok, I'll ignore it for the time being. Is this log from just the NameNode or did you tail multiple logs? It seems odd that your namenode would be trying to make an IPC client to itself (port 8020). Just the name node. After you see these messages, does your namenode shut down? It kind of just hangs really (Already tried 9 times, Already tried 10 times etc) Does jps show it running? Is the web interface available at port 50070? No, it never is available. I'll keep hunting for some config error. Cheers Tim -Todd On Wed, Oct 14, 2009 at 2:33 AM, tim robertson timrobertson...@gmail.com wrote: Hi all, Using hadoop 0.20.1 I am seeing the following on my namenode startup. 2009-10-14 11:09:54,232 INFO org.apache.hadoop.ipc.Server: Error register getProtocolVersion java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion Could someone please point me in the right direction for diagnosing where I have gone wrong? 
Thanks Tim / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = cluma.gbif.org/192.38.28.77 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.1 STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1 -r 810220; compiled by 'oom' on Tue Sep 1 20:55:56 UTC 2009 / 2009-10-14 11:09:53,010 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2009-10-14 11:09:53,013 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: cluma.gbif.clu/192.168.76.254:8020 2009-10-14 11:09:53,015 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 2009-10-14 11:09:53,019 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-10-14 11:09:53,056 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=root,root,bin,daemon,sys,adm,disk,wheel 2009-10-14 11:09:53,057 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup 2009-10-14 11:09:53,057 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true 2009-10-14 11:09:53,064 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-10-14 11:09:53,065 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean 2009-10-14 11:09:53,090 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 1 2009-10-14 11:09:53,094 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 0 2009-10-14 11:09:53,094 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94 loaded in 0 seconds. 2009-10-14 11:09:53,094 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /hadoop/name/current/edits of size 4 edits # 0 loaded in 0 seconds. 2009-10-14 11:09:53,098 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94 saved in 0 seconds. 2009-10-14 11:09:53,113 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 72 msecs 2009-10-14 11:09:53,114 INFO
Re: Hardware performance from HADOOP cluster
Hi Todd, Some changes have been applied to the cluster based on the documentation (URL) you noted below, like file descriptor settings and io.file.buffer.size. I will check out the other settings which I haven't applied yet. My map/reduce slot settings from my hadoop-site.xml and hadoop-default.xml on all nodes in the cluster. _*hadoop-site.xml *_mapred.tasktracker.task.maximum = 2 mapred.tasktracker.map.tasks.maximum = 8 mapred.tasktracker.reduce.tasks.maximum = 8 _* hadoop-default.xml *_mapred.map.tasks = 2 mapred.reduce.tasks = 1 Thanks, Usman This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have you changed the configurations at all? There are some notes on this blog post that might help your performance a bit: http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/ How many map and reduce slots did you configure for the daemons? If you have Ganglia installed you can usually get a good idea of whether you're using your resources well by looking at the graphs while running a job like this sort. -Todd On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed usm...@opera.com wrote: Here are the results i got from my 4 node cluster (correction i noted 5 earlier). One of my nodes out of the 4 is a namenode+datanode both. GENERATE RANDOM DATA Wrote out 40GB of random binary data: Map output records=4088301 The job took 358 seconds. (approximately: 6 minutes). SORT RANDOM GENERATED DATA Map output records=4088301 Reduce input records=4088301 The job took 2136 seconds. (approximately: 35 minutes). VALIDATION OF SORTED DATA The job took 183 seconds. SUCCESS! Validated the MapReduce framework's 'sort' successfully. It would be interesting to see what performance numbers others with a similar setup have obtained. Thanks, Usman I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G drives, and will do the same soon. Got some issues though so it won't start up... Tim On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed usm...@opera.com wrote: Thanks Tim, i will check it out and post my results for comments. -Usman Might it be worth running the http://wiki.apache.org/hadoop/Sort and posting your results for comment? Tim On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed usm...@opera.com wrote: Hi, Is there a way to tell what kind of performance numbers one can expect out of their cluster given a certain set of specs. For example i have 5 nodes in my cluster that all have the following hardware configuration(s): Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack. Thanks, Usman
Re: detecting stalled daemons?
Edward Capriolo wrote: I know there is a Jira open to add life cycle methods to each hadoop component that can be polled for progress. I don't know the # offhand. HDFS-326 (https://issues.apache.org/jira/browse/HDFS-326); the code has its own branch. This is still something I'm working on. The code works and all the tests work, but there are some quirks with JobTracker startup, now that it blocks waiting for the filesystem to come up, that I'm not happy with; I need to add some new tests/mechanisms to shut down a service while it is still starting up, which includes interrupting the JT and TT. You can get RPMs with all this stuff packaged up for use from http://smartfrog.org/ , with the caveat that it's still fairly unstable. I am currently working on the other side of the equation, integration with multiple cloud infrastructures, with all the fun testing issues that follow: http://www.1060.org/blogxter/entry?publicid=12CE2B62F71239349F3E9903EAE9D1F0 * The simplest liveness test for any of the workers right now is to hit their HTTP pages; it's the classic happy test. We can and should extend this with more self-tests, some equivalent of Axis's happy.jsp. The nice thing about these is they integrate well with all the existing web page monitoring tools, though I should warn that the same tooling that tracks and reports the health of a four-way app server doesn't really scale to keeping an eye on 3000 task trackers. It's not the monitoring, but the reporting. * Detecting failures of TTs and DNs is kind of tricky too; it's really the namenode and jobtracker that know best. We need to get some reporting in there so that when either of the masters thinks that one of its workers is playing up, it reports it to whatever plugin wants to know. * Handling failures of VMs is very different from physical machines. You just kill the VM and restart a new one. We don't need all the blacklisting stuff, just some infrastructure operations and various notifications to the ops team. -steve
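A minimal sketch of that hit-their-HTTP-pages happy test, assuming the default web UI ports (50070 for the NameNode, 50030 for the JobTracker); this is plain Java for illustration, not something shipped with Hadoop:

import java.net.HttpURLConnection;
import java.net.URL;

public class HttpLivenessCheck {

  // Returns true if the daemon's embedded web server answers HTTP 200 within the timeout.
  public static boolean isAlive(String host, int port, int timeoutMs) {
    try {
      URL url = new URL("http", host, port, "/");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setConnectTimeout(timeoutMs);
      conn.setReadTimeout(timeoutMs);
      int code = conn.getResponseCode();
      conn.disconnect();
      return code == HttpURLConnection.HTTP_OK;
    } catch (Exception e) {
      return false;  // connection refused, timed out, or not HTTP: treat as down
    }
  }

  public static void main(String[] args) {
    System.out.println("namenode up:   " + isAlive("localhost", 50070, 2000));
    System.out.println("jobtracker up: " + isAlive("localhost", 50030, 2000));
  }
}

As noted above, a 200 only tells you the web server is up; deeper self-tests are needed to know the daemon is actually healthy.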
Can't I use hive? What does this exception mean?
I have already installed Hadoop correctly, but what does this mean? ~/hive/build/dist/bin$ ./hive Hive history file=/tmp/yangzhuoluo/hive_job_log_yangzhuoluo_200910151919_995767046.txt hive> show tables; FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask Can anybody tell me what the exception means? Thank you all very much.
My Hadoop World Report
Hi all, My Hadoop World report has reached 4th position in the page-view ranking of ITmedia, a very large Japanese media outlet here (bigger than CNet). Please scroll down and view the right pane of the URL below. You can find the keyword Hadoop in the list. However, the content is in Japanese, sorry. http://www.itmedia.co.jp/enterprise/articles/0910/15/news011.html /mikio uzawa
Re: Error register getProtocolVersion
Hi Todd, Brian Thanks for your comments. I got to the bottom of the problem. A netstat showed me that all the DNs could connect to the NN but the firewall was stopping the NN complete the http handshake to itself. It's all up and running now. Thanks, Tim On Thu, Oct 15, 2009 at 10:23 AM, Todd Lipcon t...@cloudera.com wrote: Hi Tim, jps should show it running if it's hung. In that case, see if you can get a stack trace using jstack pid. Paste that trace and we may be able to figure out where it's hung up. -Todd On Thu, Oct 15, 2009 at 12:12 AM, tim robertson timrobertson...@gmail.com wrote: Thanks for the info Todd, I have also seen this error, but it's not fatal. Ok, I'll ignore it for the time being. Is this log from just the NameNode or did you tail multiple logs? It seems odd that your namenode would be trying to make an IPC client to itself (port 8020). Just the name node. After you see these messages, does your namenode shut down? It kind of just hangs really (Already tried 9 times, Already tried 10 times etc) Does jps show it running? Is the web interface available at port 50070? No, it never is available. I'll keep hunting for some config error. Cheers Tim -Todd On Wed, Oct 14, 2009 at 2:33 AM, tim robertson timrobertson...@gmail.com wrote: Hi all, Using hadoop 0.20.1 I am seeing the following on my namenode startup. 2009-10-14 11:09:54,232 INFO org.apache.hadoop.ipc.Server: Error register getProtocolVersion java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion Could someone please point me in the right direction for diagnosing where I have gone wrong? Thanks Tim / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = cluma.gbif.org/192.38.28.77 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.1 STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1 -r 810220; compiled by 'oom' on Tue Sep 1 20:55:56 UTC 2009 / 2009-10-14 11:09:53,010 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 2009-10-14 11:09:53,013 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: cluma.gbif.clu/192.168.76.254:8020 2009-10-14 11:09:53,015 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 2009-10-14 11:09:53,019 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-10-14 11:09:53,056 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=root,root,bin,daemon,sys,adm,disk,wheel 2009-10-14 11:09:53,057 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup 2009-10-14 11:09:53,057 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true 2009-10-14 11:09:53,064 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-10-14 11:09:53,065 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean 2009-10-14 11:09:53,090 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 1 2009-10-14 11:09:53,094 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 0 2009-10-14 11:09:53,094 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94 loaded in 0 seconds. 
2009-10-14 11:09:53,094 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /hadoop/name/current/edits of size 4 edits # 0 loaded in 0 seconds. 2009-10-14 11:09:53,098 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94 saved in 0 seconds. 2009-10-14 11:09:53,113 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 72 msecs 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Total number of blocks = 0 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of invalid blocks = 0 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of under-replicated blocks = 0 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of over-replicated blocks = 0 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 0 secs. 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks 2009-10-14 11:09:53,198 INFO org.mortbay.log: Logging to
normal hadoop errors?
I got the following error while running the example sort program (hadoop 0.20) on a brand new Hadoop cluster (using the Cloudera distro). The job seems to have recovered. However I'm wondering if this is normal or should I be checking for something. attempt_200910051513_0009_r_05_0: 09/10/15 09:53:52 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.10.10.52:50010 attempt_200910051513_0009_r_05_0: 09/10/15 09:53:52 INFO hdfs.DFSClient: Abandoning block blk_-7778196938228518311_13172 attempt_200910051513_0009_r_05_0: 09/10/15 09:53:52 INFO hdfs.DFSClient: Waiting to find target node: 10.10.10.56:50010 attempt_200910051513_0009_r_05_0: 09/10/15 09:54:01 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.10.10.55:50010 attempt_200910051513_0009_r_05_0: 09/10/15 09:54:01 INFO hdfs.DFSClient: Abandoning block blk_-7309503129247220072_13172 attempt_200910051513_0009_r_05_0: 09/10/15 09:54:09 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.10.10.52:50010 attempt_200910051513_0009_r_05_0: 09/10/15 09:54:09 INFO hdfs.DFSClient: Abandoning block blk_3948102363851753370_13172 attempt_200910051513_0009_r_05_0: 09/10/15 09:54:15 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.10.10.55:50010 attempt_200910051513_0009_r_05_0: 09/10/15 09:54:15 INFO hdfs.DFSClient: Abandoning block blk_-9105283762069697302_13172 attempt_200910051513_0009_r_05_0: 09/10/15 09:54:21 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block. attempt_200910051513_0009_r_05_0: at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2813) attempt_200910051513_0009_r_05_0: at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077) attempt_200910051513_0009_r_05_0: at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263) attempt_200910051513_0009_r_05_0: 09/10/15 09:54:21 WARN hdfs.DFSClient: Error Recovery for block blk_-9105283762069697302_13172 bad datanode[1] nodes == null attempt_200910051513_0009_r_05_0: 09/10/15 09:54:21 WARN hdfs.DFSClient: Could not get block locations. Source file /user/hadoop/sort_out/_temporary/_attempt_200910051513_0009_r_05_0/part-5 - Aborting... 
attempt_200910051513_0009_r_05_0: 09/10/15 09:54:21 WARN mapred.TaskTracker: Error running child attempt_200910051513_0009_r_05_0: java.io.IOException: Bad connect ack with firstBadLink 10.10.10.55:50010 attempt_200910051513_0009_r_05_0: at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871) attempt_200910051513_0009_r_05_0: at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794) attempt_200910051513_0009_r_05_0: at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077) attempt_200910051513_0009_r_05_0: at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263) attempt_200910051513_0009_r_05_0: 09/10/15 09:54:21 INFO mapred.TaskRunner: Runnning cleanup for the task 09/10/15 09:54:37 INFO mapred.JobClient: map 100% reduce 82% 09/10/15 09:54:41 INFO mapred.JobClient: map 100% reduce 83% 09/10/15 09:54:43 INFO mapred.JobClient: Task Id : attempt_200910051513_0009_r_17_0, Status : FAILED java.io.IOException: Bad connect ack with firstBadLink 10.10.10.56:50010 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263) attempt_200910051513_0009_r_17_0: 09/10/15 09:47:48 WARN conf.Configuration: /h3/hadoop/mapred/hadoop/taskTracker/jobcache/job_200910051513_0009/attempt_200910051513_0009_r_17_0/job.xml:a attempt to override final parameter: hadoop.tmp.dir; Ignoring. attempt_200910051513_0009_r_17_0: 09/10/15 09:47:48 WARN conf.Configuration: /h3/hadoop/mapred/hadoop/taskTracker/jobcache/job_200910051513_0009/attempt_200910051513_0009_r_17_0/job.xml:a attempt to override final parameter: hadoop.rpc.socket.factory.class.default; Ignoring. attempt_200910051513_0009_r_17_0: 09/10/15 09:47:48 WARN conf.Configuration: /h3/hadoop/mapred/hadoop/taskTracker/jobcache/job_200910051513_0009/attempt_200910051513_0009_r_17_0/job.xml:a attempt to override final parameter: tasktracker.http.threads; Ignoring. attempt_200910051513_0009_r_17_0: 09/10/15 09:47:48 WARN conf.Configuration:
Re: Optimization of cpu and i/o usage / other bottlenecks?
Thanks for your help Jason. I actually did reduce the heap size to 400M and it sped things up a few percent. From my experience with jvm's, if you can handle lower amounts of heap your app will run faster because GC is more efficient for smaller garbage collections (which is also why using incremental garbage collection is often desired in web apps). But, the contrast to that is if you give it too little heap it can choke your app and run the garbage collection too often, sort of like it is drowning and gasping for breath. Jason Venner wrote: The value really varies by job and by cluster, the larger the split, the more chance there is that a small number of splits will take much longer to complete than the rest resulting in a long job tail where very little of your cluster is utilized while they complete. The flip side is with very small task the overhead, startup time and co-ordination latency (which has been improved) can cause in efficient utilization of your cluster resource. If you really want to drive up your CPU utilization, reduce your per task memory size to the bare minimum, and your JVM's will consume massive amounts of CPU doing garbage collection :) It happend at a place I worked where ~60% of the job cpu was garbage collection. On Wed, Oct 14, 2009 at 11:26 AM, Chris Seline ch...@searchles.com wrote: That definitely helps a lot! I saw a few people talking about it on the webs, and they say to set the value to Long.MAX_VALUE, but that is not what I have found to be best. I see about 25% improvement at 300MB (3), CPU utilization is up to about 50-70%+, but I am still fine tuning. thanks! Chris Jason Venner wrote: I remember having a problem like this at one point, it was related to the mean run time of my tasks, and the rate that the jobtracker could start new tasks. By increasing the split size until the mean run time of my tasks was in the minutes, I was able to drive up the utilization. On Wed, Oct 14, 2009 at 7:31 AM, Chris Seline ch...@searchles.com wrote: No, there doesn't seem to be all that much network traffic. Most of the time traffic (measured with nethogs) is about 15-30K/s on the master and slaves during map, sometimes it bursts up 5-10 MB/s on a slave for maybe 5-10 seconds on a query that takes 10 minutes, but that is still less than what I see in scp transfers on EC2, which is typically about 30 MB/s. thanks Chris Jason Venner wrote: are your network interface or the namenode/jobtracker/datanodes saturated On Tue, Oct 13, 2009 at 9:05 AM, Chris Seline ch...@searchles.com wrote: I am using the 0.3 Cloudera scripts to start a Hadoop cluster on EC2 of 11 c1.xlarge instances (1 master, 10 slaves), that is the biggest instance available with 20 compute units and 4x 400gb disks. I wrote some scripts to test many (100's) of configurations running a particular Hive query to try to make it as fast as possible, but no matter what I don't seem to be able to get above roughly 45% cpu utilization on the slaves, and not more than about 1.5% wait state. I have also measured network traffic and there don't seem to be bottlenecks there at all. 
Here are some typical CPU utilization lines from top on a slave when running a query: Cpu(s): 33.9%us, 7.4%sy, 0.0%ni, 56.8%id, 0.6%wa, 0.0%hi, 0.5%si, 0.7%st Cpu(s): 33.6%us, 5.9%sy, 0.0%ni, 58.7%id, 0.9%wa, 0.0%hi, 0.4%si, 0.5%st Cpu(s): 33.9%us, 7.2%sy, 0.0%ni, 56.8%id, 0.5%wa, 0.0%hi, 0.6%si, 1.0%st Cpu(s): 38.6%us, 8.7%sy, 0.0%ni, 50.8%id, 0.5%wa, 0.0%hi, 0.7%si, 0.7%st Cpu(s): 36.8%us, 7.4%sy, 0.0%ni, 53.6%id, 0.4%wa, 0.0%hi, 0.5%si, 1.3%st It seems like if tuned properly, I should be able to max out my cpu (or my disk) and get roughly twice the performance I am seeing now. None of the parameters I am tuning seem to be able to achieve this. Adjusting mapred.map.tasks and mapred.reduce.tasks does help somewhat, and setting the io.file.buffer.size to 4096 does better than the default, but the rest of the values I am testing seem to have little positive effect. These are the parameters I am testing, and the values tried: io.sort.factor=2,3,4,5,10,15,20,25,30,50,100 mapred.job.shuffle.merge.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99 io.bytes.per.checksum=256,512,1024,2048,4192 mapred.output.compress=true,false hive.exec.compress.intermediate=true,false hive.map.aggr.hash.min.reduction=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99 mapred.map.tasks=1,2,3,4,5,6,8,10,12,15,20,25,30,40,50,60,75,100,150,200 mapred.child.java.opts=-Xmx400m,-Xmx500m,-Xmx600m,-Xmx700m,-Xmx800m,-Xmx900m,-Xmx1000m,-Xmx1200m,-Xmx1400m,-Xmx1600m,-Xmx2000m mapred.reduce.tasks=5,10,15,20,25,30,35,40,50,60,70,80,100,125,150,200 mapred.merge.recordsBeforeProgress=5000,1,2,3 mapred.job.shuffle.input.buffer.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.99
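For what it's worth, a hedged sketch of driving that heap sweep from a job driver instead of editing the XML each time; the class name is invented, and the input/output and mapper/reducer setup are left out:

import org.apache.hadoop.mapred.JobConf;

public class HeapSweepDriver {
  public static void main(String[] args) throws Exception {
    String[] heaps = {"-Xmx400m", "-Xmx600m", "-Xmx800m"};
    for (String heap : heaps) {
      JobConf conf = new JobConf(HeapSweepDriver.class);
      conf.setJobName("sort-heap-" + heap);
      conf.set("mapred.child.java.opts", heap);  // per-task child JVM options
      // ... set input/output paths, mapper/reducer classes and task counts here,
      // then submit with JobClient.runJob(conf) and record the timings.
    }
  }
}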
Re: Hardware performance from HADOOP cluster
Hi Usmam, So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on high memory jobs so picked 4 only) [9 DN of Dell R300: 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G SATA drives] Using your template for stats, I get the following with no tuning: GENERATE RANDOM DATA Wrote out 90GB of random binary data: Map output records=9198009 The job took 350 seconds. (approximately: 6 minutes) SORT RANDOM GENERATED DATA Map output records= 9197821 Reduce input records=9197821 The job took 2176 seconds. (approximately: 36mins). So pretty similar to your initial benchmark. I will tune it a bit tomorrow and rerun. If you spent time tuning your cluster and it was successful, please can you share your config? Cheers, Tim On Thu, Oct 15, 2009 at 11:32 AM, Usman Waheed usm...@opera.com wrote: Hi Todd, Some changes have been applied to the cluster based on the documentation (URL) you noted below, like file descriptor settings and io.file.buffer.size. I will check out the other settings which I haven't applied yet. My map/reduce slot settings from my hadoop-site.xml and hadoop-default.xml on all nodes in the cluster. _*hadoop-site.xml *_mapred.tasktracker.task.maximum = 2 mapred.tasktracker.map.tasks.maximum = 8 mapred.tasktracker.reduce.tasks.maximum = 8 _* hadoop-default.xml *_mapred.map.tasks = 2 mapred.reduce.tasks = 1 Thanks, Usman This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have you changed the configurations at all? There are some notes on this blog post that might help your performance a bit: http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/ How many map and reduce slots did you configure for the daemons? If you have Ganglia installed you can usually get a good idea of whether you're using your resources well by looking at the graphs while running a job like this sort. -Todd On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed usm...@opera.com wrote: Here are the results i got from my 4 node cluster (correction i noted 5 earlier). One of my nodes out of the 4 is a namenode+datanode both. GENERATE RANDOM DATA Wrote out 40GB of random binary data: Map output records=4088301 The job took 358 seconds. (approximately: 6 minutes). SORT RANDOM GENERATED DATA Map output records=4088301 Reduce input records=4088301 The job took 2136 seconds. (approximately: 35 minutes). VALIDATION OF SORTED DATA The job took 183 seconds. SUCCESS! Validated the MapReduce framework's 'sort' successfully. It would be interesting to see what performance numbers others with a similar setup have obtained. Thanks, Usman I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G drives, and will do the same soon. Got some issues though so it won't start up... Tim On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed usm...@opera.com wrote: Thanks Tim, i will check it out and post my results for comments. -Usman Might it be worth running the http://wiki.apache.org/hadoop/Sort and posting your results for comment? Tim On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed usm...@opera.com wrote: Hi, Is there a way to tell what kind of performance numbers one can expect out of their cluster given a certain set of specs. For example i have 5 nodes in my cluster that all have the following hardware configuration(s): Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack. Thanks, Usman
Re: Hardware performance from HADOOP cluster
Hi Tim, Thanks much for sharing the info. I will most certainly share my configuration settings after applying some tuning at my end. Will let you know the results on this email list. Thanks, Usman Hi Usmam, So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on high memory jobs so picked 4 only) [9 DN of Dell R300: 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G SATA drives] Using your template for stats, I get the following with no tuning: GENERATE RANDOM DATA Wrote out 90GB of random binary data: Map output records=9198009 The job took 350 seconds. (approximately: 6 minutes) SORT RANDOM GENERATED DATA Map output records= 9197821 Reduce input records=9197821 The job took 2176 seconds. (approximately: 36mins). So pretty similar to your initial benchmark. I will tune it a bit tomorrow and rerun. If you spent time tuning your cluster and it was successful, please can you share your config? Cheers, Tim On Thu, Oct 15, 2009 at 11:32 AM, Usman Waheed usm...@opera.com wrote: Hi Todd, Some changes have been applied to the cluster based on the documentation (URL) you noted below, like file descriptor settings and io.file.buffer.size. I will check out the other settings which I haven't applied yet. My map/reduce slot settings from my hadoop-site.xml and hadoop-default.xml on all nodes in the cluster. _*hadoop-site.xml *_mapred.tasktracker.task.maximum = 2 mapred.tasktracker.map.tasks.maximum = 8 mapred.tasktracker.reduce.tasks.maximum = 8 _* hadoop-default.xml *_mapred.map.tasks = 2 mapred.reduce.tasks = 1 Thanks, Usman This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have you changed the configurations at all? There are some notes on this blog post that might help your performance a bit: http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/ How many map and reduce slots did you configure for the daemons? If you have Ganglia installed you can usually get a good idea of whether you're using your resources well by looking at the graphs while running a job like this sort. -Todd On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed usm...@opera.com wrote: Here are the results i got from my 4 node cluster (correction i noted 5 earlier). One of my nodes out of the 4 is a namenode+datanode both. GENERATE RANDOM DATA Wrote out 40GB of random binary data: Map output records=4088301 The job took 358 seconds. (approximately: 6 minutes). SORT RANDOM GENERATED DATA Map output records=4088301 Reduce input records=4088301 The job took 2136 seconds. (approximately: 35 minutes). VALIDATION OF SORTED DATA The job took 183 seconds. SUCCESS! Validated the MapReduce framework's 'sort' successfully. It would be interesting to see what performance numbers others with a similar setup have obtained. Thanks, Usman I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G drives, and will do the same soon. Got some issues though so it won't start up... Tim On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed usm...@opera.com wrote: Thanks Tim, i will check it out and post my results for comments. -Usman Might it be worth running the http://wiki.apache.org/hadoop/Sort and posting your results for comment? Tim On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed usm...@opera.com wrote: Hi, Is there a way to tell what kind of performance numbers one can expect out of their cluster given a certain set of specs. 
For example i have 5 nodes in my cluster that all have the following hardware configuration(s): Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack. Thanks, Usman
Hardware Setup
So my company is looking at only using Dell or HP for our Hadoop cluster, plus a Sun Thumper to back up the data. The prices are OK after a 40% discount, but realistically I am paying twice as much as if I went to Silicon Mechanics, and for a much slower machine. It seems as though the big expense is the disks; even with a 40% discount, $550 per 1TB disk seems crazy expensive. Also, they are pushing me to build a smaller cluster (6 nodes) and I am pushing back for nodes half the size but twice as many. So how much of a performance difference can I expect between 12 nodes with one Xeon 5-series running at 2.26 GHz, 8 GB of RAM and 4x 1TB disks, and a 6-node cluster with two Xeon 5-series running at 2.26 GHz, 16 GB of RAM and 8x 1TB disks? Both setups will also have 2 very small SATA drives in RAID 1 for the OS. I will be doing some stuff with Hadoop and a lot of stuff with HBase. What are the considerations for HDFS performance with a low number of nodes, etc.?
Re: Hardware performance from HADOOP cluster
Hi Tim, I assume those are single proc machines? I got 649 secs on 70GB of data for our 7-node cluster (~11 mins), but we have dual quad Nehalems (2.26Ghz). On Thu, Oct 15, 2009 at 11:34 AM, tim robertson timrobertson...@gmail.comwrote: Hi Usmam, So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on high memory jobs so picked 4 only) [9 DN of Dell R300: 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G SATA drives] Using your template for stats, I get the following with no tuning: GENERATE RANDOM DATA Wrote out 90GB of random binary data: Map output records=9198009 The job took 350 seconds. (approximately: 6 minutes) SORT RANDOM GENERATED DATA Map output records= 9197821 Reduce input records=9197821 The job took 2176 seconds. (approximately: 36mins). So pretty similar to your initial benchmark. I will tune it a bit tomorrow and rerun. If you spent time tuning your cluster and it was successful, please can you share your config? Cheers, Tim On Thu, Oct 15, 2009 at 11:32 AM, Usman Waheed usm...@opera.com wrote: Hi Todd, Some changes have been applied to the cluster based on the documentation (URL) you noted below, like file descriptor settings and io.file.buffer.size. I will check out the other settings which I haven't applied yet. My map/reduce slot settings from my hadoop-site.xml and hadoop-default.xml on all nodes in the cluster. _*hadoop-site.xml *_mapred.tasktracker.task.maximum = 2 mapred.tasktracker.map.tasks.maximum = 8 mapred.tasktracker.reduce.tasks.maximum = 8 _* hadoop-default.xml *_mapred.map.tasks = 2 mapred.reduce.tasks = 1 Thanks, Usman This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have you changed the configurations at all? There are some notes on this blog post that might help your performance a bit: http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/ How many map and reduce slots did you configure for the daemons? If you have Ganglia installed you can usually get a good idea of whether you're using your resources well by looking at the graphs while running a job like this sort. -Todd On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed usm...@opera.com wrote: Here are the results i got from my 4 node cluster (correction i noted 5 earlier). One of my nodes out of the 4 is a namenode+datanode both. GENERATE RANDOM DATA Wrote out 40GB of random binary data: Map output records=4088301 The job took 358 seconds. (approximately: 6 minutes). SORT RANDOM GENERATED DATA Map output records=4088301 Reduce input records=4088301 The job took 2136 seconds. (approximately: 35 minutes). VALIDATION OF SORTED DATA The job took 183 seconds. SUCCESS! Validated the MapReduce framework's 'sort' successfully. It would be interesting to see what performance numbers others with a similar setup have obtained. Thanks, Usman I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G drives, and will do the same soon. Got some issues though so it won't start up... Tim On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed usm...@opera.com wrote: Thanks Tim, i will check it out and post my results for comments. -Usman Might it be worth running the http://wiki.apache.org/hadoop/Sortand posting your results for comment? Tim On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed usm...@opera.com wrote: Hi, Is there a way to tell what kind of performance numbers one can expect out of their cluster given a certain set of specs. 
For example i have 5 nodes in my cluster that all have the following hardware configuration(s): Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack. Thanks, Usman
job during datanode decommissioning?
Hi, Is it safe to run jobs during datanode decommissioning? Around 10% of the total datanodes are being decommissioned and I want to run a job meanwhile. Wanted to confirm whether it is safe to do this. Thanks, Murali Krishna
Re: Hardware Setup
Hey Alex, In order to lower cost, you'll probably want to order the worker nodes without hard drives then buy them separately. HDFS provides a software-level RAID, so most of the reasonings behind buying hard drives from Dell/HP are irrelevant - you are just paying an extra $400 per hard drive. I know Dell sells the R410 which has 4 SATA bays; I'm sure Steve knows an HP model that has something similar. However, BE VERY CAREFUL when you do this. From experience, a certain large manufacturer (I don't know about Dell/HP) will refuse to ship (or sell separately) hard drive trays if you order their machine without hard drives. When this happened to us, we were not able to return the machines because they were custom orders. Eventually, we had to get someone to go to the machine shop and build 72 hard drive trays for us. Worst. Experience. Ever. So, ALWAYS ASK and make sure that you can buy empty hard drive trays for that specific model (or at least that it ships with them). Brian On Oct 15, 2009, at 10:48 AM, Alex Newman wrote: So my company is looking at only using dell or hp for our hadoop cluster and a sun thumper to backup the data. The prices are ok, after a 40% discount, but realistically I am paying twice as much as if I went to silicon mechanics, and with a much slower machine. It seems as though the big expense are the disks. Even with a 40% discount 550$ per 1tb disk seems crazy expensive. Also, they are pushing me to build a smaller cluster (6 nodes) and I am pushing back for nodes half the size but having twice as many. So how much of a performance difference can I expect btwn 12 nodes with 1 xeon 5 series running at 2.26 ghz 8 gigs of ram with 4 1 tb disks and a 6 node cluster with 2 xeon 5 series running at 2.26 16 gigs of ram with 8 1 tb disks. Both setups will also have 2 very small sata drives in raid 1 for the OS. I will be doing some stuff with hadoop and a lot of stuff with HBase. What are the considerations with HDFS performance with a low number of nodes,etc.
Re: Hardware Setup
After the discount, an equivalently configured Dell comes about 10-20% over the Silicon Mechanics price. It's close enough that unless you're spending 100k it won't make that much of a difference. Talk to a rep, call them out on the ridiculous drive pricing, buy at the end of their fiscal quarter. Strip down the machines (no RAID cards, no DVD/CD drive, non-redundant power supply, etc.) to get the price lower. No need for dedicated SATA drives with RAID for your OS. Most of that is accessed during boot time so it won't contend that much with HDFS. We just got a bunch of Dell R410s with 24GB ram, 2x2.26Ghz procs and 4x1TB drives. I would go for beefier nodes with less quantity. Of course, some of this depends on the volume of data and type of processing that you do. If you're running HBase, you would benefit from lots of RAM. You also have to remember that dual socket configs are more power efficient so you can fit more in a single rack. Cheers, - P On Thu, Oct 15, 2009 at 11:48 AM, Alex Newman posi...@gmail.com wrote: So my company is looking at only using dell or hp for our hadoop cluster and a sun thumper to backup the data. The prices are ok, after a 40% discount, but realistically I am paying twice as much as if I went to silicon mechanics, and with a much slower machine. It seems as though the big expense are the disks. Even with a 40% discount 550$ per 1tb disk seems crazy expensive. Also, they are pushing me to build a smaller cluster (6 nodes) and I am pushing back for nodes half the size but having twice as many. So how much of a performance difference can I expect btwn 12 nodes with 1 xeon 5 series running at 2.26 ghz 8 gigs of ram with 4 1 tb disks and a 6 node cluster with 2 xeon 5 series running at 2.26 16 gigs of ram with 8 1 tb disks. Both setups will also have 2 very small sata drives in raid 1 for the OS. I will be doing some stuff with hadoop and a lot of stuff with HBase. What are the considerations with HDFS performance with a low number of nodes,etc.
Re: Hardware Setup
Brian Bockelman wrote: Hey Alex, In order to lower cost, you'll probably want to order the worker nodes without hard drives then buy them separately. HDFS provides a software-level RAID, so most of the reasonings behind buying hard drives from Dell/HP are irrelevant - you are just paying an extra $400 per hard drive. I know Dell sells the R410 which has 4 SATA bays; I'm sure Steve knows an HP model that has something similar. I will start the official disclaimer I make no recommendations about hardware here, so as not to get into trouble. Talk to you reseller or account team. You can get servers with lots of drives in them DL180 and SL170z are acronyms that spring to mind. Big issues to consider * server:CPU ratio * power budget * rack weight * do you ever plan to stick in more CPUs? Some systems take this, others don't. * Intel vs AMD * How much ECC RAM can you afford. And yes, it must be ECC. server disks are higher RPM and specced for more hours than consumer disks, I don't know what that means in terms of lifespan, but the RPM translates into bandwidth off the disk. However, BE VERY CAREFUL when you do this. From experience, a certain large manufacturer (I don't know about Dell/HP) will refuse to ship (or sell separately) hard drive trays if you order their machine without hard drives. When this happened to us, we were not able to return the machines because they were custom orders. Eventually, we had to get someone to go to the machine shop and build 72 hard drive trays for us. that is what physics PhD students are for, at least they didn't get a lifetimes radiation dose for this job Worst. Experience. Ever. So, ALWAYS ASK and make sure that you can buy empty hard drive trays for that specific model (or at least that it ships with them). Brian On Oct 15, 2009, at 10:48 AM, Alex Newman wrote: So my company is looking at only using dell or hp for our hadoop cluster and a sun thumper to backup the data. The prices are ok, after a 40% discount, but realistically I am paying twice as much as if I went to silicon mechanics, and with a much slower machine. It seems as though the big expense are the disks. Even with a 40% discount 550$ per 1tb disk seems crazy expensive. Also, they are pushing me to build a smaller cluster (6 nodes) and I am pushing back for nodes half the size but having twice as many. So how much of a performance difference can I expect btwn 12 nodes with 1 xeon 5 series running at 2.26 ghz 8 gigs of ram with 4 1 tb disks and a 6 node cluster with 2 xeon 5 series running at 2.26 16 gigs of ram with 8 1 tb disks. Both setups will also have 2 very small sata drives in raid 1 for the OS. I will be doing some stuff with hadoop and a lot of stuff with HBase. What are the considerations with HDFS performance with a low number of nodes,etc. It's an interesting Q as to what is better, fewer nodes with more storage/CPU or more, smaller nodes. Bigger servers * more chance of running code near the data * less data moved over the LAN at shuffle time * RAM consumption can be more agile across tasks. * increased chance of disk failure on a node; hadoop handles that very badly right now (pre 0.20 -datanode goes offline) Smaller servers * easier to place data redundantly across machines * less RAM taken up by other people's jobs * more nodes stay up when a disk fails (less important on 0.20 onwards) * when a node goes down, less data to re-replicate across the other machines 1. I would like to hear other people's opinions, 2. 
The gridmix 2 benchmarking stuff tries to create synthetic benchmarks from your real data runs. Try that, collect some data, then go to your suppliers. -Steve COI disclaimer signature: --- Hewlett-Packard Limited Registered Office: Cain Road, Bracknell, Berks RG12 1HN Registered No: 690597 England
Re: Hardware Setup
On Thu, Oct 15, 2009 at 12:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote: No need for dedicated SATA drives with RAID for your OS. Most of that is accessed during boot time so it won't contend that much with HDFS. You may want to RAID your OS. If you lose a datanode with a large volume of data (say 8 TB), Hadoop will begin the process of re-replicating that data elsewhere, and that can use cluster resources. You MIGHT want to avoid that, or maybe you do not care. Having 2 disks for the OS is a waste of bays, so we got clever. Take a system with 8 drives @ 1TB. Slice off ~30 GB from two of the disks and use a Linux software RAID-1 mirror for the OS + swap. Now you don't need two separate disks for the OS and you don't run the risk of losing that one disk that takes down the entire DataNode. Forgot to mention it, but that is exactly what we do. We considered net-booting as an option, but we were time-constrained and so didn't look that deeply into it. I'd be interested in hearing from others who have used network boot...
RE: Can't I use hive? What does this exception mean?
Clark, I think the error shows that Hive failed when trying to run some DDL against the metastore database. How did you set up your Hive metadata database? If it is Derby, make sure you have only one connection in one directory. (Sometimes the error may be caused when you have another connection under the same directory and forget to close it.) If it is MySQL, make sure you have downloaded the JDBC driver, which is not included in the Hive package if you downloaded it from Apache. (It is included in the Cloudera version.) Let me know if you have any further questions. Tan Li (from shopping.com) -Original Message- From: Clark Yang (杨卓荦) [mailto:clarkyzl-h...@yahoo.com.cn] Sent: Thursday, October 15, 2009 4:24 AM To: common-user@hadoop.apache.org Subject: Can't I use hive? What does this exception mean? I have already installed Hadoop correctly, but what does this mean? ~/hive/build/dist/bin$ ./hive Hive history file=/tmp/yangzhuoluo/hive_job_log_yangzhuoluo_200910151919_995767046.txt hive> show tables; FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask Can anybody tell me what the exception means? Thank you all very much.
Re: job during datanode decommisioning?
If you have replication set up, it shouldn't be a problem... On 10/15/09, Murali Krishna. P muralikpb...@yahoo.com wrote: Hi, Is it safe to run jobs during datanode decommissioning? Around 10% of the total datanodes are being decommissioned and I want to run a job meanwhile. Wanted to confirm whether it is safe to do this. Thanks, Murali Krishna -- Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: Hardware Setup
On Thu, Oct 15, 2009 at 12:51 PM, Patrick Angeles patrickange...@gmail.com wrote: On Thu, Oct 15, 2009 at 12:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote: No need for dedicated SATA drives with RAID for your OS. Most of that is accessed during boot time so it won't contend that much with HDFS. You may want to RAID your OS. If you lose a datanode with a large volume of data (say 8 TB), Hadoop will begin the process of re-replicating that data elsewhere, and that can use cluster resources. You MIGHT want to avoid that, or maybe you do not care. Having 2 disks for the OS is a waste of bays, so we got clever. Take a system with 8 drives @ 1TB. Slice off ~30 GB from two of the disks and use a Linux software RAID-1 mirror for the OS + swap. Now you don't need two separate disks for the OS and you don't run the risk of losing that one disk that takes down the entire DataNode. Forgot to mention it, but that is exactly what we do. We considered net-booting as an option, but we were time-constrained and so didn't look that deeply into it. I'd be interested in hearing from others who have used network boot... Patrick, I saw someone troubleshooting netboot recently on the list. There was some minor hangup, but I think they got it working: http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200909.mbox/%3c4ac22618.5080...@sci.utah.edu%3e Ed
Re: NullPointer on starting NameNode
Thanks Todd - it's all working perfectly now. By the way, where is the Cloudera repository? On Wed, 14 Oct 2009 10:02:22 -0700, Todd Lipcon t...@cloudera.com wrote: Hi Bryn, Just to let you know, we've queued the patch Hairong mentioned for the next update to our distribution, due out around the end of this month. Thanks! -Todd On Wed, Oct 14, 2009 at 9:15 AM, Bryn Divey b...@bengueladev.com wrote: Hi all, I'm getting the following on initializing my NameNode. The actual line throwing the exception is if (atime != -1) { - long inodeTime = inode.getAccessTime(); Have I corrupted the fsimage or something? This is on the Cloudera packaging of Hadoop 0.20.1+133. Regards, Bryn 09/10/14 18:12:02 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 09/10/14 18:12:02 INFO namenode.NameNode: Namenode up at: 10.23.4.172/10.23.4.172:8020 09/10/14 18:12:02 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 09/10/14 18:12:02 INFO metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext 09/10/14 18:12:03 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop 09/10/14 18:12:03 INFO namenode.FSNamesystem: supergroup=supergroup 09/10/14 18:12:03 INFO namenode.FSNamesystem: isPermissionEnabled=false 09/10/14 18:12:03 INFO metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext 09/10/14 18:12:03 INFO namenode.FSNamesystem: Registered FSNamesystemStatusMBean 09/10/14 18:12:03 INFO common.Storage: Number of files = 80 09/10/14 18:12:03 INFO common.Storage: Number of files under construction = 0 09/10/14 18:12:03 INFO common.Storage: Image file of size 19567 loaded in 0 seconds. 09/10/14 18:12:03 ERROR namenode.NameNode: java.lang.NullPointerException at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1232) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1221) at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:776) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:292) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:206) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:288) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:968) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:977)
Re: Hardware Setup
On 10/15/09 9:42 AM, Steve Loughran ste...@apache.org wrote: It's an interesting Q as to what is better, fewer nodes with more storage/CPU or more, smaller nodes. Bigger servers * more chance of running code near the data * less data moved over the LAN at shuffle time * RAM consumption can be more agile across tasks. * increased chance of disk failure on a node; hadoop handles that very badly right now (pre 0.20 -datanode goes offline) Smaller servers * easier to place data redundantly across machines * less RAM taken up by other people's jobs * more nodes stay up when a disk fails (less important on 0.20 onwards) * when a node goes down, less data to re-replicate across the other machines 1. I would like to hear other people's opinions, - Don't forget about the more obvious things: if you go with more disks per server, that also likely means fewer controllers doing IO. - Keep in mind that fewer CPUs/less RAM = fewer task slots available. While your workflow may not be CPU-bound in the traditional sense, if you are spawning 5000 maps, you're going to need quite a few slots to get your work done in a reasonable time. - To counter that, it seems we can run more tasks per node in LI's 2U config than Y!'s 1U config. But this might be an apples/oranges comparison (LI uses Solaris+ZFS, Y! uses Linux+ext3). 2. The gridmix 2 benchmarking stuff tries to create synthetic benchmarks from your real data runs. Try that, collect some data, then go to your suppliers. +1
Re: NullPointer on starting NameNode
On Thu, Oct 15, 2009 at 10:14 AM, b...@bengueladev.com wrote: Thanks Todd - it's all working perfectly now. By the way, where is the Cloudera repository? http://archive.cloudera.com/ If you have any questions that are Cloudera-specific (around packaging, etc), please use our GetSatisfaction page: http://getsatisfaction.com/cloudera/products/cloudera_cloudera_s_distribution_for_hadoop (we don't want to confuse people on this list if anything is specific to our distro) -Todd On Wed, 14 Oct 2009 10:02:22 -0700, Todd Lipcon t...@cloudera.com wrote: Hi Bryn, Just to let you know, we've queued the patch Hairong mentioned for the next update to our distribution, due out around the end of this month. Thanks! -Todd On Wed, Oct 14, 2009 at 9:15 AM, Bryn Divey b...@bengueladev.com wrote: Hi all, I'm getting the following on initializing my NameNode. The actual line throwing the exception is if (atime != -1) { - long inodeTime = inode.getAccessTime(); Have I corrupted the fsimage or something? This is on the Cloudera packaging of Hadoop 0.20.1+133. Regards, Bryn 09/10/14 18:12:02 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=8020 09/10/14 18:12:02 INFO namenode.NameNode: Namenode up at: 10.23.4.172/10.23.4.172:8020 09/10/14 18:12:02 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 09/10/14 18:12:02 INFO metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext 09/10/14 18:12:03 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop 09/10/14 18:12:03 INFO namenode.FSNamesystem: supergroup=supergroup 09/10/14 18:12:03 INFO namenode.FSNamesystem: isPermissionEnabled=false 09/10/14 18:12:03 INFO metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext 09/10/14 18:12:03 INFO namenode.FSNamesystem: Registered FSNamesystemStatusMBean 09/10/14 18:12:03 INFO common.Storage: Number of files = 80 09/10/14 18:12:03 INFO common.Storage: Number of files under construction = 0 09/10/14 18:12:03 INFO common.Storage: Image file of size 19567 loaded in 0 seconds. 09/10/14 18:12:03 ERROR namenode.NameNode: java.lang.NullPointerException at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1232) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1221) at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:776) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:292) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:206) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:288) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:968) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:977)
How can I deploy 100 blocks onto 10 datanodes with each node having 10 blocks?
Hi everyone. I am working on a project with Hadoop and have now come across a problem. How can I deploy 100 files, each with a single block (by setting the block size and controlling the file size), onto 10 datanodes, and make sure each datanode has 10 blocks? I know the file system places the blocks automatically, but I want to make sure the files in question end up distributed evenly. How can I do this with the Hadoop tools or API? Huang Qian (黄骞) Institute of Remote Sensing and GIS, Peking University Phone: (86-10) 5276-3109 Mobile: (86) 1590-126-8883 Address: Rm. 554, Building 1, ChangChunXinYuan, Peking Univ., Beijing (100871), CHINA
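HDFS itself decides which datanodes receive each block, so there is no public API to pin block N to node M; what a client can control is the per-file block size and replication factor at write time. Below is a rough Java sketch of that idea (the target path, payload size and block size are invented for illustration, not from the thread): each file is written with a block size at least as large as the file, so it occupies exactly one block, and even spread across the datanodes is then left to the normal placement policy and, if needed, the balancer.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleBlockFiles {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        byte[] payload = new byte[8 * 1024 * 1024];   // 8 MB of data per file (made-up size)
        long blockSize = 64L * 1024 * 1024;           // block size >= file size, so each file is one block
        short replication = 1;                        // one replica, so each block lives on exactly one node
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);

        for (int i = 0; i < 100; i++) {
            // Hypothetical target directory; adjust to your own layout.
            Path file = new Path("/user/huang/blocks/part-" + i);
            FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize);
            out.write(payload);
            out.close();
        }
        fs.close();
    }
}

If the default placement still leaves the distribution uneven, running the HDFS balancer afterwards is the usual way to smooth it out.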
Re: Hardware performance from HADOOP cluster
Yeah they are single proc machines and other than setting to 4 map/reduces, completely 0.20.1 vanilla installation. I will tune it up in the morning based on what I can find on the web (e.g. cloudera guidelines) and post the results. I am going to be running HBase on top of this, but want to make sure the HDFS/MR is running sound before continuing. Seems there are a few people at the moment setting up clusters - might it be worth adding our config and results to http://wiki.apache.org/hadoop/HardwareBenchmarks ? For people like me (first cluster set up from scratch - previously used the EC2 scripts) it is nice to sanity check things look about right. The mailing lists suggest there are a few small clusters of medium spec machines springing up. Cheers, Tim On Thu, Oct 15, 2009 at 5:52 PM, Patrick Angeles patrickange...@gmail.com wrote: Hi Tim, I assume those are single proc machines? I got 649 secs on 70GB of data for our 7-node cluster (~11 mins), but we have dual quad Nehalems (2.26Ghz). On Thu, Oct 15, 2009 at 11:34 AM, tim robertson timrobertson...@gmail.comwrote: Hi Usmam, So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on high memory jobs so picked 4 only) [9 DN of Dell R300: 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G SATA drives] Using your template for stats, I get the following with no tuning: GENERATE RANDOM DATA Wrote out 90GB of random binary data: Map output records=9198009 The job took 350 seconds. (approximately: 6 minutes) SORT RANDOM GENERATED DATA Map output records= 9197821 Reduce input records=9197821 The job took 2176 seconds. (approximately: 36mins). So pretty similar to your initial benchmark. I will tune it a bit tomorrow and rerun. If you spent time tuning your cluster and it was successful, please can you share your config? Cheers, Tim On Thu, Oct 15, 2009 at 11:32 AM, Usman Waheed usm...@opera.com wrote: Hi Todd, Some changes have been applied to the cluster based on the documentation (URL) you noted below, like file descriptor settings and io.file.buffer.size. I will check out the other settings which I haven't applied yet. My map/reduce slot settings from my hadoop-site.xml and hadoop-default.xml on all nodes in the cluster. _*hadoop-site.xml *_mapred.tasktracker.task.maximum = 2 mapred.tasktracker.map.tasks.maximum = 8 mapred.tasktracker.reduce.tasks.maximum = 8 _* hadoop-default.xml *_mapred.map.tasks = 2 mapred.reduce.tasks = 1 Thanks, Usman This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have you changed the configurations at all? There are some notes on this blog post that might help your performance a bit: http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/ How many map and reduce slots did you configure for the daemons? If you have Ganglia installed you can usually get a good idea of whether you're using your resources well by looking at the graphs while running a job like this sort. -Todd On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed usm...@opera.com wrote: Here are the results i got from my 4 node cluster (correction i noted 5 earlier). One of my nodes out of the 4 is a namenode+datanode both. GENERATE RANDOM DATA Wrote out 40GB of random binary data: Map output records=4088301 The job took 358 seconds. (approximately: 6 minutes). SORT RANDOM GENERATED DATA Map output records=4088301 Reduce input records=4088301 The job took 2136 seconds. (approximately: 35 minutes). VALIDATION OF SORTED DATA The job took 183 seconds. SUCCESS! 
Validated the MapReduce framework's 'sort' successfully. It would be interesting to see what performance numbers others with a similar setup have obtained. Thanks, Usman I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G drives, and will do the same soon. Got some issues though so it won't start up... Tim On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed usm...@opera.com wrote: Thanks Tim, i will check it out and post my results for comments. -Usman Might it be worth running the http://wiki.apache.org/hadoop/Sortand posting your results for comment? Tim On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed usm...@opera.com wrote: Hi, Is there a way to tell what kind of performance numbers one can expect out of their cluster given a certain set of specs. For example i have 5 nodes in my cluster that all have the following hardware configuration(s): Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack. Thanks, Usman
Re: Hardware Setup
Hi Alex, I am also doing a little on HBase - I think I have heard that a few higher memory machines with more spindles and cores per machine beat more smaller machines with similar total capacity (I *guess* this is due to memory buffers and data locality). Please don't take my word for it, but I recommend posting the same to the HBase list - http://hadoop.apache.org/hbase/mailing_lists.html#Users. Cheers, Tim On Thu, Oct 15, 2009 at 7:54 PM, Allen Wittenauer awittena...@linkedin.com wrote: On 10/15/09 9:42 AM, Steve Loughran ste...@apache.org wrote: It's an interesting Q as to what is better, fewer nodes with more storage/CPU or more, smaller nodes. Bigger servers * more chance of running code near the data * less data moved over the LAN at shuffle time * RAM consumption can be more agile across tasks. * increased chance of disk failure on a node; hadoop handles that very badly right now (pre 0.20 -datanode goes offline) Smaller servers * easier to place data redundantly across machines * less RAM taken up by other people's jobs * more nodes stay up when a disk fails (less important on 0.20 onwards) * when a node goes down, less data to re-replicate across the other machines 1. I would like to hear other people's opinions, - Don't forget the about the more obvious things: if you go with more disks per server, that also means likely means less controllers doing IO. - Keep in mind that fewer CPUs/less RAM=less task slots available. While your workflow may not be CPU-bound in the traditional sense, if you are spawning 5000 maps, you're going to need quite a few slots to get your work done in a reasonable time. - To counter that, it seems we can run more tasks-per-node in LI's 2U config than Y!'s 1U config. But this might be an apples/oranges comparison (LI uses Solaris+ZFS, Y! uses Linux+ext3). 2. The gridmix 2 benchmarking stuff tries to create synthetic benchmarks from your real data runs. Try that, collect some data, then go to your suppliers. +1
Hadoop Developer Needed
Overview The SQL Data Migration Specialist plays a crucial role in converting new Client's data onto Brilig's service platforms. We are looking for a talented and energetic full-time freelance programmer to work both remotely and onsite at our midtown Manhattan location. The Specialist will work with our clients' technical teams to determine the optimal formats and requirements to create files for subsequent import into a Brilig remote database during the implementation process. This process involves extracting, scrubbing, combining, transforming, validating and importing large data tables into final data sets suitable for loading into Brilig's defined databases. The Specialist will be responsible for creating/editing the database structure and writing of all import scripts and programs. The ability to work on multiple projects simultaneously while meeting tight deadlines is critical. This project will last for 3 months but may be extended or may eventually lead to a full time position in our fun and exciting startup. Must be able to travel to client meetings and work independently. Responsibilities: - Subject Matter Expert on software tools used in the entire data migration process from extraction to validation and load - Design, develop and execute quality data movement processes that are consistent, repeatable and scalable - Streamline testing, audit and validation processes through data scrubbing routines and presentation of audit reports prior to load - Roll out newly developed processes via documentation and training - Maintain and manage a template library of executed solutions to leverage against future opportunities - Identify, clarify, and resolve issues and risks, escalating them as needed - Build and nourish strong business relationships with external clients Please include: - Salary Requirements - Availability Experience - At least 3-5 years experience in the development of java applications - Use of XML and other protocols for data exchange between systems - SQL database design and implementation - Experience with Eclipse, Maven, and SVN a plus - Experience with Htable and Hadoop a big plus - Excellent communication skills with both technical and non-technical colleagues - Upper management and client facing skills - Interest in keeping up with technology advances PLEASE NOTE: US citizens and Green Card Holders and those authorized to work in the US only. We are unable to sponsor or transfer H-1B candidates. Contact: Alex Levin, COO Brilig ale...@brilig.com -- View this message in context: http://www.nabble.com/Hadoop-Developer-Needed-tp25914537p25914537.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
type mismatch, newbie help
Hi, I am new to Hadoop so this might be an easy question for someone to help me with. I continually am getting this exception (my code follows below) java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:807) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:504) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) Running 0.20.1 I have a text file with lines of data separated by carriage returns. This is properly stored in a directory within HDFS. I only have a Mapping task for processing this file, after the mapping is done it should go straight to output, No reduce or combiner functions. I am just trying to test to see if this will run. The mapper just takes the data sent to it and adds it back to the collector as 2 text values. My Mapper: public class MyTestMapper extends Mapper<LongWritable,Text,Text,Text> { public void map(Object key, Text value, Context context) throws IOException, InterruptedException { String key = key.toString(); String line = value.toString(); String id = extractId(line); String reformattedLine = reformatLine(line); context.write(new Text(id), new Text(reformattedLine)); } } My job submission code: - Job job = new Job(conf); job.setJarByClass(MyTestMapper.class); job.setMapperClass(MyTestMapper.class); FileInputFormat.addInputPath(job, new Path("/myDir/sample.txt")); FileOutputFormat.setOutputPath(job, new Path("/myDir/output/results-" + System.currentTimeMillis() + ".txt")); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(Text.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(Text.class); job.submit();
Re: Hadoop Developer Needed
Hi Alex, I'm a senior Java/J2EE developer/architect. I am American, but live abroad. I've been using Hadoop for the past year, but have many years experience in the field of large-scale data analysis, warehousing, etc. Let me know what you think. Cheers, Lajos alevin wrote: Overview The SQL Data Migration Specialist plays a crucial role in converting new Client's data onto Brilig's service platforms. We are looking for a talented and energetic full-time freelance programmer to work both remotely and onsite at our midtown Manhattan location. The Specialist will work with our clients' technical teams to determine the optimal formats and requirements to create files for subsequent import into a Brilig remote database during the implementation process. This process involves extracting, scrubbing, combining, transforming, validating and importing large data tables into final data sets suitable for loading into Brilig's defined databases. The Specialist will be responsible for creating/editing the database structure and writing of all import scripts and programs. The ability to work on multiple projects simultaneously while meeting tight deadlines is critical. This project will last for 3 months but may be extended or may eventually lead to a full time position in our fun and exciting startup. Must be able to travel to client meetings and work independently. Responsibilities: - Subject Matter Expert on software tools used in the entire data migration process from extraction to validation and load - Design, develop and execute quality data movement processes that are consistent, repeatable and scalable - Streamline testing, audit and validation processes through data scrubbing routines and presentation of audit reports prior to load - Roll out newly developed processes via documentation and training - Maintain and manage a template library of executed solutions to leverage against future opportunities - Identify, clarify, and resolve issues and risks, escalating them as needed - Build and nourish strong business relationships with external clients Please include: - Salary Requirements - Availability Experience - At least 3-5 years experience in the development of java applications - Use of XML and other protocols for data exchange between systems - SQL database design and implementation - Experience with Eclipse, Maven, and SVN a plus - Experience with Htable and Hadoop a big plus - Excellent communication skills with both technical and non-technical colleagues - Upper management and client facing skills - Interest in keeping up with technology advances PLEASE NOTE: US citizens and Green Card Holders and those authorized to work in the US only. We are unable to sponsor or transfer H-1B candidates. Contact: Alex Levin, COO Brilig ale...@brilig.com -- *** The 'Last-Chance' Architect www.galatea.com (US) +1 303 731 3116 (UK) +44 20 8144 4367 ***
Re: Optimization of cpu and i/o usage / other bottlenecks?
What Hadoop version? On a clusster this size there are two things to check right away: 1. In the Hadoop UI, during the job, are the reduce and map slots close to being filled up most of the time, or are tasks completing faster than the scheduler can keep up so that there are often many empty slots? For 0.19.x and 0.20.x on a small cluster like this, use the Fair Scheduler and make sure the configuration parameter that allows it to schedule more than one task per heartbeat is on (at least one map and one reduce per, which is supported in 0.19.x). This alone will cut times down if the number of map and reduce tasks is at least 2x the number of nodes. 2. Your CPU, Disk and Network aren't saturated -- take a look at the logs of the reduce tasks and look for long delays in the shuffle. Utilization is throttled by a bug in the reducer shuffle phase, not fixed until 0.21. Simply put, a single reduce task won't fetch more than one map output from another node every 2 seconds (though it can fetch from multiple nodes at once). Fix this by commenting out one line in 0.18.x, 0.19.x or 0.20.x -- see my comment here: http://issues.apache.org/jira/browse/MAPREDUCE-318 From June 10 2009. I saw shuffle times on small clusters with large map task count per node ratio decrease by a factor of 30 from that one line fix. It was the only way to get the network to ever be close to saturation on any node. The delays for low latency jobs on smaller clusters are predominantly artificial due to the nature of most RPC being ping-response and most design and testing done for large clusters of machines that only run a couple maps or reduces per TaskTracker. Do the above, and you won't be nearly as sensitive to the size of data per task for low latency jobs as out-of-the-box Hadoop. Your overall utilization will go up quite a bit. -Scott On 10/14/09 7:31 AM, Chris Seline ch...@searchles.com wrote: No, there doesn't seem to be all that much network traffic. Most of the time traffic (measured with nethogs) is about 15-30K/s on the master and slaves during map, sometimes it bursts up 5-10 MB/s on a slave for maybe 5-10 seconds on a query that takes 10 minutes, but that is still less than what I see in scp transfers on EC2, which is typically about 30 MB/s. thanks Chris Jason Venner wrote: are your network interface or the namenode/jobtracker/datanodes saturated On Tue, Oct 13, 2009 at 9:05 AM, Chris Seline ch...@searchles.com wrote: I am using the 0.3 Cloudera scripts to start a Hadoop cluster on EC2 of 11 c1.xlarge instances (1 master, 10 slaves), that is the biggest instance available with 20 compute units and 4x 400gb disks. I wrote some scripts to test many (100's) of configurations running a particular Hive query to try to make it as fast as possible, but no matter what I don't seem to be able to get above roughly 45% cpu utilization on the slaves, and not more than about 1.5% wait state. I have also measured network traffic and there don't seem to be bottlenecks there at all. 
Here are some typical CPU utilization lines from top on a slave when running a query: Cpu(s): 33.9%us, 7.4%sy, 0.0%ni, 56.8%id, 0.6%wa, 0.0%hi, 0.5%si, 0.7%st Cpu(s): 33.6%us, 5.9%sy, 0.0%ni, 58.7%id, 0.9%wa, 0.0%hi, 0.4%si, 0.5%st Cpu(s): 33.9%us, 7.2%sy, 0.0%ni, 56.8%id, 0.5%wa, 0.0%hi, 0.6%si, 1.0%st Cpu(s): 38.6%us, 8.7%sy, 0.0%ni, 50.8%id, 0.5%wa, 0.0%hi, 0.7%si, 0.7%st Cpu(s): 36.8%us, 7.4%sy, 0.0%ni, 53.6%id, 0.4%wa, 0.0%hi, 0.5%si, 1.3%st It seems like if tuned properly, I should be able to max out my cpu (or my disk) and get roughly twice the performance I am seeing now. None of the parameters I am tuning seem to be able to achieve this. Adjusting mapred.map.tasks and mapred.reduce.tasks does help somewhat, and setting the io.file.buffer.size to 4096 does better than the default, but the rest of the values I am testing seem to have little positive effect. These are the parameters I am testing, and the values tried: io.sort.factor=2,3,4,5,10,15,20,25,30,50,100 mapred.job.shuffle.merge.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99 io.bytes.per.checksum=256,512,1024,2048,4192 mapred.output.compress=true,false hive.exec.compress.intermediate=true,false hive.map.aggr.hash.min.reduction=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99 mapred.map.tasks=1,2,3,4,5,6,8,10,12,15,20,25,30,40,50,60,75,100,150,200 mapred.child.java.opts=-Xmx400m,-Xmx500m,-Xmx600m,-Xmx700m,-Xmx800m,-Xmx900m,-Xmx1000m,-Xmx1200m,-Xmx1400m,-Xmx1600m,-Xmx2000m mapred.reduce.tasks=5,10,15,20,25,30,35,40,50,60,70,80,100,125,150,200 mapred.merge.recordsBeforeProgress=5000,1,2,3 mapred.job.shuffle.input.buffer.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.99 io.sort.spill.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.99
Re: Hadoop Developer Needed
Whoops guys, sorry. Hit the reply too soon ;) Lajos Last-chance Architect wrote: Hi Alex, I'm a senior Java/J2EE developer/architect. I am American, but live abroad. I've been using Hadoop for the past year, but have many years experience in the field of large-scale data analysis, warehousing, etc. Let me know what you think. Cheers, Lajos alevin wrote: Overview The SQL Data Migration Specialist plays a crucial role in converting new Client's data onto Brilig's service platforms. We are looking for a talented and energetic full-time freelance programmer to work both remotely and onsite at our midtown Manhattan location. The Specialist will work with our clients' technical teams to determine the optimal formats and requirements to create files for subsequent import into a Brilig remote database during the implementation process. This process involves extracting, scrubbing, combining, transforming, validating and importing large data tables into final data sets suitable for loading into Brilig's defined databases. The Specialist will be responsible for creating/editing the database structure and writing of all import scripts and programs. The ability to work on multiple projects simultaneously while meeting tight deadlines is critical. This project will last for 3 months but may be extended or may eventually lead to a full time position in our fun and exciting startup. Must be able to travel to client meetings and work independently. Responsibilities: - Subject Matter Expert on software tools used in the entire data migration process from extraction to validation and load - Design, develop and execute quality data movement processes that are consistent, repeatable and scalable - Streamline testing, audit and validation processes through data scrubbing routines and presentation of audit reports prior to load - Roll out newly developed processes via documentation and training - Maintain and manage a template library of executed solutions to leverage against future opportunities - Identify, clarify, and resolve issues and risks, escalating them as needed - Build and nourish strong business relationships with external clients Please include: - Salary Requirements - Availability Experience - At least 3-5 years experience in the development of java applications - Use of XML and other protocols for data exchange between systems - SQL database design and implementation - Experience with Eclipse, Maven, and SVN a plus - Experience with Htable and Hadoop a big plus - Excellent communication skills with both technical and non-technical colleagues - Upper management and client facing skills - Interest in keeping up with technology advances PLEASE NOTE: US citizens and Green Card Holders and those authorized to work in the US only. We are unable to sponsor or transfer H-1B candidates. Contact: Alex Levin, COO Brilig ale...@brilig.com -- *** The 'Last-Chance' Architect www.galatea.com (US) +1 303 731 3116 (UK) +44 20 8144 4367 ***
Error in FileSystem.get()
Hey Folks, I am seeing a very weird problem in FileSystem.get(Configuration). I want to get a FileSystem given the configuration, so I am using Configuration conf = new Configuration(); _fs = FileSystem.get(conf); The problem is I am getting LocalFileSystem on some machines and DistributedFileSystem on others. I am printing conf.get("fs.default.name") in all places and it returns the right HDFS value 'hdfs://dummy:9000'. My expectation is that, going by fs.default.name, if it is hdfs:// it should always give me a DistributedFileSystem. Best Bhupesh
Re: Error in FileSystem.get()
Each node reads its own conf files (mapred-site.xml, hdfs-site.xml etc.) Make sure your configs are consistent on all nodes across entire cluster and are pointing to correct fs. Hope it helps, Ashutosh On Thu, Oct 15, 2009 at 16:36, Bhupesh Bansal bban...@linkedin.com wrote: Hey Folks, I am seeing a very weird problem in FileSystem.get(Configuration). I want to get a FileSystem given the configuration, so I am using Configuration conf = new Configuration(); _fs = FileSystem.get(conf); The problem is I am getting LocalFileSystem on some machines and Distributed on others. I am printing conf.get(fs.default.name) at all places and It returns the right HDFS value 'hdfs://dummy:9000' My expectation is looking at fs.default.name if it is hdfs:// it should give me a DistributedFileSystem always. Best Bhupesh
Re: Error in FileSystem.get()
This code is not map/reduce code and runs only on a single machine. Also, each node prints the right value for fs.default.name, so it is reading the right configuration file too. The issue looks like the use of the CACHE in FileSystem, with someplace in my code setting up a wrong value, if that is possible. Best Bhupesh On 10/15/09 1:46 PM, Ashutosh Chauhan ashutosh.chau...@gmail.com wrote: Each node reads its own conf files (mapred-site.xml, hdfs-site.xml etc.) Make sure your configs are consistent on all nodes across entire cluster and are pointing to correct fs. Hope it helps, Ashutosh On Thu, Oct 15, 2009 at 16:36, Bhupesh Bansal bban...@linkedin.com wrote: Hey Folks, I am seeing a very weird problem in FileSystem.get(Configuration). I want to get a FileSystem given the configuration, so I am using Configuration conf = new Configuration(); _fs = FileSystem.get(conf); The problem is I am getting LocalFileSystem on some machines and Distributed on others. I am printing conf.get(fs.default.name) at all places and It returns the right HDFS value 'hdfs://dummy:9000' My expectation is looking at fs.default.name if it is hdfs:// it should give me a DistributedFileSystem always. Best Bhupesh
Map Reduce code doubt
Hello All, I was wondering whether our map reduce code can just return the location of a file, or place the actual file in a given output directory, by searching based on a keyword. Let me make myself clear: if there are 100 image files in my HDFS and I want to extract one image file, and I give a keyword which matches the name of a file in my HDFS, will my mapper and reducer code be able to locate that file and put the original file back in a given location? Please let me know if you need more information. Thanks -- View this message in context: http://www.nabble.com/Map-Recude-code-doubt-tp25916994p25916994.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Map Reduce code doubt
Shwitzu, Why can't you just query through a FileSystem object and find the file you want? Lajos shwitzu wrote: Hello All, I was wondering if our map reduce code can just return the location of the file? Or place the actual file in a given output directory by searching based on a keyword. Let me make myself clear If there are 100 image files in my HDFS and I want to extract one image file. If I give a keyword which matches the name of a file in my HDFS will my mapper and reducer code be able to locate that file and put back the original file in a given location?? Please let me know if you have need more information. Thanks -- *** The 'Last-Chance' Architect www.galatea.com (US) +1 303 731 3116 (UK) +44 20 8144 4367 ***
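To make that suggestion concrete, here is a minimal Java sketch of the FileSystem-based lookup Lajos is describing. The directory, keyword and local destination are invented for illustration, and a real service would probably match against whatever metadata is kept rather than just the file name:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class KeywordFileFetcher {
    public static void main(String[] args) throws IOException {
        String keyword = args[0];                                  // e.g. "sunset"
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // List the (hypothetical) image directory and match the keyword against file names.
        FileStatus[] candidates = fs.listStatus(new Path("/data/images"));
        for (FileStatus stat : candidates) {
            if (stat.getPath().getName().contains(keyword)) {
                // Copy the matching file out of HDFS to a local destination.
                fs.copyToLocalFile(stat.getPath(), new Path("/tmp/" + stat.getPath().getName()));
                System.out.println("Fetched " + stat.getPath());
                break;
            }
        }
        fs.close();
    }
}

For 100-odd files a simple listing like this is cheap; no MapReduce job is needed just to locate and return one file.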
Re: Error in FileSystem.get()
Bhupesh: If you use FileSystem.newInstance(), does that return the correct object type? This sidesteps CACHE. - A On Thu, Oct 15, 2009 at 3:07 PM, Bhupesh Bansal bban...@linkedin.comwrote: This code is not map/reduce code and run only on single machine and Also each node prints the right value for fs.default.name so it is Reading the right configuration file too .. The issue looks like use of CACHE in filesystem and someplace my code is setting up a wrong value If that is possible. Best Bhupesh On 10/15/09 1:46 PM, Ashutosh Chauhan ashutosh.chau...@gmail.com wrote: Each node reads its own conf files (mapred-site.xml, hdfs-site.xml etc.) Make sure your configs are consistent on all nodes across entire cluster and are pointing to correct fs. Hope it helps, Ashutosh On Thu, Oct 15, 2009 at 16:36, Bhupesh Bansal bban...@linkedin.com wrote: Hey Folks, I am seeing a very weird problem in FileSystem.get(Configuration). I want to get a FileSystem given the configuration, so I am using Configuration conf = new Configuration(); _fs = FileSystem.get(conf); The problem is I am getting LocalFileSystem on some machines and Distributed on others. I am printing conf.get(fs.default.name) at all places and It returns the right HDFS value 'hdfs://dummy:9000' My expectation is looking at fs.default.name if it is hdfs:// it should give me a DistributedFileSystem always. Best Bhupesh
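A small sketch of the two checks being discussed (the hdfs://dummy:9000 URI is just the placeholder value from the original post): ask for the filesystem behind an explicit URI, and use FileSystem.newInstance() as Allen suggests to bypass the shared cache, then print the concrete class that comes back on each machine.

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsDebug {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));

        // Ask for the filesystem behind an explicit URI, independent of the configured default.
        FileSystem viaUri = FileSystem.get(URI.create("hdfs://dummy:9000"), conf);
        System.out.println("via URI:     " + viaUri.getClass().getName());

        // Bypass the shared cache, as suggested above.
        FileSystem fresh = FileSystem.newInstance(conf);
        System.out.println("newInstance: " + fresh.getClass().getName());
        fresh.close();   // newInstance() objects are not shared, so close them explicitly
    }
}

Comparing the two printouts on a machine that misbehaves should show whether the wrong object is coming out of the cache or out of the configuration itself.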
Re: type mismatch, newbie help
Hi, Unfortunately Mapper is now a class and from your call stack it seems that Mapper's default map implementation is being called (instead of the one you defined in your class), which is passing the LongWritable key to the collector. You should use @Override to have the compiler help you figure out why your map function doesn't have the same signature as the one defined in the base class. Best of luck. Ahad. On Thu, Oct 15, 2009 at 10:29 AM, yz5od2 woods5242-outdo...@yahoo.comwrote: Hi, I am new to Hadoop so this might be an easy question for someone to help me with. I continually am getting this exception (my code follows below) java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:807) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:504) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) Running 0.20.1 I have a text file with lines of data separated by carriage returns. This is properly stored in a directory within HDFS. I only have a Mapping task for processing this file, after the mapping is done it should go straight to output, No reduce or combiner functions. I am just trying to test to see if this will run. The mapper just takes the data sent to it and adds it back to the collector as 2 text values. My Mapper: public class MyTestMapper extends MapperLongWritable,Text,Text,Text { public void map(Object key, Text value, Context context) throws IOException, InterruptedException { String key = key.toString(); String line = value.toString(); String id = extractId(line); String reformattedLine = reformatLine(line); context.write(new Text(id), new Text(reformattedLine)); } } My job submission code: - Job job = new Job(conf); job.setJarByClass(MyTestMapper.class); job.setMapperClass(MyTestMapper.class); FileInputFormat.addInputPath(job, new Path(/myDir/sample.txt)); FileOutputFormat.setOutputPath(job, new Path(/myDir/output/results-+System.currentTimeMillis()+.txt)); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(Text.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(Text.class); job.submit();
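A sketch of the fix Ahad describes: declare the key parameter as LongWritable so the method signature matches the Mapper<LongWritable, Text, Text, Text> declaration and really overrides the base class map() instead of overloading it, and let @Override catch any mismatch at compile time. The original extractId and reformatLine helpers were not posted, so they are stubbed here purely for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyTestMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Stand-ins for the poster's extractId / reformatLine helpers.
        String id = line.isEmpty() ? "" : line.split("\\s+", 2)[0];
        String reformattedLine = line.trim();
        context.write(new Text(id), new Text(reformattedLine));
    }
}

Since this is intended to be a map-only job, calling job.setNumReduceTasks(0) in the driver also makes the mapper output go straight to the output format, matching the "no reduce or combiner" intent in the original post.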