How to select random n records using mapreduce ?
Hi all, I'd like to select a random N records from a large amount of data using Hadoop; I just wonder how I can achieve this. Currently my idea is to let each mapper task select N / mapper_number records. Does anyone have such experience? -- Best Regards Jeff Zhang
Re: Comparing two logs, finding missing records
I believe you meant, SELECT * FROM LOG1 LEFT OUTER JOIN LOG2 ON LOG1.recordid = LOG2.recordid WHERE LOG2.recordid is null. (This would produce the set of records in LOG1 which are not present in LOG2.) In PIG, we have to add an additional filter with the is null condition. ~Rajesh.B On Mon, Jun 27, 2011 at 6:34 AM, Bharath Mundlapudi bharathw...@yahoo.comwrote: SQL: SELECT * FROM LOG1 LEFT OUTER JOIN LOG2 ON LOG1.recordid = LOG2.recordid; PIG: data = JOIN LOG1 BY recordid LEFT OUTER, LOG2 BY recordid; DUMP data; If you need more PIG help, please post to the PIG email alias. -Bharath From: Mark Kerzner markkerz...@gmail.com To: common-user@hadoop.apache.org; Bharath Mundlapudi bharathw...@yahoo.com Sent: Sunday, June 26, 2011 5:50 PM Subject: Re: Comparing two logs, finding missing records Bharath, what would a Pig query look like? Thank you, Mark On Sun, Jun 26, 2011 at 5:12 PM, Bharath Mundlapudi bharathw...@yahoo.com wrote: If you have a SerDe or PigLoader for your log format, Pig or Hive will probably be a quicker solution with the join. -Bharath From: Mark Kerzner markkerz...@gmail.com To: Hadoop Discussion Group core-u...@hadoop.apache.org Sent: Saturday, June 25, 2011 9:39 PM Subject: Comparing two logs, finding missing records Hi, I have two logs which should have all the records for the same record_id; in other words, if a record_id is found in the first log, it should also be found in the second one. However, I suspect that the second log is filtered, and I need to find the missing records. Anything is allowed: MapReduce job, Hive, Pig, and even a NoSQL database. Thank you. It is also a good time to express my thanks to all the members of the group who are always very helpful. Sincerely, Mark -- ~Rajesh.B
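Putting Rajesh's point together with Bharath's join, the full Pig version would look roughly like this (a sketch; the alias and field names are assumed, and `::` is Pig's disambiguation operator for the joined schema):

```pig
-- left outer join keeps every LOG1 row, with nulls where LOG2 has no match
joined  = JOIN LOG1 BY recordid LEFT OUTER, LOG2 BY recordid;

-- the extra filter Rajesh mentions: keep only rows missing from LOG2
missing = FILTER joined BY LOG2::recordid IS NULL;
DUMP missing;
```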
tar or hadoop archive
We use hadoop/hdfs to archive data. I archive a lot of files by creating one large tar file and then placing it in HDFS. Is it better to use a hadoop archive for this, or is it essentially the same thing? -- --- Get your facts first, then you can distort them as you please.--
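One practical difference worth noting: a HAR keeps the individual files addressable through the har:// filesystem, whereas a tar blob must be copied back out and untarred. A sketch of the commands (paths are hypothetical; creating the archive runs a MapReduce job):

```shell
# pack the files under /user/me/input into one archive
hadoop archive -archiveName logs.har -p /user/me input /user/me/archives

# the archived files remain listable and readable in place
hadoop fs -ls har:///user/me/archives/logs.har/input
```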
RE: Queue support from HDFS
Saumitra, Two questions come to mind that could help you narrow down a solution: 1) How quickly do the downstream processes need the transformed data? Reason: If you can delay the processing for a period of time, enough to batch the data into a blob that is a multiple of your block size, then you are obviously going to be working more towards the strong suit of vanilla MR. 2) What else will be running on the cluster? Reason: If this is primarily set up for this use case, then how often it runs / what resources it consumes when it does only needs to be optimized if it can't process them fast enough. If it is not, then you could always set up a separate pool for this in the fairscheduler and allow it to use a certain amount of overhead on the cluster when these events are being generated. Outside of the fact that you would have a lot of small files on the cluster (which can be resolved by running a nightly job to blob them and then delete the originals), I am not sure I would be too concerned about at least trying out this method. It would be helpful to know the size and type of data coming in, as well as what type of operation you are looking to do, if you would like a more concrete suggestion. Log data is a prime example of this type of workflow, and there are many suggestions out there as well as projects that attempt to address it (e.g. Chukwa). HTH, Matt -Original Message- From: saumitra.shahap...@gmail.com [mailto:saumitra.shahap...@gmail.com] On Behalf Of Saumitra Shahapure Sent: Friday, June 24, 2011 12:12 PM To: common-user@hadoop.apache.org Subject: Queue support from HDFS Hi, Is a queue-like structure supported by HDFS, where a stream of data is processed as it's generated? Specifically, I will have a stream of data coming in, and a data-independent operation needs to be applied to it (so only a Map function; the reducer is identity). I wish to distribute data among nodes using HDFS and start processing it as it arrives, preferably in a single MR job. 
I agree that it can be done by starting a new MR job for each batch of data, but is starting many MR jobs frequently for small data chunks a good idea? (Consider that a new batch arrives every few seconds and processing one batch takes a few minutes.) Thanks, -- Saumitra S. Shahapure This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled to receive such information. If you have received this e-mail in error, please notify the sender immediately. Please delete it and all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited. All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of Viruses or other Malware. Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying this e-mail or any attachment. The information contained in this email may be subject to the export control laws and regulations of the United States, potentially including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of Treasury, Office of Foreign Asset Controls (OFAC). As a recipient of this information you are obligated to comply with all applicable U.S. export laws and regulations.
Re: error in reduce task
On 24/06/11 18:16, Niels Boldt wrote: Hi, I'm running nutch in a pseudo cluster, i.e. all daemons are running on the same server. I'm writing to the hadoop list, as it looks like a problem related to hadoop. Some of my jobs partially fail, and in the error log I get output like: 2011-06-24 08:45:05,765 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201106231520_0190_r_00_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts) 2011-06-24 08:45:05,771 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201106231520_0190_r_00_0 copy failed: attempt_201106231520_0190_m_00_0 from worker1 2011-06-24 08:45:05,772 WARN org.apache.hadoop.mapred.ReduceTask: java.net.UnknownHostException: worker1 The above basically says that my worker is unknown, but I can't really make any sense of it. Other jobs running before, at the same time, or after complete fine without any error messages and without any changes on the server. Other reduce tasks in the same run have also succeeded. So it looks like my worker sometimes 'disappears' and can't be reached. If the worker had disappeared off the net, you'd be more likely to see a NoRouteToHost. My current theory is that it only happens when there are a couple of jobs running at the same time. Is that a plausible explanation? Would anybody have some suggestions for how I could get more information from the system, or point me in a direction where I should look? (I'm also quite new to hadoop) I'd assume that one machine in the cluster doesn't have an /etc/hosts entry for worker1, or that the DNS server is suffering under load. If you can, put the host list into the /etc/hosts table instead of relying on DNS. If you do it on all machines, it avoids having to work out which one is playing up. That said, some better logging of which host is trying to make the connection would be nice.
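Concretely, the suggested fix is an entry of this form in /etc/hosts on every machine in the cluster (the IP address here is only an example):

```
# /etc/hosts on every cluster machine -- bypasses DNS for worker lookups
192.168.1.12    worker1
```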
Re: Reading HDFS files via Spring
On Sun, 26 Jun 2011 17:34:34 -0700, Mark static.void@gmail.com wrote: Hello all, We have a recommendation system that reads in similarity data via a Spring context.xml as follows:

  <bean id="similarity" class="org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity">
    <constructor-arg value="/var/data/similarity.data"/>
  </bean>

Is it possible to use Hadoop/HDFS with Spring? We would love to be able to use something like:

  <constructor-arg value="hdfs://user/mark/similarity.data"/>

Can this (easily) be accomplished? I didn't have to do quite the same thing, but I was trying to load an ApplicationContext using Spring bean files kept in HDFS. It was pretty straightforward to throw together an HDFSXMLApplicationContext class (and some necessary supporting classes), so I'd be surprised if it would be hard to tweak other Spring classes similarly. In this case, though, it looks like your problem isn't actually with Spring so much as it is with the FileItemSimilarity class; it doesn't have a constructor which takes a Path argument. You might be able to extend that class and add the kind of constructor you want to use, though.
Re: Computing overlap of two files with hadoop
Hi, I have posted the question to stackoverflow, where I have also clarified my problem a bit. If you have a solution, please respond there (if it's not too much of a hassle): http://stackoverflow.com/questions/6469171/computing-set-intersection-and-set-difference-of-the-records-of-two-files-with-ha Best regards, Claus On 06/24/2011 12:44 PM, Claus Stadler wrote: Hi, My problem is as follows: I have two input files, and I want to determine a) The number of lines which only occur in file 1 b) The number of lines which only occur in file 2 c) The number of lines common to both (e.g. in regard to string equality) Example: File 1: a b c File 2: a d Desired output for each case: lines_only_in_1: 2 (b, c) lines_only_in_2: 1 (d) lines_in_both: 1 (a) Basically my approach is as follows: I wrote my own LineRecordReader, so that the mapper receives a pair consisting of the line (text) and a byte indicating the source file (either 0 or 1). The mapper only returns the pair again, so actually it does nothing. However, the side effect is that the combiner receives a Map<Line, Iterable<SourceId>> (where SourceId is either 0 or 1). Now, for each line I can get the set of sources it appears in. Therefore, I could write a combiner that counts, for each case (a, b, c), the number of lines (Listing 1). The combiner then outputs a 'summary' only on cleanup (is that safe?). So this summary looks like: in_a_distinct_count_total 7531 in_b_distinct_count_total 3190 out_common_distinct_count_total 901 In the reducer I then only sum up the values for these summaries. However, the main problem is that I need to treat both source files as a single virtual file which yields records of the form (line, sourceId) // sourceId either 0 or 1 And I am not sure how to achieve that. So the question is whether I can avoid preprocessing and merging the files beforehand, and do that on-the-fly with something like a virtually-merged-file reader and custom record reader. Any code example is much appreciated. 
Best regards, Claus

Listing 1:

public static class SourceCombiner extends Reducer<Text, ByteWritable, Text, LongWritable> {
    private long countA = 0;
    private long countB = 0;
    private long countC = 0; // C = lines (c)ommon to both sources

    @Override
    public void reduce(Text key, Iterable<ByteWritable> values, Context context)
            throws IOException, InterruptedException {
        Set<Byte> fileIds = new HashSet<Byte>();
        for (ByteWritable val : values) {
            byte fileId = val.get();
            fileIds.add(fileId);
        }
        if (fileIds.contains((byte) 0)) { ++countA; }
        if (fileIds.contains((byte) 1)) { ++countB; }
        if (fileIds.size() >= 2) { ++countC; }
    }

    @Override
    protected void cleanup(Context context) throws java.io.IOException, java.lang.InterruptedException {
        context.write(new Text("in_a_distinct_count_total"), new LongWritable(countA));
        context.write(new Text("in_b_distinct_count_total"), new LongWritable(countB));
        context.write(new Text("out_common_distinct_count_total"), new LongWritable(countC));
    }
}
RE: How to select random n records using mapreduce ?
I did something similar. Basically I had a random sampling algorithm that I called from the mapper. If it returned true I would collect the data, otherwise I would discard it. Bill -Original Message- From: ni...@basj.es [mailto:ni...@basj.es] On Behalf Of Niels Basjes Sent: Monday, June 27, 2011 3:29 PM To: mapreduce-u...@hadoop.apache.org Cc: core-u...@hadoop.apache.org Subject: Re: How to select random n records using mapreduce ? The only solution I can think of is creating a counter in Hadoop that is incremented each time a mapper lets a record through. As soon as the value reaches a preselected value, the mappers simply discard the additional input they receive. Note that this will not be at all random, yet it's the best I can come up with right now. HTH On Mon, Jun 27, 2011 at 09:11, Jeff Zhang zjf...@gmail.com wrote: Hi all, I'd like to select a random N records from a large amount of data using Hadoop; I just wonder how I can achieve this. Currently my idea is to let each mapper task select N / mapper_number records. Does anyone have such experience? -- Best Regards Jeff Zhang -- Best regards / Met vriendelijke groeten, Niels Basjes
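Bill's mapper-side filter can be sketched in plain Java (the class and method names here are mine, not from the thread; in a real job the test would sit inside map() and the kept records would go to context.write()):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the approach: keep each record with a fixed probability.
// Note this yields *approximately* N records overall, not exactly N.
class BernoulliSample {
    static List<String> sample(List<String> records, double rate, long seed) {
        Random rng = new Random(seed);                // seeded for reproducibility
        List<String> kept = new ArrayList<String>();
        for (String record : records) {
            if (rng.nextDouble() < rate) {            // "sampling algorithm returned true"
                kept.add(record);                     // in a mapper: context.write(...)
            }
        }
        return kept;
    }
}
```

To land near N records from M total, each mapper would use rate = N / M, which requires knowing (or estimating) the input size up front.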
RE: Performance Tuning
If you are running default configurations then you are only getting 2 mappers and 1 reducer per node. The rule of thumb I have gone by (and backed up by the definitive guide) is 2 processes per core, so: tasktracker/datanode and 6 slots left. How you break it up from there is your call, but I would suggest either 4 mappers / 2 reducers or 5 mappers / 1 reducer. Check out the below configs for details on what you are *most likely* running currently: http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html http://hadoop.apache.org/common/docs/r0.20.2/hdfs-default.html http://hadoop.apache.org/common/docs/r0.20.2/core-default.html HTH, Matt -Original Message- From: Juan P. [mailto:gordoslo...@gmail.com] Sent: Monday, June 27, 2011 2:50 PM To: common-user@hadoop.apache.org Subject: Performance Tuning I'm trying to run a MapReduce task against a cluster of 4 DataNodes with 4 cores each. My input data is 4GB in size and it's split into 100MB files. The current configuration is default, so the block size is 64MB. If I understand it correctly, Hadoop should be running 64 mappers to process the data. I'm running a simple data-counting MapReduce job and it's taking about 30 mins to complete. This seems like way too much, doesn't it? Is there any tuning you guys would recommend to try and see an improvement in performance? Thanks, Pony
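For reference, the 4 mappers / 2 reducers split suggested above would be set per node in mapred-site.xml roughly like this (a sketch; these are the 0.20.x property names, whose defaults are what give the out-of-the-box slot counts):

```xml
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
```

The tasktrackers need a restart after the change for the new slot counts to take effect.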
RE: How to select random n records using mapreduce ?
Wait - Habermaas like in Critical Theory -Original Message- From: Habermaas, William [mailto:william.haberm...@fatwire.com] Sent: Monday, June 27, 2011 2:55 PM To: common-user@hadoop.apache.org Subject: RE: How to select random n records using mapreduce ?
Re: How to select random n records using mapreduce ?
If the incoming data is unique you can create a hash of the data and then take a modulus of the hash to select a random set. So if you wanted 10% of the data randomly: hash % 10 == 0 gives a random 10%. On 6/27/11 12:54 PM, Habermaas, William william.haberm...@fatwire.com wrote: I did something similar. Basically I had a random sampling algorithm that I called from the mapper. If it returned true I would collect the data, otherwise I would discard it. Bill
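That hash-modulus rule as a small self-contained sketch (names are hypothetical). Because the decision depends only on the record's content, every mapper makes the same choice for a given record and the selection is reproducible across runs:

```java
// Deterministically keep roughly 1/k of unique records: a record is
// selected iff its (non-negative) hash falls into bucket 0 of k buckets.
class HashModSample {
    static boolean keep(String record, int k) {
        int h = record.hashCode() & Integer.MAX_VALUE; // clear the sign bit
        return h % k == 0;
    }
}
```

With k = 10 this keeps about 10% of the data, matching the hash % 10 == 0 rule above; unlike a true random sample, re-running the job selects the same records, and duplicate records are always selected together.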
Why I cannot see live nodes in a LAN-based cluster setup?
Hi Everyone: I am quite new to hadoop here. I am attempting to set up Hadoop locally on two machines, connected by LAN. Both of them pass the single-node test. However, I failed in the two-node cluster setup, as shown in the 2 cases below: 1) set one as a dedicated namenode and the other as a dedicated datanode 2) set one as both name- and data-node, and the other as just a datanode I launch start-dfs.sh on the namenode. Since I have all the ssh issues cleared, I can always observe the startup of the daemon on every datanode. However, the web UI at http://(URI of namenode):50070 shows only 0 live nodes for (1) and 1 live node for (2), which is the same as the output of the command-line hadoop dfsadmin -report. Generally it appears that from the namenode you cannot observe the remote datanode alive, let alone a normal across-node MapReduce execution. Could anyone give some hints / instructions at this point? I really appreciate it! Thanks. Best Regards Yours Sincerely Jingwei Lu
RE: Why I cannot see live nodes in a LAN-based cluster setup?
Did you make sure to define the datanode/tasktracker in the slaves file in your conf directory and push that to both machines? Also have you checked the logs on either to see if there are any errors? Matt -Original Message- From: Jingwei Lu [mailto:j...@ucsd.edu] Sent: Monday, June 27, 2011 3:24 PM To: HADOOP MLIST Subject: Why I cannot see live nodes in a LAN-based cluster setup?
Re: Why I cannot see live nodes in a LAN-based cluster setup?
Hi, I just manually modified the masters and slaves files on both machines. I found something wrong in the log files, as shown below:
-- Master: namenode.log:
2011-06-27 13:44:47,055 INFO org.mortbay.log: jetty-6.1.14
2011-06-27 13:44:47,394 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:50070
2011-06-27 13:44:47,395 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Web-server up at: 0.0.0.0:50070
2011-06-27 13:44:47,395 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2011-06-27 13:44:47,395 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 54310: starting
2011-06-27 13:44:47,396 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 54310: starting
... (handlers 1 through 9 start similarly) ...
2011-06-27 13:44:47,500 INFO org.apache.hadoop.ipc.Server: Error register getProtocolVersion
java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion
at org.apache.hadoop.metrics.util.MetricsRegistry.add(MetricsRegistry.java:53)
at org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:89)
at org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:99)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
2011-06-27 13:45:02,572 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 127.0.0.1:50010 storage DS-87816363-127.0.0.1-50010-1309207502566
-- Slave: datanode.log:
2011-06-27 13:45:00,335 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = hdl.ucsd.edu/127.0.0.1
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
2011-06-27 13:45:02,476 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 0 time(s).
... (retries 1 through 9 follow) ...
2011-06-27 13:45:11,679 INFO org.apache.hadoop.ipc.RPC: Server at hdl.ucsd.edu/127.0.0.1:54310 not available yet, Z... (just guess, is this
RE: Why I cannot see live nodes in a LAN-based cluster setup?
http://www.mentby.com/tim-robertson/error-register-getprotocolversion.html -Original Message- From: Jingwei Lu [mailto:j...@ucsd.edu] Sent: Monday, June 27, 2011 3:58 PM To: common-user@hadoop.apache.org Subject: Re: Why I cannot see live nodes in a LAN-based cluster setup?
RE: Why I cannot see live nodes in a LAN-based cluster setup?
As a follow-up to what Jeff posted: go ahead and ignore the message you got on the NN for now. If you look at the address that the DN log shows, it is 127.0.0.1, and the ip:port it is trying to connect to for the NN is 127.0.0.1:54310 --- it is trying to bind to itself as if it were still in single-machine mode. Make sure that you have correctly pushed the URI for the NN into the config files on both machines and then bounce DFS.

Matt

-Original Message-
From: jeff.schm...@shell.com [mailto:jeff.schm...@shell.com]
Sent: Monday, June 27, 2011 4:08 PM
To: common-user@hadoop.apache.org
Subject: RE: Why I cannot see live nodes in a LAN-based cluster setup?

http://www.mentby.com/tim-robertson/error-register-getprotocolversion.html

-Original Message-
From: Jingwei Lu [mailto:j...@ucsd.edu]
Sent: Monday, June 27, 2011 3:58 PM
To: common-user@hadoop.apache.org
Subject: Re: Why I cannot see live nodes in a LAN-based cluster setup?

Hi, I just manually modified the masters and slaves files on both machines.
I found something wrong in the log files, as shown below:

--
Master: namenode.log:

2011-06-27 13:44:47,055 INFO org.mortbay.log: jetty-6.1.14
2011-06-27 13:44:47,394 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:50070
2011-06-27 13:44:47,395 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Web-server up at: 0.0.0.0:50070
2011-06-27 13:44:47,395 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2011-06-27 13:44:47,395 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 54310: starting
2011-06-27 13:44:47,396 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 54310: starting
2011-06-27 13:44:47,397 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 54310: starting
2011-06-27 13:44:47,397 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 54310: starting
2011-06-27 13:44:47,397 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 54310: starting
2011-06-27 13:44:47,402 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 54310: starting
2011-06-27 13:44:47,404 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 54310: starting
2011-06-27 13:44:47,406 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 54310: starting
2011-06-27 13:44:47,406 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 54310: starting
2011-06-27 13:44:47,406 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 54310: starting
2011-06-27 13:44:47,408 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310: starting
2011-06-27 13:44:47,500 INFO org.apache.hadoop.ipc.Server: Error register getProtocolVersion
java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion
        at org.apache.hadoop.metrics.util.MetricsRegistry.add(MetricsRegistry.java:53)
        at org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:89)
        at org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:99)
[...]
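For reference, "pushing the URI for the NN into the config files" on a 0.20.x cluster means setting fs.default.name in conf/core-site.xml on every machine. A minimal sketch, using the master hostname and port that appear in the logs on this thread (substitute your own NameNode host):

```xml
<!-- conf/core-site.xml on both master and slave (sketch; hostname is an example) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- must be the master's real hostname or IP, not localhost,
         or the DataNode will try to connect to itself -->
    <value>hdfs://clock.ucsd.edu:54310</value>
  </property>
</configuration>
```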
Re: Performance Tuning
Matt, thanks for your help! I think I get it now, but this part is a bit confusing:

"so: tasktracker/datanode and 6 slots left. How you break it up from there is your call but I would suggest either 4 mappers / 2 reducers or 5 mappers / 1 reducer."

If it's 2 processes per core, then it's: 4 Nodes * 4 Cores/Node * 2 Processes/Core = 32 Processes Total. So my mapred-site.xml should include these props:

<property>
  <name>mapred.map.tasks</name>
  <value>28</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>

Is that correct?

On Mon, Jun 27, 2011 at 4:59 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote:

If you are running default configurations then you are only getting 2 mappers and 1 reducer per node. The rule of thumb I have gone on (and backed up by the definitive guide) is 2 processes per core, so: tasktracker/datanode and 6 slots left. How you break it up from there is your call, but I would suggest either 4 mappers / 2 reducers or 5 mappers / 1 reducer.

Check out the configs below for details on what you are *most likely* running currently:
http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html
http://hadoop.apache.org/common/docs/r0.20.2/hdfs-default.html
http://hadoop.apache.org/common/docs/r0.20.2/core-default.html

HTH, Matt

-Original Message-
From: Juan P. [mailto:gordoslo...@gmail.com]
Sent: Monday, June 27, 2011 2:50 PM
To: common-user@hadoop.apache.org
Subject: Performance Tuning

I'm trying to run a MapReduce task against a cluster of 4 DataNodes with 4 cores each. My input data is 4GB in size and it's split into 100MB files. The current configuration is the default, so the block size is 64MB. If I understand it correctly, Hadoop should be running 64 Mappers to process the data. I'm running a simple data-counting MapReduce and it's taking about 30 mins to complete. This seems like way too much, doesn't it?
Is there any tuning you guys would recommend to try and see an improvement in performance?

Thanks,
Pony

This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled to receive such information. If you have received this e-mail in error, please notify the sender immediately. Please delete it and all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited. All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of Viruses or other Malware. Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying this e-mail or any attachment. The information contained in this email may be subject to the export control laws and regulations of the United States, potentially including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of Treasury, Office of Foreign Asset Controls (OFAC). As a recipient of this information you are obligated to comply with all applicable U.S. export laws and regulations.
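A note on Matt's "4 mappers / 2 reducers" advice: those are per-node task slots, which are configured on each TaskTracker rather than per job (mapred.map.tasks is only a hint to the framework). A minimal sketch of that slot layout using the 0.20.x property names — the 4/2 split is his suggestion, not a measured tuning:

```xml
<!-- conf/mapred-site.xml on each worker node (sketch of the suggested 4/2 split) -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value> <!-- concurrent map slots per node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value> <!-- concurrent reduce slots per node -->
  </property>
</configuration>
```

The TaskTrackers must be restarted for slot changes to take effect.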
Re: Why I cannot see live nodes in a LAN-based cluster setup?
Hi Matt and Jeff:

Thanks a lot for your instructions. I corrected the mistakes in the conf files of the DN, and now the log on the DN becomes:

2011-06-27 15:32:36,025 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 0 time(s).
2011-06-27 15:32:37,028 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 1 time(s).
2011-06-27 15:32:38,031 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 2 time(s).
2011-06-27 15:32:39,034 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 3 time(s).
2011-06-27 15:32:40,037 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 4 time(s).
2011-06-27 15:32:41,040 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 5 time(s).
2011-06-27 15:32:42,043 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 6 time(s).
2011-06-27 15:32:43,046 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 7 time(s).
2011-06-27 15:32:44,049 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 8 time(s).
2011-06-27 15:32:45,052 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 9 time(s).
2011-06-27 15:32:45,053 INFO org.apache.hadoop.ipc.RPC: Server at clock.ucsd.edu/132.239.95.91:54310 not available yet, Z...

It seems the DN is trying to connect to the NN but always fails...
Best Regards
Yours Sincerely
Jingwei Lu

On Mon, Jun 27, 2011 at 2:22 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote:
[...]
Re: tar or hadoop archive
Yes, you can see a picture describing HAR files in this old blog post: http://www.cloudera.com/blog/2009/02/the-small-files-problem/

-Joey

On Mon, Jun 27, 2011 at 4:36 PM, Rita rmorgan...@gmail.com wrote:

So, it keeps an index of the files?

On Mon, Jun 27, 2011 at 10:10 AM, Joey Echeverria j...@cloudera.com wrote:

The advantage of a hadoop archive file is that it lets you access the files stored in it directly. For example, if you archived three files (a.txt, b.txt, c.txt) in an archive called foo.har, you could cat one of the three files using the hadoop command line:

hadoop fs -cat har:///user/joey/out/foo.har/a.txt

You can also copy files out of the archive or use files in the archive as input to MapReduce jobs.

-Joey

On Mon, Jun 27, 2011 at 3:06 AM, Rita rmorgan...@gmail.com wrote:

We use hadoop/hdfs to archive data. I archive a lot of files by creating one large tar file and then placing it in hdfs. Is it better to use a hadoop archive for this, or is it essentially the same thing?

--
"Get your facts first, then you can distort them as you please."

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
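For completeness, an archive like the foo.har in Joey's example is built with the hadoop archive tool, which runs a MapReduce job to write the part files and the index. A sketch, with illustrative paths — the exact flags vary a little between versions (some releases take a -p parent-directory option), so check the usage output of hadoop archive on your release:

```shell
# Create foo.har from the files under /user/joey/in (paths are examples)
hadoop archive -archiveName foo.har /user/joey/in /user/joey/out

# List and read files through the har:// filesystem
hadoop fs -ls har:///user/joey/out/foo.har
hadoop fs -cat har:///user/joey/out/foo.har/a.txt
```

Unlike a tar file in HDFS, the HAR index is what lets individual files be addressed directly without unpacking the whole archive.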
Re: Performance Tuning
Ok, so I tried putting the following config in the mapred-site.xml of all of my nodes:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>name-node:54311</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
</configuration>

but when I start a new job it gets stuck at:

11/06/28 03:04:47 INFO mapred.JobClient: map 0% reduce 0%

Any thoughts? Thanks for your help guys!

On Mon, Jun 27, 2011 at 7:33 PM, Juan P. gordoslo...@gmail.com wrote:
[...]
RE: Why I cannot see live nodes in a LAN-based cluster setup?
At this point, if that is the correct IP, then I would see if you can actually ssh from the DN to the NN to make sure it can connect to the other box. If you can successfully connect through ssh, then it's just a matter of figuring out why that port is having issues (netstat is your friend in this case). If you see it listening on 54310, then just power-cycle the box and try again.

Matt

-Original Message-
From: Jingwei Lu [mailto:j...@ucsd.edu]
Sent: Monday, June 27, 2011 5:38 PM
To: common-user@hadoop.apache.org
Subject: Re: Why I cannot see live nodes in a LAN-based cluster setup?

Hi Matt and Jeff:
Thanks a lot for your instructions. I corrected the mistakes in the conf files of the DN, and now the log on the DN becomes:
[...]
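The checks Matt suggests can be sketched as a couple of commands, using the hostname and port from the logs on this thread (substitute your own; netstat flags shown are the Linux net-tools ones and may differ on other platforms):

```shell
# From the DataNode box: confirm you can reach the NameNode box at all
ssh clock.ucsd.edu true

# On the NameNode box: is anything listening on the IPC port, and on which interface?
netstat -tln | grep 54310
# A local address of 127.0.0.1:54310 would mean the NN bound to loopback only;
# it needs to listen on an address the DN can reach (e.g. 132.239.95.91 or 0.0.0.0)
```

If ssh works but the port check shows nothing listening (or listening only on loopback), the problem is the NN's fs.default.name binding or a firewall, not the DN.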