Re: Hadoop 3.0 doesn't detect the correct conf folder
On 2017-12-21 00:25, Jeff Zhang wrote: > I tried the hadoop 3.0, and can start dfs properly, but when I start yarn, > it fails with the following error > ERROR: Cannot find configuration directory > "/Users/jzhang/Java/lib/hadoop-3.0.0/conf" > > Actually, this is not the correct conf folder. It should be > /Users/jzhang/Java/lib/hadoop-3.0.0/etc/hadoop. hdfs could detect that > correctly, but it seems yarn couldn't; might be something wrong in the yarn > startup script.

This kind of weirdness is indicative of having YARN_CONF_DIR defined somewhere.
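A quick way to check, as a minimal sketch using the paths from the report above (the exact profile files vary by setup):

  # see whether something is exporting the stale location
  env | grep -E 'YARN_CONF_DIR|HADOOP_CONF_DIR'
  grep -R 'YARN_CONF_DIR' ~/.bashrc ~/.profile /Users/jzhang/Java/lib/hadoop-3.0.0/etc/hadoop/*.sh 2>/dev/null

  # in 3.x, one variable is enough; drop the legacy one
  unset YARN_CONF_DIR
  export HADOOP_CONF_DIR=/Users/jzhang/Java/lib/hadoop-3.0.0/etc/hadoop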
Re: v2.8.0: Setting PID dir EnvVars: libexec/thing-config.sh or etc/hadoop/thing-env.sh ?
On 2017-06-14 18:50 (-0700), Kevin Buckley wrote: > So, > > is there a "to be preferred" file choice, between those two, > within which to set certain classes of EnvVars ?

2.x releases require a significant amount of duplication of environment variable settings. The YARN subsystem does not read from hadoop-env.sh at all. Because of this limitation, you'll need to configure HADOOP_PID_DIR in hadoop-env.sh, YARN_PID_DIR in yarn-env.sh, and HADOOP_MAPRED_PID_DIR in mapred-env.sh. The same is true for LOG_DIR, _OPT, and a few others. FWIW, this is one of the areas that got cleaned up in the massive shell script rewrite in 3.x. For example, setting HADOOP_PID_DIR in hadoop-env.sh will propagate to all of the other services.
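To illustrate the duplication (a sketch; the PID directory is just an example value):

  # 2.x -- the same intent, repeated once per subsystem:
  # etc/hadoop/hadoop-env.sh
  export HADOOP_PID_DIR=/var/run/hadoop
  # etc/hadoop/yarn-env.sh
  export YARN_PID_DIR=/var/run/hadoop
  # etc/hadoop/mapred-env.sh
  export HADOOP_MAPRED_PID_DIR=/var/run/hadoop

  # 3.x -- one line in hadoop-env.sh propagates to all services:
  export HADOOP_PID_DIR=/var/run/hadoop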
Re: Surprisingly, RAID0 provides best io performance whereas no RAID the worst
On 2016-07-30 20:12 (-0700), Shady Xu wrote: > Thanks Andrew, I know about the disk failure risk and that it's one of the > reasons why we should use JBOD. But JBOD provides worse performance than > RAID 0.

It's not about failure: it's about speed. RAID0 performance will drop like a rock if any one disk in the set is slow. When all the drives are performing at peak, yes, it's definitely faster. But over time, drive speed will decline (sometimes to half speed or less!), usually prior to a failure. This failure may take a while, so in the meantime your cluster is getting slower ... and slower ... and slower ... As a result, JBOD will be significantly faster over the _lifetime_ of the disks vs. a comparison made _today_.
Re: Surprisingly, RAID0 provides best io performance whereas no RAID the worst
On 2016-07-30 20:12 (-0700), Shady Xu wrote: > Thanks Andrew, I know about the disk failure risk and that it's one of the > reasons why we should use JBOD. But JBOD provides worse performance than > RAID 0. And take into account the fact that HDFS does have other > replications and it will make one more replication on another DataNode when > disk failure happens. So why should we sacrifice performance to prevent > data loss which can naturally be avoided by HDFS?
Re: set reduced block size for a specific file
On Aug 27, 2011, at 12:42 PM, Ted Dunning wrote: There is no way to do this for standard Apache Hadoop.

Sure there is. You can build a custom conf dir and point it to that. You *always* have that option for client-settable options as a workaround for lack of features/bugs.

1. Copy $HADOOP_CONF_DIR or $HADOOP_HOME/conf to a dir
2. Modify the hdfs-site.xml to have your new block size
3. Run the following: HADOOP_CONF_DIR=mycustomconf hadoop dfs -put file dir

Convenient? No. Doable? Definitely.
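To make step 2 concrete, the property to change in mycustomconf/hdfs-site.xml looks like this (0.20-era key name; the value is in bytes, 32 MB here purely as an example):

  <property>
    <name>dfs.block.size</name>
    <value>33554432</value>
  </property>

dfs.block.size is read client-side at file-creation time, which is why a per-user conf dir works at all.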
Re: hp sl servers with hadoop?
On Aug 24, 2011, at 7:42 AM, Koert Kuipers wrote: Does anyone have any experience using HP ProLiant SL hardware for hadoop?

Yes. We've got ~600 of them.

We are currently using DL 160 and DL 180 servers, and the SL hardware seems to fit the bill for our new servers in many ways. However, the shared power on the chassis is not something I am fully comfortable with yet.

The biggest problems that our hardware team reports are: a) lack of hot-swappable drives means a lot of manual work to replace bad disks b) the cabling is on the front, which is pretty much the opposite of how the rest of our gear works

Our people do tend to bring down both nodes when working on them. But I think that's more of a safety precaution than a requirement. But those events are relatively short-lived, so usually not a big deal WRT block replication.
Re: Simple Permissions Question
On Aug 24, 2011, at 5:25 PM, Time Less wrote: I'm at a loss here. Check out my groups and the directory permissions: [tellis@laxhadoop1-012 ~] :) groups tellis hdfs supergroup [tellis@laxhadoop1-012 ~] :) hadoop fs -mkdir /tmp/hive-tellis mkdir: org.apache.hadoop.security.AccessControlException: Permission denied: user=tellis, access=WRITE, inode=/tmp:hdfs:supergroup:drwxrwxr-x ... [hdfs@laxhadoop1-012 ~/tmp] :) hadoop fs -chmod 777 /tmp Then tellis has access to mkdir in /tmp. This tells me that you likely aren't getting your group information sent to Hadoop. I'm really confused. What am I missing? What version of Hadoop? Is this a secure version with security on? If you hop on the NameNode, what are your group settings? Certain versions only honor group information from the namenode. This is by design to prevent users lying to Hadoop about their identity. Also, if I help, do I get some free stuff in LoL? :)
Re: Hadoop cluster optimization
On Aug 22, 2011, at 3:00 AM, אבי ווקנין wrote: I assumed that the 1.7GB RAM will be the bottleneck in my environment; that's why I am trying to change it now. I shut down the 4 datanodes with 1.7GB RAM (Amazon EC2 small instance) and replaced them with 2 datanodes with 7.5GB RAM (Amazon EC2 large instance).

This should allow you to bump up the memory and/or increase the task count.

Is it OK that the datanodes are 64 bit while the namenode is still 32 bit?

I've run several instances where the NN was 64-bit and the DNs were 32-bit. I can't think of a reason the reverse wouldn't work. The only thing that is really going to matter is if they are the same CPU architecture (which, if you are running on EC2, will likely always be the case).

Based on the new hardware I'm using, are there any suggestions regarding the Hadoop configuration parameters?

It really depends upon how much memory you need per task. That's why the task spill rate is important... :)

One more thing, you asked: Are your tasks spilling? How can I check if my tasks are spilling?

Check the task logs. If you aren't spilling, then you'll likely want to match task count = core count - 1 unless memory is exhausted first (i.e., tasks * mem should be <= available mem).
Re: upgrade from Hadoop 0.21.0 to a newer version??
On Aug 21, 2011, at 11:00 PM, steven zhuang wrote: thanks Allen, I really wish there wasn't such a version 0.21.0. :) It is tricky (lots of config work), but you could always run the two versions in parallel on the same gear, distcp from 0.21 to 0.20.203, then shutdown the 0.21 instance.
Re: Hadoop cluster optimization
On Aug 21, 2011, at 7:17 PM, Michel Segel wrote: Avi, first why a 32-bit OS? You have a 64-bit processor that has 4 cores, hyper-threaded, which looks like 8 CPUs.

With only 1.7gb of mem, there likely isn't much of a reason to use a 64-bit OS. The machines (as you point out) are already tight on memory. 64-bit is only going to make it worse.

1.7 GB memory, 1 Intel(R) Xeon(R) CPU E5507 @ 2.27GHz, Ubuntu Server 10.10, 32-bit platform, Cloudera CDH3 manual Hadoop installation (for the ones who are familiar with Amazon Web Services, I am talking about Small EC2 Instances/Servers). Total job run time is +-15 minutes (+-50 files/blocks/mapTasks of up to 250 MB and 10 reduce tasks). Based on the above information, can anyone recommend a best-practice configuration?

How many spindles? Are your tasks spilling?

Do you think that when dealing with such a small cluster, and when processing such a small amount of data, it is even possible to optimize jobs so they would run much faster?

Most of the time, performance issues are with the algorithm, not Hadoop.
Re: upgrade from Hadoop 0.21.0 to a newer version??
On Aug 19, 2011, at 12:39 AM, steven zhuang wrote: I updated my hadoop cluster from 0.20.2 to higher version 0.21.0 because of MAPREDUCE-1286, and now I have problem running a Hbase on it. I saw the 0.21.0 version is marked as unstable, unsupported, does not include security. My question is that can I upgrade the cluster from 0.21.0 to a newer version say 0.20.203 or future version 0.22? You won't be able to side-grade to 0.20.203/204. You should be able to upgrade to 0.22, 0.23, etc. Just be sure to process your edits file *prior* to upgrading. Otherwise it might explode.
Re: Question regarding Capacity Scheduler
On Aug 17, 2011, at 10:53 AM, Matt Davies wrote: Hello, I'm playing around with the Capacity Scheduler (coming from the Fair Scheduler), and it appears that a queue with jobs submitted by the same user are treated as FIFO. So, for example, if I submit job1 and job2 to the low queue as matt job1 will run to completion before job2 starts. Is there a way to share the resources in the queue between the 2 jobs when they both come from the same user? No. You'll need a different queue.
Re: Hadoop Streaming 0.20.2 and how to specify number of reducers per node -- is it possible?
On Aug 17, 2011, at 12:36 AM, Steven Hafran wrote: after reviewing the hadoop docs, i've tried setting the following properties when starting my streaming job; however, they don't seem to have any impact. -jobconf mapred.tasktracker.reduce.tasks.maximum=1 tasktracker is the hint: that's a server side setting. You're looking for the mapred.reduce.tasks settings.
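For example, a sketch (the streaming jar path varies by install):

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -jobconf mapred.reduce.tasks=4 \
    -input in -output out \
    -mapper mymapper.py -reducer myreducer.py

This sets the total number of reduce tasks for the job; how they spread across nodes is the scheduler's business, not the job's.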
Re: Why hadoop should be built on JAVA?
On Aug 15, 2011, at 9:00 PM, Chris Song wrote: Why should hadoop be built in JAVA?

http://www.quora.com/Why-was-Hadoop-written-in-Java

How will it be if HADOOP is implemented in C or Python?

http://www.quora.com/Would-Hadoop-be-different-if-it-were-coded-in-C-C++-instead-of-Java-How
Re: Hadoop cluster network requirement
On Jul 31, 2011, at 12:08 PM, jonathan.hw...@accenture.com wrote: I was asked by our IT folks if we can put hadoop name node storage on a shared disk storage unit.

What do you mean by shared disk storage unit? There are lots of products out there that would claim this, so actual deployment semantics are important.

Does anyone have experience of how much IO throughput is required on the name nodes?

IO throughput is completely dependent upon how many changes are being applied to the file system and the frequency of edits log merging. In the majority of cases it is not much. What tends to happen where the storage is shared (such as a NAS) is that the *other* traffic blocks the writes for too long because it is overloaded, and the NN declares it dead.

What are the latency/data throughput requirements between the master and data nodes - can this tolerate network routing?

If you mean different data centers, then no. If you mean same data center, but with routers in between, then probably yes, but you add several more failure points, so this isn't recommended.

Did anyone publish any throughput requirement for the best network setup recommendation?

Not that I know of. It is very much dependent upon the actual workload being performed. But I wouldn't deploy anything slower than a 1:4 overcommit (uplink-to-host) on the DN side for anything real/significant.

This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the email by you is prohibited.

Lawyers are funny people. I wonder how much they got paid for this one.
Re: Moving Files to Distributed Cache in MapReduce
We really need to add a working example to the wiki and link it from the FAQ page. Any volunteers?

On Jul 29, 2011, at 7:49 PM, Michael Segel wrote: Here's the meat of my post earlier...

Sample code for putting a file on the cache:

  DistributedCache.addCacheFile(new URI(path + myFileName), conf);

Sample code for pulling data off the cache:

  Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
  boolean exitProcess = false;
  int i = 0;
  while (!exitProcess) {
      String fileName = localFiles[i].getName();
      if (fileName.equalsIgnoreCase("model.txt")) {
          // Build your input file reader on localFiles[i].toString()
          exitProcess = true;
      }
      i++;
  }

Note that this is SAMPLE code. It doesn't trap the exit condition if the file isn't there and you go beyond the size of the array localFiles[]. The loop tests exitProcess because it's easier to read as "do this loop until the condition exitProcess is true." When you build your file reader you need the full path, not just the file name; the path will vary when the job runs. HTH, -Mike

From: michael_se...@hotmail.com To: common-user@hadoop.apache.org Subject: RE: Moving Files to Distributed Cache in MapReduce Date: Fri, 29 Jul 2011 21:43:37 -0500 I could have sworn that I gave an example earlier this week on how to push and pull stuff from the distributed cache.

On Fri, Jul 29, 2011 at 2:51 PM, Roger Chen rogc...@ucdavis.edu wrote: JobConf is deprecated in 0.20.2, I believe; you're supposed to be using Configuration for that.

On Fri, Jul 29, 2011 at 1:59 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Is this what you are looking for? http://hadoop.apache.org/common/docs/current/mapred_tutorial.html -- search for JobConf.

On Fri, Jul 29, 2011 at 1:51 PM, Roger Chen rogc...@ucdavis.edu wrote: Thanks for the response! However, I'm having an issue with the line Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf); because conf has private access in org.apache.hadoop.conf.Configured.

On Fri, Jul 29, 2011 at 11:18 AM, Mapred Learn mapred.le...@gmail.com wrote: I hope my previous reply helps...

On Fri, Jul 29, 2011 at 11:11 AM, Roger Chen rogc...@ucdavis.edu wrote: After moving it to the distributed cache, how would I call it within my MapReduce program?

On Fri, Jul 29, 2011 at 11:09 AM, Mapred Learn mapred.le...@gmail.com wrote: Did you try using the -files option in your hadoop jar command, as in: /usr/bin/hadoop jar <jar name> <main class name> -files <absolute path of file to be added to distributed cache> <input dir> <output dir>

On Fri, Jul 29, 2011 at 11:05 AM, Roger Chen rogc...@ucdavis.edu wrote: Slight modification: I now know how to add files to the distributed file cache, which can be done via this command placed in the main or run class: DistributedCache.addCacheFile(new URI("/user/hadoop/thefile.dat"), conf); However I am still having trouble locating the file in the distributed cache. *How do I call the file path of thefile.dat in the distributed cache as a string?* I am using Hadoop 0.20.2.

On Fri, Jul 29, 2011 at 10:26 AM, Roger Chen rogc...@ucdavis.edu wrote: Hi all, does anybody have examples of how one moves files from the local file structure/HDFS to the distributed cache in MapReduce? A Google search turned up examples in Pig but not MR. -- Roger Chen, UC Davis Genome Center
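Putting the -files suggestion from the thread together, a sketch (names are illustrative; it requires the main class to run through ToolRunner/GenericOptionsParser so the generic options get parsed):

  hadoop jar myjob.jar com.example.MyJob \
    -files /local/path/thefile.dat \
    inputdir outputdir

-files ships the local file into the distributed cache and symlinks it into each task's working directory, so the task can open it by its bare name.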
Re: Hadoop cluster network requirement
On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote: Thanks, I'm independently doing some digging into Hadoop networking requirements and had a couple of quick follow-ups. Could I have some specific info on why different data centers cannot be supported for master node and data node comms? Also, what may be the benefits/use cases for such a scenario?

Most people who try to put the NN and DNs in different data centers are trying to achieve disaster recovery: one file system in multiple locations. That isn't the way HDFS is designed, and it will end in tears. There are multiple problems:

1) there is no guarantee that one block replica will be in each data center (thereby defeating the whole purpose!)
2) assuming one can work out problem 1, during a network break the NN will lose contact with one half of the DNs, causing a massive network replication storm
3) if one runs MR on top of this HDFS, the shuffle will likely kill the network in between (making MR performance pretty dreadful) and cause delays for the DN heartbeats
4) I don't even want to think about rebalancing.

... and I'm sure a lot of other problems I'm forgetting at the moment. So don't do it. If you want disaster recovery, set up two completely separate HDFSes and run everything in parallel.
Re: TaskTrackers behind NAT
On Jul 18, 2011, at 12:53 PM, Ben Clay wrote: I'd like to spread Hadoop across two physical clusters, one which is publicly accessible and the other which is behind a NAT. The NAT'd machines will only run TaskTrackers, not HDFS, and not Reducers either (configured with 0 Reduce slots). The master node will run in the publicly-available cluster.

Off the top, I doubt it will work: MR is bi-directional, across many random ports. So I would suspect there is going to be a lot of hackiness in the network config to make this work.

1. Port 50060 needs to be opened for all NAT'd machines, since Reduce tasks fetch intermediate data from http://<tasktracker>:50060/mapOutput, correct? I'm getting "Too many fetch-failures" with no open ports, so I assume the Reduce tasks need to pull the intermediate data instead of Map tasks pushing it.

Correct. Reduce tasks pull.

2. Although the NAT'd machines have unique IPs and reach the outside, the DHCP is not assigning them hostnames. Therefore, when they join the JobTracker I get tracker_localhost.localdomain:localhost.localdomain/127.0.0.1 on the machine list page. Is there some way to force Hadoop to refer to them via IP instead of hostname, since I don't have control over the DHCP? I could manually assign a hostname via /etc/hosts on each NAT'd machine, but these are actually VMs and I will have many of them receiving semi-random IPs, making this an ugly administrative task.

Short answer: no. Long answer: no, fix your DHCP and/or do the /etc/hosts hack.
Re: Which release to use?
On Jul 18, 2011, at 5:01 PM, Rita wrote: I made the big mistake of using the latest version, 0.21.0, and found a bunch of bugs, so I got pissed off at hdfs. Then, after reading this thread, it seems I should have used 0.20.x. I really wish we could fix this on the website, stating 0.21.0 as unstable.

It is stated in a few places on the website that 0.21 isn't stable:

http://hadoop.apache.org/common/releases.html#23+August%2C+2010%3A+release+0.21.0+available "It has not undergone testing at scale and should not be considered stable or suitable for production."

... and ...

http://hadoop.apache.org/common/releases.html#Download "0.21.X - unstable, unsupported, does not include security"

and it isn't in the stable directory on the apache download mirrors.
Re: Which release to use?
On Jul 18, 2011, at 6:02 PM, Rita wrote: I am a dimwit. We are conditioned by marketing that a higher number is always better. Experience tells us that this is not necessarily true.
Re: Lack of data locality in Hadoop-0.20.2
On Jul 12, 2011, at 10:27 AM, Virajith Jalaparti wrote: I agree that the scheduler has lesser leeway when the replication factor is 1. However, I would still expect the number of data-local tasks to be more than 10% even when the replication factor is 1. How did you load your data? Did you load it from outside the grid or from one of the datanodes? If you loaded from one of the datanodes, you'll basically have no real locality, especially with a rep factor of 1.
Re: How to query a slave node for monitoring information
On Jul 12, 2011, at 3:02 PM, samdispmail-tru...@yahoo.com wrote: I am new to Hadoop, and I apologize if this was answered before, or if this is not the right list for my question.

common-user@ would likely have been better, but I'm too lazy to forward you there today. :)

I am trying to do the following: 1- Read monitoring information from slave nodes in hadoop 2- Process the data to detect node failures (node crash, problems in requests ... etc) and decide if I need to restart the whole machine. 3- Restart the machine running the slave facing problems

At scale, one doesn't monitor individual nodes for up/down. Verifying the up/down of a given node will drive you insane and is pretty much a waste of time unless the grid itself is under-configured to the point that *every* *node* *counts*. (If that is the case, then there are bigger issues afoot...) Instead, one should monitor the namenode and jobtracker and alert based on a percentage of availability. This can be done in a variety of ways, depending upon which version of Hadoop is in play. For 0.20.2, a simple screen scrape is good enough. I recommend warn on 10%, alert on 20%, panic on 30%.

My question is for step 1- collecting monitoring information. I have checked Hadoop monitoring features. But currently you can forward the monitoring data to files, or to Ganglia.

Do you want monitoring information or metrics information? Ganglia is purely a metrics tool. Metrics are a different animal. While it is possible to alert on them, in most cases they aren't particularly useful in a monitoring context other than up/down.
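A rough sketch of the screen-scrape approach for an 0.20-era JobTracker (hostname, page names, and markup here are assumptions -- verify against your own web UI before alerting on it):

  ACTIVE=$(curl -s 'http://jobtracker:50030/machines.jsp?type=active' | grep -c 'tracker_')
  BLACK=$(curl -s 'http://jobtracker:50030/machines.jsp?type=blacklisted' | grep -c 'tracker_')
  TOTAL=$((ACTIVE + BLACK))
  # warn at 10% unavailable, per the thresholds above
  [ "$TOTAL" -gt 0 ] && [ $((100 * BLACK / TOTAL)) -ge 10 ] && echo 'WARN: >=10% of tasktrackers blacklisted'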
Re: Cluster Tuning
On Jul 11, 2011, at 9:28 AM, Juan P. wrote:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx400m</value>
</property>

Single core machines with 600MB of RAM. 2x400m = 800m just for the heap of the map and reduce phases, not counting the other memory that the JVM will need. io buffer sizes aren't adjusted downward either, so you're likely looking at a swapping + spills = death scenario. slowstart set to 1 is going to be pretty much required.
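A sketch of the direction to head in mapred-site.xml (values are illustrative for a 600MB node, not tested recommendations):

  <property><name>mapred.child.java.opts</name><value>-Xmx200m</value></property>
  <property><name>io.sort.mb</name><value>50</value></property>
  <property><name>mapred.reduce.slowstart.completed.maps</name><value>1.00</value></property>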
Re: Job Priority Hadoop 0.20.203
On Jul 6, 2011, at 5:22 AM, Nitin Khandelwal wrote: Hi, I am using Hadoop 0.20.203 with the new API (mapreduce package). I want to use JobPriority, but unfortunately there is no option to set that in Job (the option is there in 0.21.0). Can somebody please tell me if there is a workaround to set job priority?

Job priority is slowly (read: unofficially) on its way to getting deprecated, if one takes into account the fact that the capacity scheduler now completely ignores it in 203. I, too, am sad about this.
Re: Dead data nodes during job excution and failed tasks.
On Jun 30, 2011, at 10:01 AM, David Ginzburg wrote: Hi, I am running a certain job which constantly cause dead data nodes (who come back later, spontaneously ). Check your memory usage during the job run. Chances are good the DataNode is getting swapped out.
Re: How does Hadoop manage memory?
On Jun 28, 2011, at 1:43 PM, Peter Wolf wrote: Hello all, I am looking for the right thing to read... I am writing a MapReduce Speech Recognition application. I want to run many Speech Recognizers in parallel. Speech Recognizers not only use a large amount of processor, they also use a large amount of memory. Also, in my application, they are often idle much of the time waiting for data. So optimizing what runs when is non-trivial. I am trying to better understand how Hadoop manages resources. Does it automatically figure out the right number of mappers to instantiate? How?

The number of mappers correlates to the number of InputSplits, which is based upon the InputFormat. In most cases, this is equivalent to the number of blocks. So if a file is composed of 3 blocks, it will generate 3 mappers. Again, depending upon the InputFormat, the size of these splits may be manipulated via job settings.

What happens when other people are sharing the cluster? What resource management is the responsibility of application developers?

Realistically, *all* resource management is the responsibility of the operations and development teams. The only real resource protection/allocation system that Hadoop provides is task slots and, if enabled, some memory protection in the form of "don't go over this much." On multi-tenant systems, a "good neighbor" view of the world should be adopted.

For example, let's say each Speech Recognizer uses 500 MB, and I have 1,000,000 files to process. What would happen if I made 1,000,000 mappers, each with 1 Speech Recognizer?

At 1m mappers, the JobTracker would likely explode under the weight first unless the heap size was raised significantly. Each value that you see on the JT page--including those for each task--is kept in main memory.

Is it only non-optimal because of setup time, or would the system try to allocate 500GB of memory and explode?

If you have 1m map slots, yes, it would allocate .5TB of mem spread across each node.
Re: tasktracker maximum map tasks for a certain job
On Jun 21, 2011, at 9:52 AM, Jonathan Zukerman wrote: Hi, Is there a way to set the maximum map tasks for all tasktrackers in my cluster for a certain job? Most of my tasktrackers are configured to handle 4 maps concurrently, and most of my jobs don't care where does the map function run. But small part of my jobs requires that no two map functions will run from the same IP at the same time, i.e., the maximum number of map tasks for each tasktracker should be 1 for these jobs. http://wiki.apache.org/hadoop/FAQ#How_do_I_limit_.28or_increase.29_the_number_of_concurrent_tasks_a_job_may_have_running_total_at_a_time.3F
Re: controlling no. of mapper tasks
On Jun 20, 2011, at 12:24 PM, praveen.pe...@nokia.com wrote: Hi there, I know the client can send mapred.reduce.tasks to specify the number of reduce tasks and hadoop honours it, but mapred.map.tasks is not honoured by Hadoop. Is there any way to control the number of map tasks? What I noticed is that Hadoop is choosing too many mappers and there is extra overhead being added due to this. For example, when I have only 10 map tasks, my job finishes faster than when Hadoop chooses 191 map tasks. I have a 5-slave cluster and 10 tasks can run in parallel. I want to set both map and reduce tasks to be 10 for max efficiency.

http://wiki.apache.org/hadoop/FAQ#How_do_I_limit_.28or_increase.29_the_number_of_concurrent_tasks_a_job_may_have_running_total_at_a_time.3F
Re: Large startup time in remote MapReduce job
On Jun 21, 2011, at 2:02 PM, Harsh J wrote: If your jar does not contain code changes that need to get transmitted every time, you can consider placing them on the JT/TT classpaths ...

... which means you get to bounce your system every time you change code.

It's ugly, but if the jar filename remains the same there shouldn't need to be any bouncing. Doable if there's no activity at the replacement point of time?

I have yet to find a jar file that never, ever changes. It may take a year+, but it will change. The corollary is that you'll need to change it at the worst possible time and in an incompatible way such that older code will break and need to be upgraded. So don't do it. :)
Re: Large startup time in remote MapReduce job
On Jun 22, 2011, at 10:08 AM, Allen Wittenauer wrote: [quoting the exchange above] I have yet to find a jar file that never, ever changes.

(I suppose the exception to this rule is all the Java code at places like NASA, JPL, etc. involved with the space program. But that's totally cheating!)
Re: Upgrading namenode/secondary node hardware
On Jun 15, 2011, at 3:18 AM, Steve Loughran wrote: yes, put it in the same place on your HA storage and you may not even need to reconfigure it. If you didn't shut down the filesystem cleanly, you'll need to replay the edit logs. As a sidenote... Lots of weird incompatibilities have snuck into the code with the editslog between versions. You REALLY REALLY REALLY don't want to let editslog get processed by a different version. Shutdown the NN, DNs, etc. Then start the NN up to let it process the editslog. Then shutdown the NN again with a clean editslog and do the upgrade.
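The dance above, as a sketch with 0.20-era commands (run as the HDFS user; adjust for your script paths):

  stop-mapred.sh && stop-dfs.sh        # quiesce the cluster
  hadoop-daemon.sh start namenode      # old version replays and merges the edits log
  hadoop-daemon.sh stop namenode       # stop again, now with a clean edits log
  # ...switch to the new version, then:
  hadoop-daemon.sh start namenode -upgrade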
Re: NameNode heapsize
On Jun 13, 2011, at 5:52 AM, Steve Loughran wrote: Unless your cluster is bigger than Facebooks, you have too many small files +1 (I'm actually sort of surprised the NN is still standing with only 24mb. The gc logs would be interesting to look at.) I'd also likely increase the block size, distcp files to the new block size, and then replace the old files with the new files.
Re: Regarding reading image files in hadoop
On Jun 12, 2011, at 11:15 PM, Nitesh Kaushik wrote: Dear Sir/Madam, I am Nitesh Kaushik, working with an institution dealing in satellite images. I am trying to read bulk images using Hadoop but unfortunately that is not working; I am not getting any clue how to work with images in hadoop. Hadoop is working great with text files and producing good results. Please tell me how I can deal with image data in hadoop.

The typical problems with using images are: a) If you are using streaming, you'll likely need to encode the file or upgrade to a version of hadoop that supports binary streams over streaming. b) You either need a custom InputFormat or to set the split size to match the file size.
Re: Block size in HDFS
FYI, I've added this to the FAQ since it comes up every so often.
Re: concurrent job execution
On Jun 3, 2011, at 1:11 AM, Felix Sprick wrote: Hi, We are running MapReduce on Hbase tables and are trying to implement a scenario with MapReduce where tasks are submitted from a GUI application. This means that several users (currently 5-10) may use the system in parallel. In addition to moving from the FIFO/default scheduler, make sure your GUI application actually submits the jobs as different users. If they are the same user, most (all?) schedulers will still FIFO the jobs. (-Dhadoop.job.ugi in 0.20.2 and lower, doAs() in 0.20.203 and higher).
Re: Changing dfs.block.size
On Jun 6, 2011, at 12:09 PM, J. Ryan Earl wrote: Hello, So I have a question about changing dfs.block.size in $HADOOP_HOME/conf/hdfs-site.xml. I understand that when files are created, blocksizes can be modified from default. What happens if you modify the blocksize of an existing HDFS site? Do newly created files get the default blocksize and old files remain the same? Yes. Is there a way to change the blocksize of existing files; I'm assuming you could write MapReduce job to do it, but any build in facilities? You can use distcp to copy the files back onto the same fs in a new location. The new files should be in the new block size. Now you can move the new files where the old files used to live.
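For example, a sketch (this assumes distcp honors the client-side dfs.block.size for the files it writes, per the reply above -- verify the new files before deleting anything):

  hadoop distcp -D dfs.block.size=134217728 /data/old /data/old.newbs
  hadoop fs -mv /data/old /data/old.bak
  hadoop fs -mv /data/old.newbs /data/old
  # verify, then clean up: hadoop fs -rmr /data/old.bak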
Re: Unable to start hadoop-0.20.2 but able to start hadoop-0.20.203 cluster
On May 27, 2011, at 7:26 AM, DAN wrote: You see you have 2 Solaris servers for now, and dfs.replication is set to 3. These don't match.

That doesn't matter. HDFS will basically flag any files written with a warning that they are under-replicated. The problem is that the datanode processes aren't running and/or aren't communicating with the namenode. That's what the "java.io.IOException: File /tmp/hadoop-cfadm/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1" means.

It should also be pointed out that writing to /tmp (the default) is a bad idea. This should get changed.

Also, since you are running Solaris, check the FAQ on some settings you'll need to do in order to make Hadoop's broken username detection work properly, amongst other things.
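A sketch of the properties to move off /tmp (0.20-era key names; the paths are illustrative):

  <!-- core-site.xml -->
  <property><name>hadoop.tmp.dir</name><value>/export/hadoop/tmp</value></property>
  <!-- hdfs-site.xml -->
  <property><name>dfs.name.dir</name><value>/export/hadoop/name</value></property>
  <property><name>dfs.data.dir</name><value>/export/hadoop/data</value></property>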
Re: Unable to start hadoop-0.20.2 but able to start hadoop-0.20.203 cluster
On May 27, 2011, at 1:18 PM, Xu, Richard wrote: Hi Allen, Thanks a lot for your response. I agree with you that it does not matter with replication settings. What really bothered me is same environment, same configures, hadoop 0.20.203 takes us 3 mins, why 0.20.2 took 3 days. Can you pls. shed more light on how to make Hadoop's broken username detection to work properly? It's in the FAQ so that I don't have to do that. http://wiki.apache.org/hadoop/FAQ Also, check your logs. All your logs. Not just the namenode log.
Re: Hadoop tool-kit for monitoring
On May 17, 2011, at 1:01 PM, Mark question wrote: Hi, I need to use hadoop-tool-kit for monitoring. So I followed http://code.google.com/p/hadoop-toolkit/source/checkout and applied the patch in my hadoop.20.2 directory as: patch -p0 < patch.20.2

Looking at the code, be aware this is going to give incorrect results/suggestions for certain stats it generates when multiple jobs are running. It also seems to lack the "algorithm should be rewritten" and "the data was loaded incorrectly" suggestions, which are usually the proper answer for perf problems 80% of the time.
Re: Hadoop tool-kit for monitoring
On May 17, 2011, at 3:11 PM, Mark question wrote: So what other memory consumption tools do you suggest? I don't want to do it manually and dump statistics into a file because IO will affect performance too.

We watch memory with Ganglia. We also tune our systems such that a task will only take X amount. In other words, given 8gb of RAM:

1gb for the OS
1gb for the TT and DN
6gb for all tasks

If we assume each task will take max 1gb, then we end up with 3 maps and 3 reducers. Keep in mind that the mem consumed is more than just JVM heap size.
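In mapred-site.xml terms, that budget comes out to roughly this (a sketch with 0.20-era keys; the heap is set below the 1gb per-task budget precisely because the JVM uses more than heap):

  <property><name>mapred.tasktracker.map.tasks.maximum</name><value>3</value></property>
  <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>3</value></property>
  <property><name>mapred.child.java.opts</name><value>-Xmx768m</value></property>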
Re: Rapid growth in Non DFS Used disk space
On May 13, 2011, at 10:48 AM, Todd Lipcon wrote: 2) Any ideas on what is driving the growth in Non DFS Used space? I looked for things like growing log files on the datanodes but didn't find anything. Logs are one possible culprit. Another is to look for old files that might be orphaned in your mapred.local.dir - there have been bugs in the past where we've leaked files. If you shut down the TaskTrackers, you can safely delete everything from within mapred.local.dirs. Part of our S.O.P. during Hadoop bounces is to wipe mapred.local out. The TT doesn't properly clean up after itself.
Re: Suggestions for swapping issue
On May 11, 2011, at 11:11 AM, Adi wrote: By our calculations hadoop should not exceed 70% of memory. Allocated per node - 48 map slots (24 GB), 12 reduce slots (6 GB), 1 GB each for DataNode/TaskTracker and one JobTracker, totalling 33/34 GB allocation.

It sounds like you are only taking into consideration the heap size. There is more memory allocated than just the heap...

The queues are capped at using only 90% of capacity allocated so generally 10% of slots are always kept free.

But that doesn't translate into how free the nodes are, as you've discovered. Individual nodes should be configured based on the assumption that *all* slots will be used.

The cluster was running a total of 33 mappers and 1 reducer, so around 8-9 mappers per node with a 3 GB max limit, and they were utilizing around 2GB each. Top was showing 100% memory utilized, which our sysadmin says is OK, as the memory is used for file caching by Linux if the processes are not using it.

Well, yes and no. What is the breakdown of that 100%? Is any of it actually allocated to buffer cache, or is it all user space?

No swapping on 3 nodes. Then node4 just started swapping after the number of processes shot up unexpectedly. The main mystery is the excess number of processes on the node which went down: 36 as opposed to the expected 11. The other 3 nodes were successfully executing the mappers without any memory/swap issues.

Likely speculative execution or something else. But again: don't build machines with the assumption that only x% of the slots will get used. There is no guarantee in the system that free slots will be balanced across all nodes... esp. when you take into consideration node failure.

-Adi

On Wed, May 11, 2011 at 1:40 PM, Michel Segel michael_se...@hotmail.com wrote: You have to do the math... If you have 2gb per mapper, and run 10 mappers per node... that means 20gb of memory. Then you have the TT and DN running, which also take memory... What did you set as the number of mappers/reducers per node? What do you see in ganglia or when you run top? Sent from a remote device. Please excuse any typos... Mike Segel

On May 11, 2011, at 12:31 PM, Adi adi.pan...@gmail.com wrote: Hello Hadoop Gurus, We are running a 4-node cluster. We just upgraded the RAM to 48 GB. We have allocated around 33-34 GB per node for hadoop processes, leaving the remaining 14-15 GB of memory for the OS and as buffer. There are no other processes running on these nodes. Most of the lighter jobs run successfully, but one big job is de-stabilizing the cluster: one node starts swapping, runs out of swap space, and goes offline. We tracked the processes on that node and noticed that it ends up with more than the expected number of hadoop java processes. The other 3 nodes were running 10 or 11 processes and this node ends up with 36. After killing the job we find these processes still show up and we have to kill them manually. We have tried reducing the swappiness to 6 but saw the same results. It also looks like hadoop stays well within the memory limits allocated and still starts swapping. Some other suggestions we have seen are: 1) Increase swap size. Current size is 6 GB. The most quoted size is 'tons of swap', but not sure how much that translates to in numbers. Should it be 16 or 24 GB? 2) Increase the overcommit ratio. Not sure if this helps, as a few blog comments mentioned it didn't. Any other hadoop or linux config suggestions are welcome. Thanks. -Adi
Re: configuration and FileSystem
On May 10, 2011, at 9:57 AM, Gang Luo wrote: I was confused by the configuration and file system in hadoop. when we create a FileSystem object and read/write something through it, are we writing to or reading from HDFS? Typically, yes. Could it be local file system? Yes. If yes, what determines which file system it is? Configuration object we used to create the FileSystem object? Yes. When I write mapreduce program which extends Configured implements Tool, I can get the right configuration by calling getConf() and use the FileSystem object to communicate with HDFS. What if I want to read/write HDFS in a separate Utility class? Where does the configuration come from? You need to supply it.
Re: our experiences with various filesystems and tuning options
On May 10, 2011, at 6:14 AM, Marcos Ortiz wrote: My preferred filesystem is ZFS; it's a shame that Linux support is still very immature. For that reason, I changed my PostgreSQL hosts to FreeBSD-8.0 to use ZFS as the filesystem, and it really rocks. Has anyone tested a Hadoop cluster with this filesystem?

On Solaris or FreeBSD? HDFS capacity numbers go really wonky on pooled storage systems like ZFS. Other than that, performance is more than acceptable vs. ext4. [Sorry, I don't have my benchmark numbers handy.]
Re: socket timeouts, dropped packages
On Apr 13, 2011, at 1:06 PM, Ferdy Galema wrote: @Allen/Arun My bad, I was not aware that cloudera releases could not be discussed here at all. I was thinking that even though cloudera releases are somewhat different, issues that are probably generic could still discussed here. (Surely I would use the cloudera lists when I'm pretty sure it's absolutely specific to cloudera). Unfortunately, all of the various forks of the Apache releases (regardless of where they come from) have diverged enough that the issues are rarely generic anymore, outside of those answered on the FAQ. :(
Re: namenode format error
On Apr 13, 2011, at 12:38 PM, Jeffrey Wang wrote: It's just in my home directory, which is an NFS mount. I moved it off NFS and it seems to work fine. Is there some reason it doesn't work with NFS? Locking on NFS--regardless of application--is a dice roll, especially when client/server are different OSes and platforms. Heck even when they are the same, it might not work. For example, on some older Linux kernels, flock() on NFS only created a lock locally and didn't tell the NFS server, allowing another client to claim a lock! (IIRC, fcntl did work...) Best bet is to make sure rpc.lockd (or whatever daemon provides NFS locking) is running/working and to use the highest compatible version of NFS. NFSv4, in particular, has file locking built into the protocol rather than as an 'external' service.
Re: HDFS Compatiblity
On Apr 5, 2011, at 5:22 AM, Matthew John wrote: Can HDFS run over a RAW DISK which is mounted over a mount point with no File System? Or does it interact only with a POSIX-compliant file system?

It needs a POSIX file system.
Re: Including Additional Jars
On Apr 4, 2011, at 8:06 AM, Shuja Rehman wrote: Hi All, I have created a map reduce job and, to run it on the cluster, I have bundled all jars (hadoop, hbase, etc.) into a single jar, which increases the size of the overall file. During the development process, I need to copy this complete file again and again, which is very time consuming, so is there any way that I can copy just the program jar and not the lib files every time? I am using NetBeans to develop the program. Kindly let me know how to solve this issue.

This was in the FAQ, but in a non-obvious place. I've updated it to be more visible (hopefully): http://wiki.apache.org/hadoop/FAQ#How_do_I_submit_extra_content_.28jars.2C_static_files.2C_etc.29_for_my_job_to_use_during_runtime.3F
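A sketch of that FAQ answer in action (names are illustrative; the main class must use ToolRunner/GenericOptionsParser for -libjars to be parsed):

  hadoop jar myprogram.jar com.example.MyJob \
    -libjars /local/lib/hbase.jar,/local/lib/zookeeper.jar \
    inputdir outputdir

You then recopy only the small myprogram.jar on each iteration; the library jars are shipped to the distributed cache separately.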
Re: Question on Streaming with RVM. Perhaps environment settings related.
On Apr 1, 2011, at 9:47 PM, Guang-Nan Cheng wrote: The problem is the MapRed process seems unable to load RVM. I added /etc/profile.d/rvm.sh in hadoop-env.sh, but the script still fails due to the same error.

Add this to the .bashrc:

  [ -x /etc/profile ] && . /etc/profile

and make sure /etc/profile is executable.
Re: Problem in Job/TasK scheduling
On Apr 1, 2011, at 3:03 AM, Nitin Khandelwal wrote: Hi, I am right now stuck on the issue of division of tasks among slaves for a job. Currently, as far as I know, hadoop does not allow us to fix/determine in advance how many tasks of a job would run on each slave. I am trying to design a system, where I want a slave to execute only one task of one job type , however at the same time , be able to accept/execute task of jobs of other types. Is this kind of scheduling possible in hadoop. This is in the FAQ: http://wiki.apache.org/hadoop/FAQ#How_do_I_limit_.28or_increase.29_the_number_of_concurrent_tasks_running_on_a_node.3F
Re: distributed cache exceeding local.cache.size
On Mar 31, 2011, at 11:45 AM, Travis Crawford wrote: Is anyone familiar with how the distributed cache deals when datasets larger than the total cache size are referenced? I've disabled the job that caused this situation but am wondering if I can configure things more defensively. I've started building specific file systems on drives to store the map reduce spill space. It seems to be the only reliable way to prevent MR from going nuts. Sure, some jobs may fail, but that seems to be a better strategy than the alternative.
Re: Is anyone running Hadoop 0.21.0 on Solaris 10 X64?
On Mar 31, 2011, at 7:43 AM, XiaoboGu wrote: I have trouble browsing the file system via the namenode web interface; the namenode says in the log file that the -G option is invalid when getting the groups for the user.

I don't, but I suspect you'll need to enable one of the POSIX personalities before launching the namenode. In particular, this means putting /usr/xpg4/bin or /usr/xpg6/bin in the PATH prior to the SysV /usr/bin entry.
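For example, in hadoop-env.sh on the Solaris box (a sketch):

  # put the POSIX personalities ahead of the SysV /usr/bin tools
  export PATH=/usr/xpg6/bin:/usr/xpg4/bin:$PATH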
Re: changing node's rack
On Mar 26, 2011, at 3:50 PM, Ted Dunning wrote: I think that the namenode remembers the rack. Restarting the datanode doesn't make it forget. Correct. https://issues.apache.org/jira/browse/HDFS-870
Re: A way to monitor HDFS for a file to come live, and then kick off a job?
On Mar 24, 2011, at 10:09 AM, Jonathan Coveney wrote: I am not sure if this is the right listserv, forgive me if it is not.

A better choice would likely be hdfs-user@, since this is really about watching files in HDFS.

My goal is this: monitor HDFS until a file is created, and then kick off a job. Ideally I'd want to do this continuously, but the file would be created hourly (with some sort of variance). I guess I could make a script that would ping the server every 5 minutes or something, but I was wondering if there might be a more elegant way?

Two ways off the top of my head: 1) Read/watch the edits stream 2) Read/watch the HDFS audit log

Given the latter is text built by log4j, that should be relatively simple to implement. There was a JIRA asking for this functionality to be built in recently, btw.
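A sketch of option 2 (the log path and line format here are assumptions -- check your log4j config; also note cmd=create is logged when the file is opened, not when the writer finishes, so allow for in-flight writes):

  tail -F /var/log/hadoop/hdfs-audit.log | while read -r line; do
    case "$line" in
      *cmd=create*src=/data/hourly/*)
        hadoop jar myjob.jar com.example.MyJob /data/hourly out ;;
    esac
  done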
Re: Datanode won't start with bad disk
On Mar 24, 2011, at 10:47 AM, Adam Phelps wrote: For reference, this is running hadoop 0.20.2 from the CDH3B4 distribution. Given that this isn't a standard Apache release, you'll likely be better served by asking Cloudera.
Re: CDH and Hadoop
On Mar 23, 2011, at 7:29 AM, Rita wrote: I have been wondering if I should use CDH (http://www.cloudera.com/hadoop/) instead of the standard Hadoop distribution. What do most people use? Is CDH free? do they provide the tars or does it provide source code and I simply compile? Can I have some data nodes as CDH and the rest as regular Hadoop? I think most of the larger sites are running some form of modified Apache release, in some cases having migrated off of a CDH release. At LinkedIn, we've been using the Apache 0.20.2 release with 2 patches related to the capacity scheduler for over a year now. In our case, I never deployed CDH, other than a test setup. I opted not to use CDH in the CDH2 and CDH3 beta time frame due to some patches that I felt were not of a high quality as well as the potential for vendor lock-in. But I haven't looked at it in probably a year.
Re: kill-task syntax?
On Mar 23, 2011, at 2:02 PM, Keith Wiley wrote: hadoop job -kill-task and -fail-task don't work for me. I get this kind of error: Exception in thread main java.lang.IllegalArgumentException: TaskAttemptId string : task_201103101623_12995_r_00 is not properly formed at org.apache.hadoop.mapreduce.TaskAttemptID.forName(TaskAttemptID.java:170) at org.apache.hadoop.mapred.TaskAttemptID.forName(TaskAttemptID.java:108) at org.apache.hadoop.mapred.JobClient.run(JobClient.java:1787) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1956) How is that not a properly formed task id? I copied it straight off the job tracker. Killing a job generally works for me, although much to my dismay the job often doesn't actually die, which is why I'm now attempting to kill the underlying tasks directly.

It actually wants the attempt id, not the task id. :( So attempt_201103101623_12995_r_00_0 would have worked. There should be a JIRA filed.
Re: keeping an active hdfs cluster balanced
On Mar 17, 2011, at 12:13 PM, Stuart Smith wrote: Parts of this may end up on the hbase list, but I thought I'd start here. My basic problem is: my cluster is getting full enough that having one data node go down does put a bit of pressure on the system (when balanced, every DN is more than half full).

Usually around the ~80% full mark is when HDFS starts getting a bit wonky on super-active grids. Your best bet is to either delete some data/store the data more efficiently, add more nodes, or upgrade the storage capacity of the nodes you have. The balancer is only going to save you for so long until the whole thing tips over.

Anybody here have any idea how badly running the balancer on a heavily active system messes things up? (for hdfs/hbase - if anyone knows).

I don't run HBase, but at Y! we used to run the balancer pretty much every day, even on super-active grids. It 'mostly works' until you get to the point of no return, which it sounds like you are heading for...

Any ideas? Or do I just need better hardware? Not sure if that's an option, though...

Depending upon how your systems are configured, something else to look at is how much space is getting eaten by logs, mapreduce spill space, etc. A good daemon bounce might free up some stale handles as well.
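For the daily balancer runs, something like this (a sketch; the bandwidth cap is the 0.20-era key and takes bytes/second):

  # hdfs-site.xml: keep the balancer polite on a busy grid
  #   <property><name>dfs.balance.bandwidthPerSec</name><value>1048576</value></property>
  start-balancer.sh -threshold 5    # balance until nodes are within 5% of the cluster mean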
Re: hadoop fs -rmr /*?
On Mar 16, 2011, at 10:35 AM, W.P. McNeill wrote: On HDFS, anyone can run hadoop fs -rmr /* and delete everything. In addition to what everyone else has said, I'm fairly certain that -rmr / is specifically safeguarded against. But /* might have slipped through the cracks. What are examples of systems people working with high-value HDFS data put in place so that they can sleep at night? I set in place crontabs where we randomly delete the entire file system to remind folks that HDFS is still immature. :D OK, not really. In reality, we have basically a policy that everyone signs off on before getting an account where they understand that Hadoop should not be considered 'primary storage', is not a data warehouse, is not backed up, and could disappear at any moment. But we also make sure that the base (ETL'd) data lives on multiple grids. Any other data should be reproducible from that base data.
Re: Changing a Namenode server
On Mar 16, 2011, at 7:17 AM, Lior Schachter wrote: Hi, We have been using hadoop/HDFS for some while and we want to upgrade the namenode server to a new (and stronger) machine. Can you please advise on the correct way to do this while minimizing system downtime and without losing any data.

It is pretty much the same procedure as if you had permanently lost the NN hardware, more or less:

1) Build new NN machine
2) Shutdown old NN
3) Copy fsimage and edits to new box
4) Swap IPs/move A record/etc
5) Mount backup area
6) Bring up NN
Re: Will blocks of an unclosed file get lost when HDFS client (or the HDFS cluster) crashes?
No. If a close hasn't been committed to the file, the associated blocks/files disappear in both client crash and namenode crash scenarios.

On Mar 13, 2011, at 10:09 PM, Sean Bigdatafun wrote: I meant an HDFS chunk (the size of 64MB), and I meant the version of 0.20.2 without the append patch. I think even without the append patch, the previous 64MB blocks (in my example, the first 5 blocks) should be safe. Isn't it?

On 3/13/11, Ted Dunning tdunn...@maprtech.com wrote: What do you mean by block? An HDFS chunk? Or a flushed write? The answer depends a bit on which version of HDFS / Hadoop you are using. With the append branches, things happen a lot more like what you expect. Without that version, it is difficult to say what will happen. Also, there are very few guarantees about what happens if the namenode crashes. There are some provisions for recovery, but none of them really have any sort of transactional guarantees. This means that there may be some unspecified time before the writes that you have done are actually persisted in a recoverable way.

On Sun, Mar 13, 2011 at 9:52 AM, Sean Bigdatafun sean.bigdata...@gmail.com wrote: Let's say an HDFS client starts writing a file A (which is 10 blocks long) and 5 blocks have been written to datanodes. At this time, if the HDFS client crashes (apparently without a close op), will we see 5 valid blocks for file A? Similarly, at this time if the HDFS cluster crashes, will we see 5 valid blocks for file A? (I guess both answers are yes, but I'd like some confirmation :-) -- --Sean
Re: TaskTracker not starting on all nodes
(Removing common-dev, because this isn't a dev question) On Feb 26, 2011, at 7:25 AM, bikash sharma wrote: Hi, I have a 10 nodes Hadoop cluster, where I am running some benchmarks for experiments. Surprisingly, when I initialize the Hadoop cluster (hadoop/bin/start-mapred.sh), in many instances, only some nodes have TaskTracker process up (seen using jps), while other nodes do not have TaskTrackers. Could anyone please explain? Check your logs.
Re: Problem running a Hadoop program with external libraries
On Mar 8, 2011, at 1:21 PM, Ratner, Alan S (IS) wrote: We had tried putting all the libraries directly in HDFS with a pointer in mapred-site.xml:

<property>
  <name>mapred.child.env</name>
  <value>LD_LIBRARY_PATH=/user/ngc/lib</value>
</property>

as described in https://issues.apache.org/jira/browse/HADOOP-2838, but this did not work for us.

Correct. This isn't expected to work. HDFS files are not directly accessible from the shell without some sort of action having taken place. In order for the above to work, anything reading the LD_LIBRARY_PATH environment variable would have to know that '/user/...' is a) inside HDFS and b) know how to access it. The reason why the distributed cache method works is because it pulls files from HDFS and places them in the local UNIX file system. From there, UNIX processes can now access them. HADOOP-2838 is really about providing a way for applications to get to libraries that are already installed at the UNIX level. (Although, in reality, it would likely be better if applications were linked with a better value provided for the runtime library search path -R/-rpath/ld.so.conf/crle/etc rather than using LD_LIBRARY_PATH.)
Re: LZO Packager on Centos 5.5 fails
On Mar 6, 2011, at 9:37 PM, phil young wrote: I do appreciate the body of work that this is all built on and your responses, but I am surprised that there isn't a more comprehensive set of How-Tos. I'll assist in writing up something like that when I get a chance. Are other people using CDH3B3 with 64-bit CentOS/RedHat, or is there something I'm missing? This is the Apache list, not the Cloudera list, so the question would probably better asked there.
Re: Is this a fair summary of HDFS failover?
I'm more than a little concerned that you missed the whole multiple directories--including a remote one--for the fsimage thing. That's probably the #1 thing that most of the big grids do to maintain the NN data. I can only remember one failure where the NFS copy wasn't used to recover a namenode in all the failures I've personally been involved (and that was an especially odd bug, not a NN failure, per se). The only reason to fall back to the 2ndary NN in 0.20 should be is if you've hit a similarly spectacular bug. Point blank: anyone who runs the NN without it writing to a remote copy doesn't know what they are doing. Also, until AvatarNode comes of age (which, from what I understand, FB has only been doing for very long themselves), there is no such thing as HA NN. We all have high hopes that it works out, but it likely isn't anywhere near ready for primetime yet. On Feb 14, 2011, at 2:52 PM, Mark Kerzner wrote: I completely agree, and I am using yours and the group's posting to define the direction and approaches, but I am also trying every solution - and I am beginning to do just that, the AvatarNode now. Thank you, Mark On Mon, Feb 14, 2011 at 4:43 PM, M. C. Srivas mcsri...@gmail.com wrote: I understand you are writing a book Hadoop in Practice. If so, its important that what's recommended in the book should be verified in practice. (I mean, beyond simply posting in this newsgroup - for instance, the recommendations on NN fail-over should be tried out first before writing about how to do it). Otherwise you won't know your recommendations really work or not. On Mon, Feb 14, 2011 at 12:31 PM, Mark Kerzner markkerz...@gmail.com wrote: Thank you, M. C. Srivas, that was enormously useful. I understand it now, but just to be complete, I have re-formulated my points according to your comments: - In 0.20 the Secondary NameNode performs snapshotting. Its data can be used to recreate the HDFS if the Primary NameNode fails. The procedure is manual and may take hours, and there is also data loss since the last snapshot; - In 0.21 there is a Backup Node (HADOOP-4539), which aims to help with HA and act as a cold spare. The data loss is less than with Secondary NN, but it is still manual and potentially error-prone, and it takes hours; - There is an AvatarNode patch available for 0.20, and Facebook runs its cluster that way, but the patch submitted to Apache requires testing and the developers adopting it must do some custom configurations and also exercise care in their work. As a conclusion, when building an HA HDFS cluster, one needs to follow the best practices outlined by Tom White http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf , and may still need to resort to specialized NSF filers for running the NameNode. Sincerely, Mark On Mon, Feb 14, 2011 at 11:50 AM, M. C. Srivas mcsri...@gmail.com wrote: The summary is quite inaccurate. On Mon, Feb 14, 2011 at 8:48 AM, Mark Kerzner markkerz...@gmail.com wrote: Hi, is it accurate to say that - In 0.20 the Secondary NameNode acts as a cold spare; it can be used to recreate the HDFS if the Primary NameNode fails, but with the delay of minutes if not hours, and there is also some data loss; The Secondary NN is not a spare. It is used to augment the work of the Primary, by offloading some of its work to another machine. The work offloaded is log rollup or checkpointing. This has been a source of constant confusion (some named it incorrectly as a secondary and now we are stuck with it). 
The Secondary NN certainly cannot take over for the Primary. That is not its purpose. Yes, there is data loss.

- In 0.21 there are streaming edits to a Backup Node (HADOOP-4539), which replaces the Secondary NameNode. The Backup Node can be used as a warm spare, with the failover being a matter of seconds. There can be multiple Backup Nodes, for additional insurance against failure, and previous best common practices apply to it;

There is no Backup NN in the manner you are thinking of. Failover to it is completely manual, requires a restart of the whole world, and takes about 2-3 hours to happen. If you are lucky, you may have only a little data loss (people have lost entire clusters due to this -- from what I understand, you are far better off resurrecting the Primary instead of trying to bring up a Backup NN). In any case, when you run it like you mention above, you will have to (a) make sure that the primary is dead, (b) edit hdfs-site.xml on *every* datanode to point to the new IP address of the backup, and restart each datanode, (c) wait for 2-3 hours for all the block-reports from every restarted DN to finish, (d) after that, restart all TT and the JT to connect to the new NN, and (e) finally, restart all the clients (eg, HBase, Oozie,
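[Editor's sketch] For reference, a minimal illustration of the multiple-directories setup described above. The paths are purely illustrative; the key is dfs.name.dir, and /mnt/nfs/namenode is assumed to be a mount from a remote filer:

<!-- hdfs-site.xml: the NN writes the fsimage and edits log to every
     directory in this list, so the NFS copy survives a dead box. -->
<property>
  <name>dfs.name.dir</name>
  <value>/local/hadoop/namenode,/mnt/nfs/namenode</value>
</property>

With this in place, recovering from a dead NN box is a matter of pointing a replacement machine at the NFS copy rather than falling back to the 2ndary NN's checkpoint.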
Re: Externalizing Hadoop configuration files
See also https://issues.apache.org/jira/browse/HADOOP-5670 . On Feb 16, 2011, at 3:32 PM, Ted Yu wrote: You need to develop the externalization yourself. Our installer uses placeholders such as:

<property>
  <name>fs.checkpoint.dir</name>
  <value>dataFolderPlaceHolder/dfs/namesecondary,backupFolderPlaceHolder/namesecondary</value>
</property>

They would be resolved at time of deployment.

On Wed, Feb 16, 2011 at 3:06 PM, Jing Zang purplehear...@hotmail.com wrote: How do I externalize the parameters in hadoop configuration files to make it easier for the operations team? Ideally something like:

File: mapred-site.xml
<property><name>mapred.job.tracker</name><value>${hadoop.mapred}</value></property>

File: hadoop.conf (external to the app)
hadoop.mapred=hadoop:9001

Anything as easy as Spring's PropertyPlaceholderConfigurer or similar? How do you guys externalize the parameters to a configuration file outside the app? Thanks.
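[Editor's sketch] One hedged way to do the resolution at deployment time is plain sed over a template; the template filename, token names, and target paths below are just the examples from this thread, not anything Hadoop itself knows about:

# deploy-conf.sh: stamp out a real hdfs-site.xml from a template.
sed -e 's|dataFolderPlaceHolder|/grid/0|g' \
    -e 's|backupFolderPlaceHolder|/grid/backup|g' \
    hdfs-site.xml.template > $HADOOP_CONF_DIR/hdfs-site.xml

Hadoop does expand ${...} variables that are defined elsewhere in its own configuration resources (or as Java system properties), but it won't read an arbitrary external properties file, hence the pre-processing step.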
Re: User History Location
On Feb 11, 2011, at 6:12 AM, Alexander Schätzle wrote: Hello, I'm a little bit confused about the right key for specifying the User History Location in CDH3B3 (which is Hadoop 0.20.2+737). Could anybody please give me a short answer which key is the right one and which configuration file is the right one to place the key? 1) mapreduce.job.userhistorylocation ? 2) hadoop.job.history.user.location ? Is the mapred-site.xml the right config-file for this key? I don't know about Cloudera, but on the real Hadoop, you don't want to write these to HDFS. If the write fails (which can happen under a variety of circumstances), all job logging gets turned off.
Re: hadoop infrastructure questions (production environment)
On Feb 8, 2011, at 7:45 AM, Oleg Ruchovets wrote: Hi, we are going to production and have some questions to ask: We are using the 0.20_append version (as I understand it is an hbase 0.90 requirement). 1) Currently we have to process 50GB of text files per day, and it can grow to 150GB -- what is the best hadoop file size for our load, and is there a suggested disk block size for that size? What is the retention policy? You'll likely want something bigger than the default 64MB, though, if the plan is to keep the files indefinitely and you process them at every job run. -- We worked using gz and I saw that for every file 1 map task was assigned. Correct. gzip files are not splittable. What is the best practice: to work with gz files and save disk space, or work without archiving? I'd recommend converting them to something else (such as a SequenceFile) that can be block compressed (see the sketch after this message). Let's say we want to get performance benefits and disk space is less critical. Then uncompress them. :D 2) Currently, adding an additional machine to the grid means we have to manually maintain all files and configurations. Is it possible to auto-deploy hadoop servers without the need to manually define each one on all nodes? We basically use bcfg2 to manage different sets of configuration files (NN, JT, DN/TT, and gateway). When a node is brought up, bcfg2 detects, based upon the disk configuration, what kind of machine it is and applies the appropriate configurations. The only files that really need to get changed when you add/subtract nodes are the ones on the NN and JT. The rest of the nodes are oblivious to the rest. 3) Can we change masters without reinstalling the entire grid? Yes. Depending upon how you manage the service movement (see a previous discussion on this last week), at most you'll need to bounce the DN and TT processes.
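[Editor's sketch] One hedged way to do the gz-to-SequenceFile conversion is an identity streaming job that writes block-compressed SequenceFile output; the paths are made up, and these are 0.20-era property names:

# Re-write gzipped text as block-compressed SequenceFiles.
# Streaming decompresses .gz inputs transparently; the output
# settings below do the block compression.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -Dmapred.output.compress=true \
  -Dmapred.output.compression.type=BLOCK \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec \
  -input /logs/raw/2011-02-08 \
  -output /logs/seq/2011-02-08 \
  -mapper cat \
  -reducer NONE \
  -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat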
Re: HDFS drive, partition best practice
On Feb 8, 2011, at 11:33 AM, Adam Phelps wrote: On 2/7/11 2:06 PM, Jonathan Disher wrote: Currently I have a 48 node cluster using Dell R710's with 12 disks - two 250GB SATA drives in RAID1 for OS, and ten 1TB SATA disks as a JBOD (mounted on /data/0 through /data/9) and listed separately in hdfs-site.xml. It works... mostly. The big issue you will encounter is losing a disk - the DataNode process will crash, and if you comment out the affected drive, when you replace it you will have 9 disks full to N% and one empty disk. If the DataNode is going down after a single disk failure then you probably haven't set dfs.datanode.failed.volumes.tolerated in hdfs-site.xml. You can up that number to allow the DataNode to tolerate dead drives. a) only if you have a version that supports it, and b) that only protects you on the DN side. The TT is, AFAIK, still susceptible to drive failures.
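[Editor's sketch] For versions that do support it, the knob looks like this; the value 1 is just an example, meaning the DN keeps running until a second volume dies:

<!-- hdfs-site.xml -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>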
Re: Multiple various streaming questions
On Feb 4, 2011, at 7:46 AM, Keith Wiley wrote: I have since discovered that in the case of streaming, mapred.map.tasks is a good way to achieve this goal. Ironically, if I recall correctly, this seemingly obvious method for setting the number of mappers did not work so well in my original nonstreaming case, which is why I resorted to the rather contrived method of calculating and setting mapred.max.split.size instead. mapred.map.tasks basically kicks in if the input size is less than a block. (OK, it is technically more complex than that, but ... whatever). Given what you said in the other thread, this makes a lot more sense now as to what is going on. Because all slots are not in use. It's a very large cluster and it's excruciating that Hadoop partially serializes a job by piling multiple map tasks onto a single map slot in a queue even when the cluster is massively underutilized. Well, sort of. The only input hadoop has to go on is your filename input, which is relatively tiny. So of course it is going to underutilize. This makes sense now. :)
Re: changing the block size
On Feb 6, 2011, at 2:24 PM, Rita wrote: So, what I did was decommission a node, remove all of its data (rm -rf data.dir), and stop the hdfs process on it. Then I made the change to conf/hdfs-site.xml on the data node and restarted the datanode. I then ran the balancer for the change to take effect, and I am still getting 64MB files instead of 128MB. :-/ Right. As previously mentioned, changing the block size does not change the blocks of previously written files. In other words, changing the block size does not act as a merging function at the datanode level. In order to change pre-existing files, you'll need to copy the files to a new location, delete the old ones, then mv the new versions back.
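[Editor's sketch] A hedged version of that copy/delete/mv dance, assuming a 128MB target and illustrative paths. FsShell runs through ToolRunner, so a -D override on the copy should take effect for the newly written file:

# 134217728 = 128MB. The copy is re-written through the client,
# so it picks up the new block size; the original is untouched.
hadoop fs -Ddfs.block.size=134217728 -cp /data/big.file /data/big.file.new
hadoop fs -rm /data/big.file
hadoop fs -mv /data/big.file.new /data/big.file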
Re: Streaming data locality
On Feb 3, 2011, at 9:16 AM, Keith Wiley wrote: I've seen this asked before, but haven't seen a response yet. If the input to a streaming job is not actual data splits but simple HDFS file names which are then read by the mappers, then how can data locality be achieved? If I understand your question, the method of processing doesn't matter. The JobTracker places tasks based on input locality. So if you are providing the names of the files you want as input via -input, then the JT will use the locations of those blocks. (Remember: streaming.jar is basically a big wrapper around the Java methods, and the parameters you pass to it are essentially the same as you'd provide to a real Java app.) Or are you saying your -input is a list of other files to read? In that case, there is no locality. But again, streaming or otherwise makes no real difference. Likewise, is there any easier way to make those files accessible other than using the -cacheFile flag? That requires building a very very long hadoop command (100s of files potentially). I'm worried about overstepping some command-line length limit... plus it would be nice to do this programmatically, say with the DistributedCache.addCacheFile() command, but that requires writing your own driver, which I don't see how to do with streaming. Thoughts? I think you need to give a more concrete example of what you are doing. -cache is used for sending files with your job and should have no bearing on what your input is to your job. Something tells me that you've cooked something up that is overly complex. :D
Re: Multiple various streaming questions
On Feb 1, 2011, at 11:40 PM, Keith Wiley wrote: I would really appreciate any help people can offer on the following matters. When running a streaming job, -D, -files, -libjars, and -archives don't seem to work, but -jobconf, -file, -cacheFile, and -cacheArchive do. With the first four parameters anywhere in the command I always get a Streaming Command Failed! error. The last four work though. Note that some of those parameters (-files) do work when I run a Hadoop job in the normal framework, just not when I specify the streaming jar.

There are some issues with how the streaming jar processes the command line, especially in 0.20, in that the options need to be in the correct order. In general, the -D's need to be *before* the rest of the streaming params. This is what works for me:

hadoop \
  jar \
  `ls $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar` \
  -Dmapred.reduce.tasks.speculative.execution=false \
  -Dmapred.map.tasks.speculative.execution=false \
  -Dmapred.job.name="oh noes aw is doing perl again" \
  -input ${ATTEMPTIN} \
  -output ${ATTEMPTOUT} \
  -mapper map.pl \
  -reducer reduce.pl \
  -file jobsvs-map1.pl \
  -file jobsvs-reduce1.pl

I have found examples online, but they always reference built-in classes. If I try to use my own class, the job tracker produces a Cannot run program org.uw.astro.coadd.Reducer2: java.io.IOException: error=2, No such file or directory error. I wouldn't be surprised if it is a bug. It might be worthwhile to dig into the streaming jar to figure out how it determines whether something is a class or not. [It might even do something dumb like is it org.apache.blah?]

How do I force a single record (input file) to be processed by a single mapper to get maximum parallelism? All I found online was this terse description (of an example that gzips files, not my application):
• Generate a file containing the full HDFS path of the input files. Each map task would get one file name as input.
• Create a mapper script which, given a filename, will get the file to local disk, gzip the file and put it back in the desired output directory
These work, but are less than ideal. I don't understand exactly what that means and how to go about doing it. In the normal Hadoop framework I have achieved this goal by setting mapred.max.split.size small enough that only one input record fits (about 6MB), but I tried that with my streaming job via -jobconf mapred.max.split.size=X, where X is a very low number, about as large as a single streaming input record (which in the streaming case is not 6MB, but merely ~100 bytes, just a filename referenced via -cacheFile), but it didn't work; it sent multiple records to each mapper anyway.

What you actually want to do is set mapred.min.split.size to an extremely high value, as sketched below. Setting max.split.size only works on Combined- and MultiFile- InputFormat for some reason. Also, you might be able to change the inputformat. My experiences with doing this are Not Good(tm).

Achieving 1-to-1 parallelism between map tasks, nodes, and input records is very important because my map tasks take a very long time to run, upwards of an hour. I cannot have them queueing up on a small number of nodes while there are numerous unused nodes (task slots) available to be doing work. If all the task slots are in use, why would you care if they are queueing up? Also keep in mind that if a node fails, that work will need to get re-done anyway.
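[Editor's sketch] To make the min-split trick concrete: splits never span files, so if the minimum split size is larger than any single input file, each file becomes exactly one split, and therefore one map task. The value below is just "absurdly large" (1TB):

# one map task per input file
-jobconf mapred.min.split.size=1099511627776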
Re: Best way to limit the number of concurrent tasks per job on hadoop 0.20.2
On Jan 25, 2011, at 12:48 PM, Renaud Delbru wrote: As it seems that the capacity and fair schedulers in hadoop 0.20.2 do not allow a hard upper limit on the number of concurrent tasks, does anybody know any other solutions to achieve this? The specific change for the capacity scheduler has been backported to 0.20.2 as part of https://issues.apache.org/jira/browse/MAPREDUCE-1105 . Note that you'll also need https://issues.apache.org/jira/browse/MAPREDUCE-1160 , which fixes a logging bug in the JobTracker. Otherwise your logs will fill up.
Re: Draining/Decommisioning a tasktracker
On Jan 28, 2011, at 1:09 AM, rishi pathak wrote: Hi, Is there a way to drain a tasktracker? What we require is to not schedule any more map/red tasks onto a tasktracker (mark it offline), while the already running tasks remain unaffected. Decommissioning of task trackers was added in 0.21.
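[Editor's sketch] If I'm remembering the 0.21 mechanics right, it mirrors DataNode decommissioning: point the JT at an exclude file, add the host, and refresh. The exclude file path here is made up:

<!-- mapred-site.xml -->
<property>
  <name>mapred.hosts.exclude</name>
  <value>/etc/hadoop/mapred.exclude</value>
</property>

# then add the tasktracker's hostname to that file and run:
hadoop mradmin -refreshNodes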
Re: Thread safety issues with JNI/native code from map tasks?
On Jan 28, 2011, at 5:47 PM, Keith Wiley wrote: On Jan 28, 2011, at 15:50, Greg Roelofs wrote: Does your .so depend on any other potentially thread-unsafe .so that other (non-Hadoop) processes might be using? System libraries like zlib are safe (else they wouldn't make very good system libraries), but maybe some other research library or something? (That's a long shot, but I'm pretty much grasping at straws here.) Yeah, I dunno. It's a very complicated system that hits all kinds of popular conventional libraries: boost, eigen, countless other things. I doubt any of it is being accessed concurrently, however. This is a dedicated cluster, so if my task is the only one running, then it's only concurrent with the OS itself (and the JVM and Hadoop). By chance, do you have JVM reuse turned on?
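[Editor's sketch] For context, JVM reuse is a single knob; any value other than 1 keeps a task JVM alive across tasks of the same job, which is exactly the situation where leftover state in a thread-unsafe .so can bite:

<!-- mapred-site.xml: 1 (the default) means a fresh JVM per task;
     -1 means reuse the JVM for an unlimited number of tasks. -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>1</value>
</property>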
Re: rsync + hdfs
On Jan 11, 2011, at 6:32 PM, Mag Gam wrote: Is there a tool similar to rsync for hdfs? I would like to check for: ctime and size (the default behavior of rsync) and then sync if necessary. This would be a nice feature to add in the dfs utilities. distcp is the closest.
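[Editor's sketch] A hedged example of the closest equivalent: distcp's -update flag skips files whose size (and checksum, where comparable) already match at the destination, which is rsync-like even though it doesn't look at ctime. The cluster addresses are illustrative:

hadoop distcp -update hdfs://nn1:8020/src/dir hdfs://nn2:8020/dest/dir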
Re: No locks available
On Jan 11, 2011, at 2:39 AM, Adarsh Sharma wrote: Dear all, Yesterday I was working on a cluster of 6 Hadoop nodes ( Load data, perform some jobs ). But today when I start my cluster I came across a problem on one of my datanodes. Are you running this on NFS? 2011-01-11 12:55:57,031 INFO org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: No locks available
Re: monit? daemontools? jsvc? something else?
On Jan 6, 2011, at 12:39 AM, Otis Gospodnetic wrote: In the case of Hadoop, no. There has usually been at least a core dump, message in syslog, message in datanode log, etc, etc. [You *do* have cores enabled, right?] Hm, cores enabled? What do you mean by that? Are you referring to the JVM heap dump -XX JVM argument (-XX:+HeapDumpOnOutOfMemoryError)? If not, I'm all eyes/ears! I'm talking about system-level core dumps, i.e., ulimit -c and friends. [I'm much more of a systems programmer than a java guy, so ... ] You can definitely write Java code that will make the JVM crash due to misuse of threading libraries. There are also CPU, kernel, and BIOS bugs that I've seen that cause the JVM to crash. Usually jstack or a core will lead the way to patching the system to work around these issues. We also have in place a monitor that checks the # of active nodes. If it falls below a certain percentage, then we get alerted and check on them en masse. Worrying about one or two nodes going down probably means you need more nodes. :D That's probably right. :) So what do you use for monitoring the # of active nodes? We currently have a custom plugin for Nagios that screen-scrapes the NN and JT web UIs. When a certain percentage of nodes dies, we get alerted that we need to take a look and start bringing stuff back up. [We used the same approach at Y!, so it does work at scale.] I'm hoping to replace this (and Ganglia) with something better over the next year ;)
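[Editor's sketch] What "cores enabled" means in practice; shell limits are per-process and inherited, so this has to be in effect wherever the daemons are launched (e.g. hadoop-env.sh or the init script):

ulimit -c unlimited                      # allow unlimited-size core dumps
cat /proc/sys/kernel/core_pattern        # where the kernel will write them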
Re: monit? daemontools? jsvc? something else?
On Jan 4, 2011, at 10:29 PM, Otis Gospodnetic wrote: Ah, more manual work! :( You guys never have JVM die just because? I just had a DN's JVM die the other day just because and with no obvious cause. Restarting it brought it back to life, everything recovered smoothly. Had some automated tool done the restart for me, I'd be even happier. In the case of Hadoop, no. There has usually been at least a core dump, message in syslog, message in datanode log, etc, etc. [You *do* have cores enabled, right?] We also have in place a monitor that checks the # of active nodes. If it falls below a certain percentage, then we get alerted and check on them en masse. Worrying about one or two nodes going down probably means you need more nodes. :D
Re: monit? daemontools? jsvc? something else?
On Jan 5, 2011, at 7:57 PM, Lance Norskog wrote: Isn't this what Ganglia is for? No. Ganglia does metrics, not monitoring. On 1/5/11, Allen Wittenauer awittena...@linkedin.com wrote: On Jan 4, 2011, at 10:29 PM, Otis Gospodnetic wrote: Ah, more manual work! :( You guys never have JVM die just because? I just had a DN's JVM die the other day just because and with no obvious cause. Restarting it brought it back to life, everything recovered smoothly. Had some automated tool done the restart for me, I'd be even happier. In the case of Hadoop, no. There has usually been at least a core dump, message in syslog, message in datanode log, etc, etc. [You *do* have cores enabled, right?] We also have in place a monitor that checks the # of active nodes. If it falls below a certain percentage, then we get alerted and check on them en masse. Worrying about one or two nodes going down probably means you need more nodes. :D -- Lance Norskog goks...@gmail.com
Re: any plans to deploy OSGi bundles on cluster?
On Jan 4, 2011, at 10:30 AM, Hiller, Dean (Contractor) wrote: I guess I meant in the setting for the number of tasks in a child JVM before teardown. In that case, it is nice to separate/unload my previous classes from the child JVM, which OSGi does. I was thinking we may use a 10 tasks / JVM setting, which I thought meant having a Child process run 10 tasks before shutting down; 4 may be from one job and 4 from a new job with conflicting classes, maybe. Will that work or is it not advised? I don't use the JVM re-use options, but I'm 99% certain that the task JVMs are not shared between jobs.
Re: When does Reduce job start
On Jan 4, 2011, at 10:53 AM, sagar naik wrote: The only reason I can think of not to start a reduce task is to avoid the unnecessary transfer of map output data in case of failures. Reduce tasks also eat slots while copying map output. On shared grids, this can be extremely bad behavior. Is there a way to quickly start the reduce task in such a case? What is the configuration param to change this behavior? mapred.reduce.slowstart.completed.maps. See http://wiki.apache.org/hadoop/LimitingTaskSlotUsage (from the FAQ 2.12/2.13 questions).
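[Editor's sketch] The parameter is a fraction of completed maps. A hedged example that holds reducers back until 80% of the maps have finished (the default in this era is 0.05, i.e. reducers start almost immediately):

<!-- mapred-site.xml -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>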
Re: monit? daemontools? jsvc? something else?
On Jan 3, 2011, at 2:22 AM, Otis Gospodnetic wrote: I see over on http://search-hadoop.com/?q=monit+daemontools that people *do* use tools like monit and daemontools (and a few other ones) to revive their Hadoop processes when they die. I'm not a fan of doing this for Hadoop processes, even TaskTrackers and DataNodes. The processes generally die for a reason, usually indicating that something is wrong with the box. Restarting those processes may potentially hide issues.
Re: Reduce Task Priority / Scheduler
On Dec 19, 2010, at 7:39 AM, Martin Becker wrote: Hello everybody, is there a possibility to make sure that certain/all reduce tasks, i.e. the reducers to certain keys, are executed in a specified order? This is Job internal, so the Job Scheduler is probably the wrong place to start? Does the order induced by the Comparable interface influence the execution order at all? It sounds like you should be using a different framework than map/reduce if you care about the execution order.
Re: Unbalanced disks - need to take down whole HDFS?
On Dec 16, 2010, at 6:32 AM, Erik Forsberg wrote: http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F has a solution, but it starts with Take down HDFS. Is that really necessary - shouldn't taking down just that datanode, moving around the blocks, and then starting the datanode be good enough, or will that mess up some datastructure in the namenode? It won't mess up the namenode, but it is going to be busy re-replicating the blocks that were on that datanode while you move stuff around. Depending upon how much data is involved, you might find that after you bring the datanode back up, the namenode will put you back in a state of major imbalance when it sends a wave of deletions for the extra replicas.
Re: Deprecated ... damaged?
On Dec 15, 2010, at 2:13 AM, maha wrote: Hi everyone, Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat which is supposed to put each file from the input directory in a SEPARATE split. Is there some reason you don't just use normal InputFormat with an extremely high min.split.size?
Re: Hadoop Certification Progamme
On Dec 15, 2010, at 9:26 AM, Konstantin Boudnik wrote: Hey, commit rights won't give you a nice looking certificate, would it? ;) Isn't that what Photoshop is for?
Re: files that don't really exist?
On Dec 13, 2010, at 3:14 PM, Seth Lepzelter wrote: Alright, a little further investigation along that line (thanks for the hint, can't believe I didn't think of that) shows that there's actually a carriage return character (%0D, aka \r) at the end of the filename. This falls into that "you never forget your first time" area. ;) I guess *, in hadoop's parlance, doesn't include a \r. That's a bug in HDFS, IMO. Please file a JIRA. I got a \r into the command line, -rmr'ed that, and it's now fixed. Awesome. :)
Re: files that don't really exist?
On Dec 13, 2010, at 8:51 AM, Seth Lepzelter wrote: I've got a smallish cluster of 12 nodes (up from 6) that we're using to dip our feet into hadoop. One of my users has a few directories in his HDFS home which he was using to test, and which exist according to a hadoop fs -ls of the home directory, ie: ... /user/ken/testoutput4 /user/ken/testoutput5 ... but if you do: hadoop fs -ls /user/ken/testoutput5 you get: ls: Cannot access /user/ken/testoutput5: No such file or directory. There is likely one or more spaces after the testoutput5 . Try using hadoop fs -ls /user/ken/*/* .
Re: Multicore Nodes
On Dec 11, 2010, at 3:09 AM, Rob Stewart wrote: Or - is there support in Hadoop for multi-core nodes? Be aware that writing a job that specifically uses multi-threaded tasks usually means that a) you probably aren't really doing map/reduce anymore, and b) the job will likely tickle bugs in the system, as some APIs are more thread-safe than others. (See HDFS-1526, for example.)
Re: Scheduler in Hadoop MR
On Dec 7, 2010, at 6:47 PM, Harsh J wrote: 1 - When we've two JobTrackers running simultaneously, each JobTracker is running in a separate process? You can't run simultaneous JobTrackers for the same data-cluster AFAIK; only one JT process can exist. Did you mean jobs? Sure you can. Configure the task trackers to talk to specific job trackers. This does, however, have a very negative impact on data locality.
Re: Topology script - does it take DNS names, or IPs, or both?
On Dec 2, 2010, at 11:41 PM, Erik Forsberg wrote: Hi! I'm a bit confused about the arguments for the network topology script you can specify with topology.script.file.name - are they IPs, or dns names, or does it vary depending on if the IP could be resolved into a name? You should make it work with both.
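[Editor's sketch] A minimal script that handles both, assuming a hypothetical two-column lookup file whose first column is either an IP or a hostname and whose second is the rack:

#!/bin/sh
# topology.sh: print one rack per argument; unknown hosts get a default.
# /etc/hadoop/topology.data (made up for this example) looks like:
#   10.1.2.3             /dc1/rack1
#   node001.example.com  /dc1/rack1
while [ $# -gt 0 ]; do
  rack=$(awk -v h="$1" '$1 == h { print $2 }' /etc/hadoop/topology.data)
  echo "${rack:-/default-rack}"
  shift
done

Listing each host under both its IP and its name, as above, is the easy way to satisfy "make it work with both".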
Re: Not a host:port pair: local
On Nov 19, 2010, at 5:56 PM, Skye Berghel wrote: All of the information I've seen online suggests that this is because mapreduce.jobtracker.address is set to local. However, in conf/mapred-site.xml I have

<property>
  <name>mapreduce.jobtracker.address</name>
  <value>myserver:</value>
  <description>the jobtracker server</description>
</property>

which means that the jobtracker shouldn't be set to local in the first place. Does anyone have any pointers? Check to make sure your XML is completely valid. You might have a missing / somewhere.
Re: 0.21 found interface but class was expected
[Yes, gmail people, this likely went to your junk folder.] On Nov 13, 2010, at 5:28 PM, Lance Norskog wrote: It is considered good manners :) Seriously, if you want to attract a community you have an obligation to tell them when you're going to jerk the rug out from under their feet. The rug has been jerked in various ways for every micro version since as long as I've been with Hadoop. Such jerkings have always (eventually) been for the positive, with a happy ending almost every time. No pain, no gain.

Oh, one other thing. Here we are, several fairly significant conferences later (both as the main focus and as one of the leading topics), and I still don't understand why people have concerns about attracting a community. When you have what seems like 100s of companies creating products either built around or integrating Hadoop (the full gamut, from stealth startups to Major Players like IBM), it doesn't really seem like that is much of an issue anymore.

At this point, I'm actually in the opposite camp: the community has grown TOO fast, to the point that major problems in the source won't be able to be fixed because folks will expect less breakage. This expectation is especially easy to form for sites with a few hundred nodes (or with enough frosting on top), because everything seems to be working for them. Many of them will not really understand that at super large scales, some things just don't work. In order to fix some of the issues, breakage will occur. The end result can either be a community divided into multiple camps due to forking, or a community that has learned to tolerate these minor inconveniences when they pop up. I for one would rather be in the latter, but it sure seems like some parts of the community (and in many ways, the ASF itself) would rather it be the former. Time will tell.
Re: 0.21 found interface but class was expected
On Nov 14, 2010, at 12:05 AM, Allen Wittenauer wrote: The rug has been jerked in various ways for every micro version since as long as I've been with Hadoop. s,micro,minor, But I'm sure you knew that.
Re: Support for different users and different installation directories on different machines?
On Nov 7, 2010, at 9:44 AM, Zhenhua Guo wrote: It seems that currently Hadoop assumes that it is installed under the same directory using the same userid on different machines. Currently, the following two things cannot be done without hacks (correct me if I am wrong): 1) install Hadoop as different users on different machines; 2) install Hadoop under different directories on different machines. Anybody had the same problem? How did you solve it? The start and stop scripts have directory layout assumptions. The alternative is to use hadoop-daemon.sh to manually start/stop each service on each node, as sketched below. To work around different users, you might need to modify the members list of the supergroup, but I don't remember if that is necessary or not. It should be noted that using different users and different directory layouts is not recommended, and you will likely end up being very very frustrated.
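[Editor's sketch] The manual per-node start, with every node free to have its own paths and owner (all paths illustrative):

# run on each node, as whatever user owns that node's install:
HADOOP_CONF_DIR=/opt/this-nodes-conf \
  /opt/this-nodes-hadoop/bin/hadoop-daemon.sh start datanode
# likewise: ... start tasktracker / stop datanode / etc.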
Re: Problem running the jar files from web page
On Nov 7, 2010, at 10:48 AM, sudharshan wrote: Hello, I am a newbie to the hadoop environment. I'm stuck with a problem running a jar file in hadoop. I have a mapreduce application for which I have created a web interface. The application runs perfectly in a command line terminal. But when the same command is issued from the web interface through a perl script, it doesn't work and exits with a 65280 error message. Is this the return code from system() in perl? Are you shifting the bits appropriately? [See http://perldoc.perl.org/functions/system.html for more details.]
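[Editor's note] To spell out the bit-shifting: system() returns the 16-bit wait status, and the program's real exit code lives in the high byte. 65280 is 0xFF00, so the wrapped command exited with status 255, not 65280. A tiny illustration (the diagnosis of *why* it exits 255 is a guess; environment differences such as PATH or HADOOP_HOME between your shell and the web server's user are a common culprit):

perl -le 'print 65280 >> 8'    # prints 255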