Re: maybe a bug in hadoop?
On Wed, 2009-06-10 at 13:56 +0100, stephen mulcahy wrote:

Hi, I've been running some tests with hadoop in pseudo-distributed mode. My config includes the following in conf/hdfs-site.xml:

  <property>
    <name>dfs.data.dir</name>
    <value>/hdfs/disk1, /hdfs/disk2, /hdfs/disk3, /hdfs/disk4</value>
  </property>

When I started running hadoop I was disappointed to see that only /hdfs/disk1 was created. It was only later I noticed that HADOOP_HOME now contained the following new directories:

  HADOOP_HOME/ /hdfs/disk2
  HADOOP_HOME/ /hdfs/disk3
  HADOOP_HOME/ /hdfs/disk4

I guess any leading spaces should be stripped out of the data dir names? Or maybe there is a reason for this behaviour. I thought I should mention it just in case.

I thought I'd double check...

  $ touch \ file\ with\ leading\ space
  $ touch file\ without\ leading\ space
  $ ls
   file with leading space
  file without leading space

... so stripping it out could mean you couldn't enter some valid directory names (but who has folders starting with a space?).

I'm sure my input wasn't very useful, but just a comment.

Tim Wintle
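A minimal sketch of what I assume is going on (paths here are just for illustration, not taken from the Hadoop source): a data dir whose value starts with a space gets treated as a relative path, so it is created under the daemon's working directory (HADOOP_HOME) rather than at the filesystem root.

  cd "$HADOOP_HOME"
  # simulate a data dir value of " /hdfs/disk2" (note the leading space)
  mkdir -p " /hdfs/disk2"
  # the directory turns up under HADOOP_HOME, inside a directory literally named " "
  find . -type d -name disk2        # prints "./ /hdfs/disk2"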
Re: Amazon Elastic MapReduce
On Fri, 2009-04-03 at 11:19 +0100, Steve Loughran wrote:

True, but this way nobody gets the opportunity to learn how to do it themselves, which can be a tactical error one comes to regret further down the line. By learning the pain of cluster management today, you get to keep it under control as your data grows.

Personally I don't want to have to learn (and especially not support in production) the EC2 / S3 part, so it does sound appealing.

On a side note, I'd hope that at some point they give some control over the priority of the overall job - on the level of "you can boot up these machines whenever you want" versus "boot up these machines now" - that should let them manage the load on their hardware and reduce costs (which I'd obviously expect them to pass on to the users of low-priority jobs). I'm not sure how that would fit into the "give me 10 nodes" method at the moment.

I am curious what bug patches AWS will supply, for they have been very silent on their hadoop work to date. I'm hoping it will involve security of EC2 images, but not expectant.
Re: How many people is using Hadoop Streaming ?
On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote:

1) I can pick the language that offers a different programming paradigm (e.g. I may choose a functional language, or logic programming, if they suit the problem better). In fact, I can even choose Erlang for the map() and Prolog for the reduce(). Mixing and matching gives me more room to optimise.

Agreed (as someone who has written mappers/reducers in Python, Perl, shell script and Scheme before).
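For anyone who hasn't tried it, a Streaming run looks roughly like this (a sketch - the jar path varies by Hadoop version and the script names are placeholders; the scripts just need a shebang line, the executable bit, and to read stdin / write "key<TAB>value" lines to stdout):

  bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
      -input  /user/tim/input \
      -output /user/tim/output \
      -mapper mapper.py \
      -reducer reducer.py \
      -file mapper.py -file reducer.py    # ship the scripts to the cluster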
Re: Does reduce start only after the map is completed?
On Sat, 2009-03-07 at 23:03 +0300, Mithila Nagendra wrote:

Hey all, I'm using hadoop version 0.18.3 and was wondering if the reduce phase starts only after the mapping is completed. Is it required that the map phase is 100% done, or can it be programmed in such a way that the reduce starts earlier?

As I understand it, the reducers have three phases:

1) Copy data from the mappers (shuffle)
2) Sort the data on the reducer (by key)
3) Actually run the data through the function you've defined

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/Reducer.html

The reducer tasks/processes start as soon as they are able to (I believe), and copying data and sorting happen while there may still be mappers running. Stage (3) cannot run until stage (2) is completed, which obviously cannot happen until all the mappers are complete.

In my experience, I haven't found this a major issue (especially if there are many times more mappers than machines), since the shuffle and sort stages take significant time and effort anyway.

Tim Wintle
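To make the phase boundaries concrete, here is a local, single-machine analogue of the three stages (a toy sketch only - the file names are hypothetical, and this is plain shell, not Hadoop itself):

  # 1) "copy": gather all of the map outputs
  cat map_out_part_*            > shuffled
  # 2) sort the gathered records by key
  sort -k1,1 shuffled           > sorted
  # 3) only now can the reduce function run over the sorted stream
  ./reducer < sorted            > reduce_out

Stage 3 can't start before stage 2 finishes, and stage 2 can't finish until every map output exists - which is exactly why the reduce function itself waits for the last mapper.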
Re: Off topic: web framework for high traffic service
On Wed, 2009-03-04 at 23:14 +0100, Lukáš Vlček wrote:

Sorry for the off topic question

It is very off topic.

Any ideas, best practices, book recommendations, papers, tech talk links ...

I found this a nice little book:
http://developer.yahoo.net/blog/archives/2008/11/allspaw_capacityplanning.html

and this is a nice techtalk on why the servers aren't the most important thing to worry about:
http://www.youtube.com/watch?v=BTHvs3V8DBA
Re: the question about the common pc?
On Mon, 2009-02-23 at 11:14 +0000, Steve Loughran wrote:

Dumbo provides py support under Hadoop:
http://wiki.github.com/klbostee/dumbo
https://issues.apache.org/jira/browse/HADOOP-4304

Ooh, nice - I hadn't seen dumbo. That's far cleaner than the python wrapper to streaming I'd hacked together. I'm probably going to be using hadoop more again in the near future, so I'll bookmark that - thanks Steve.

Personally I only need text-based records, so I'm fine using a wrapper around streaming.

Tim Wintle
Re: the question about the common pc?
On Fri, 2009-02-20 at 13:07 +0000, Steve Loughran wrote:

I've been doing MapReduce work over small in-memory datasets using Erlang, which works very well in such a context.

I've got some (mainly python) scripts (that will probably be run with hadoop streaming eventually) that I run over multiple cpus/cores on a single machine by opening the appropriate number of named pipes and using tee and awk to split the workload. Something like:

  mkfifo mypipe1
  mkfifo mypipe2

  # one mapper per pipe, each taking every other input line (run in the background)
  awk '0 == NR % 2' mypipe1 | ./mapper | sort > map_out_1 &
  awk '0 == (NR+1) % 2' mypipe2 | ./mapper | sort > map_out_2 &

  # feed the same stream into both pipes; tee's own stdout isn't needed
  ./get_lots_of_data | tee mypipe1 mypipe2 > /dev/null

(wait until it's done... or send a signal from the get_lots_of_data process on completion if it's a cronjob)

  sort -m map_out* | ./reducer > reduce_out

This works around the global interpreter lock in python quite nicely and doesn't need the people who write the scripts (who may not be programmers) to understand multiple processes etc - just stdin and stdout.

Tim Wintle
Re: Re:Re: the question about the common pc?
On Thu, 2009-02-19 at 13:43 +0800, 柳松 wrote:

Hadoop is designed for high performance computing equipment, but is claimed to be fit for everyday PCs.

The phrase "high performance computing equipment" makes me think of infiniband, fibre all over the place etc. Hadoop doesn't need that - it runs well on standard pc hardware, i.e. no special hardware you couldn't find in a standard pc. That doesn't mean you should run it on pcs that are being used for other things, though.

I found that hadoop ran OK on fairly old hardware - a load of old power-pc macs (running linux) churned through some jobs quickly, and I've actually run it on people's office machines during the nights (not on Windows). I did end up having to add an extra switch for the part of the network that was only 100Mbps to get the throughput, though.

Of course, ideally you would be running it on a rack of 1U servers, but that's still normally standard pc hardware.
Re: architecture diagram
I normally find the intermediate stage of copying data from the mappers to the reducers to be a significant step - but that's not over the best quality switches... The mappers and reducers work on the same boxes, close to the data.

On Wed, 2008-10-01 at 10:59 -0700, Alex Loddengaard wrote:

It really depends on your job, I think. Often reduce steps can be the bottleneck if you want a single output file (one reducer).

Hope this helps.

Alex
Re: Accessing input files from different servers
a) Do I need to install hadoop and start running HDFS (using start-dfs.sh) on all those machines where the log files are getting created? And then do a file get from the central HDFS server?

I'd install hadoop on the machine, but you don't have to start any nodes there - you can talk to a cluster running elsewhere using the command line tools to put / get data from the cluster. From what I recall, this is actually better than running nodes locally: if you put data on from a machine that is itself a datanode, the blocks will tend to be written to that local machine.

Tim
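In case it's useful, that works roughly like this (a sketch - hostnames and paths are placeholders): install the same Hadoop release on the log-producing machine, point its config at the existing cluster, and use the shell commands without starting any daemons.

  # conf/hadoop-site.xml on the client machine just needs fs.default.name
  # pointing at the cluster's namenode, e.g. namenode.example.com:9000
  bin/hadoop dfs -put /var/logs/app.log /logs/app.log    # push a local file into HDFS
  bin/hadoop dfs -get /logs/app.log /tmp/app.log         # or pull one back out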
Re: HDFS Vs KFS
I haven't used KFS, but I believe a major difference is that you can (apparently) mount KFS as a standard device under Linux, allowing you to read and write to it directly without having to re-compile the application (as far as I know that's not possible with HDFS, although the last time I installed HDFS was 0.16).

... It is definitely much newer.

On Fri, 2008-08-22 at 01:35 +0800, rae l wrote:

On Fri, Aug 22, 2008 at 12:34 AM, Wasim Bari [EMAIL PROTECTED] wrote: KFS is another distributed file system, implemented in C++. Here you can get details: http://kosmosfs.sourceforge.net/

Just from the basic information at http://sourceforge.net/projects/kosmosfs

  # Developers: 2
  # Development Status: 3 - Alpha
  # Intended Audience: Developers
  # Registered: 2007-08-30 21:05

and from the history of the subversion repository, http://kosmosfs.svn.sourceforge.net/viewvc/kosmosfs/trunk/, I think it's just not as stable or as widely used as HDFS:

* HDFS is stable and available at production level.

This may not be totally right, and I'm waiting for someone more familiar with KFS to talk about this.
Re: anybody know how to run sshd in LEOPARD
I've set hadoop up on a load of Intel Macs before - I think that sshd is what Apple calls "Remote Login" or something like that - it's a GUI option (in System Preferences, under Sharing) to allow an account to log in remotely.

Hope that helps.

On Tue, 2008-06-17 at 14:27 +0800, j.L wrote:

I want to try hadoop, but I can't run sshd when I use my MacBook (Leopard).
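If you'd rather do it from a terminal, something like this should switch it on (a hedged aside - this is the stock OS X admin command rather than anything Hadoop-specific, so check it against your OS version):

  # enable the built-in sshd ("Remote Login")
  sudo systemsetup -setremotelogin on
  # confirm it took effect
  sudo systemsetup -getremotelogin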
RE: Questions regarding configuration parameters...
I have had exactly the same problem with using the command line to cat files - they can take ages, although I don't know why. Network utilisation does not seem to be the bottleneck, though. (Running 0.15.3)

Is the slow part of the reduce while you are waiting for the map data to copy over to the reducers? I believe there was a bug prior to 0.16.0 that could leave you waiting for a long time if mappers had been too slow to respond to previous requests (even if they were completely free now).

On Thu, 2008-02-21 at 21:51 -0800, C G wrote:

My performance problems fall into 2 categories:

1. Extremely slow reduce phases - our map phases march along at impressive speed, but during reduce phases most nodes go idle... the active machines mostly clunk along at 10-30% CPU. Compare this to the map phase, where I get all grid nodes cranking away at 100% CPU. This is a vague explanation, I realise.

2. Pregnant pauses during dfs -copyToLocal and -cat operations. Frequently I'll be iterating over a list of HDFS files, cat-ing them into one file to bulk load into a database. Many times I'll see one of the copies/cats sit for anywhere from 2-5 minutes. During that time no data is transferred, all nodes are idle, and absolutely nothing is written to any of the logs. The file sizes being copied are relatively small... less than 1G each in most cases.

Both of these issues persist in 0.16.0 and definitely have me puzzled. I'm sure that I'm doing something wrong/non-optimal w/r/t slow reduce phases, but the long pauses during a dfs command line operation seem like a bug to me. Unfortunately I've not seen anybody else report this. Any thoughts/ideas most welcome...

Thanks, C G

Joydeep Sen Sarma [EMAIL PROTECTED] wrote:

The default value is 2, so you might only see 2 cores used by Hadoop per node/host.

That's 2 each for map and reduce, so theoretically one could fully utilize a 4-core box with this setting. In practice, a little bit of oversubscription (3 each on a 4-core) seems to be working out well for us (maybe overlapping some compute and IO - but mostly we are trading off higher # concurrent jobs against per-job latency).

It's unlikely that these settings are causing slowness in processing small amounts of data. Send more details - what's slow (map/shuffle/reduce)? Check cpu consumption when a map task is running, etc.

-Original Message-
From: Andy Li [mailto:[EMAIL PROTECTED]
Sent: Thu 2/21/2008 2:36 PM
To: core-user@hadoop.apache.org
Subject: Re: Questions regarding configuration parameters...

Try these 2 parameters to utilize all the cores per node/host:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
    <description>The maximum number of map tasks that will be run simultaneously by a task tracker.</description>
  </property>

  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>7</value>
    <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description>
  </property>

The default value is 2, so you might only see 2 cores used by Hadoop per node/host. If each system/machine has 4 cores (dual dual-core), then you can change them to 3.

Hope this works for you.

-Andy

On Wed, Feb 20, 2008 at 9:30 AM, C G wrote:

Hi All:

The documentation for the configuration parameters mapred.map.tasks and mapred.reduce.tasks discusses these values in terms of the number of available hosts in the grid. This description strikes me as a bit odd, given that a host could be anything from a uniprocessor to an N-way box, where N could vary from 2 to 16 or more. The documentation is also vague about computing the actual value. For example, for mapred.map.tasks the doc says "a prime number several times greater...".

I'm curious about how people are interpreting the descriptions and what values people are using. Specifically, I'm wondering if I should be using core count instead of host count to set these values. In the specific case of my system, we have 24 hosts where each host is a 4-way system (i.e. 96 cores total). For mapred.map.tasks I chose the value 173, as that is a prime number near 7*24. For mapred.reduce.tasks I chose 23, since that is a prime number close to 24. Is this what was intended?

Beyond curiosity, I'm concerned about setting these values and other configuration parameters correctly because I am pursuing some performance issues where it is taking a very long time to process small amounts of data. I am hoping that some amount of tuning will resolve the problems.

Any thoughts and insights most appreciated.

Thanks, C G
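For anyone reading along, the arithmetic behind C G's choices is easy to reproduce (a throwaway sketch; the host and core counts are the ones from the post, not general recommendations):

  HOSTS=24; CORES_PER_HOST=4
  echo "total cores:              $((HOSTS * CORES_PER_HOST))"   # 96
  echo "several times host count: $((7 * HOSTS))"                # 168 - the post picks 173, a nearby prime
  # mapred.reduce.tasks was set to 23, a prime just under the host count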
Re: Calculations involve large datasets
Have you seen Pig: http://incubator.apache.org/pig/

It generates hadoop code and is more query-like, and (as far as I remember) includes union, join, etc.

Tim

On Fri, 2008-02-22 at 09:13 -0800, Chuck Lan wrote:

Hi,

I'm currently looking into how to better scale the performance of our calculations involving large sets of financial data. We currently use a series of Oracle SQL statements to perform the calculations. It seems to me that the MapReduce algorithm may work in this scenario. However, I believe I would need to perform some denormalization of the data in order for this to work. Do I have to? Or is there a good way to implement joins within the Hadoop framework efficiently?

Thanks, Chuck
Re: Hadoop summit / workshop at Yahoo!
I would certainly appreciate being able to watch them online too, and they would help spread the word about hadoop - think of all the people who watch Google's Techtalks (am I allowed to say the G word around here?).

On Thu, 2008-02-21 at 08:34 +0100, Lukas Vlcek wrote:

An online webcast / recorded video would be really appreciated by a lot of people. Please post the content online! (Not only can you target a much greater audience, but you can significantly save on the break/lunch/beer food budget :-).

Lukas

On Wed, Feb 20, 2008 at 9:10 PM, Ajay Anand [EMAIL PROTECTED] wrote:

The registration page for the Hadoop summit is now up: http://developer.yahoo.com/hadoop/summit/

Space is limited, so please sign up early if you are interested in attending.

About the summit: Yahoo! is hosting the first summit on Apache Hadoop on March 25th in Sunnyvale. The summit is sponsored by the Computing Community Consortium (CCC) and brings together leaders from the Hadoop developer and user communities. The speakers will cover topics in the areas of extensions being developed for Hadoop, case studies of applications being built and deployed on Hadoop, and a discussion on future directions for the platform.

Agenda:
8:30-8:55 Breakfast
8:55-9:00 Welcome to Yahoo! Logistics - Ajay Anand, Yahoo!
9:00-9:30 Hadoop Overview - Doug Cutting / Eric Baldeschwieler, Yahoo!
9:30-10:00 Pig - Chris Olston, Yahoo!
10:00-10:30 JAQL - Kevin Beyer, IBM
10:30-10:45 Break
10:45-11:15 DryadLINQ - Michael Isard, Microsoft
11:15-11:45 Monitoring Hadoop using X-Trace - Andy Konwinski and Matei Zaharia, UC Berkeley
11:45-12:15 Zookeeper - Ben Reed, Yahoo!
12:15-1:15 Lunch
1:15-1:45 Hbase - Michael Stack, Powerset
1:45-2:15 Hbase App - Bryan Duxbury, Rapleaf
2:15-2:45 Hive - Joydeep Sen Sarma, Facebook
2:45-3:00 Break
3:00-3:20 Building Ground Models of Southern California - Steve Schossler, David O'Hallaron, Intel / CMU
3:20-3:40 Online search for engineering design content - Mike Haley, Autodesk
3:40-4:00 Yahoo - Webmap - Arnab Bhattacharjee, Yahoo!
4:00-4:30 Natural language Processing - Jimmy Lin, U of Maryland / Christophe Bisciglia, Google
4:30-4:45 Break
4:45-5:30 Panel on future directions
5:30-7:00 Happy hour

Look forward to seeing you there!
Ajay

-Original Message-
From: Bradford Stephens [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 20, 2008 9:17 AM
To: core-user@hadoop.apache.org
Subject: Re: Hadoop summit / workshop at Yahoo!

Hrm yes, I'd like to make a visit as well :)

On Feb 20, 2008 8:05 AM, C G [EMAIL PROTECTED] wrote:

Hey All: Is this going forward? I'd like to make plans to attend, and the sooner I can get plane tickets the happier the bean counters will be :-).

Thx, C G

Ajay Anand wrote:

Yahoo plans to host a summit / workshop on Apache Hadoop at our Sunnyvale campus on March 25th. Given the interest we are seeing from developers in a broad range of organizations, this seems like a good time to get together and brief each other on the progress that is being made. We would like to cover topics in the areas of extensions being developed for Hadoop, innovative applications being built and deployed on Hadoop, and future extensions to the platform. Some of the speakers who have already committed to present are from organizations such as IBM, Intel, Carnegie Mellon University, UC Berkeley, Facebook and Yahoo!, and we are actively recruiting other leaders in the space.

If you have an innovative application you would like to talk about, please let us know. Although there are limitations on the amount of time we have, we would love to hear from you. You can contact me at [EMAIL PROTECTED]

Thanks and looking forward to hearing about your cool apps,
Ajay
Re: URLs contain non-existant domain names in machines.jsp
I agree, this is a really annoying problem - most of the job appears to work, but unfortunately the reduce stage doesn't normally work.

Interestingly, when hadoop runs on OS X it seems to set the hostname to the IP (or sets a hostname through zeroconf). It would be useful if we could use just the IP address, though (especially for dynamic clusters where machines are being added / removed fairly often).

On Sat, 2008-02-09 at 21:11 +0530, Ben Kucinich wrote:

I made a small mistake describing my problem. There is no 192.168.1.8. There is only one machine, 192.168.101.8. I'll describe my problem again.

1. I have set up a single-node cluster on 192.168.101.8. It is an Ubuntu server.

2. There is no entry for 192.168.101.8 in the DNS server. However, the hostname is set to "hadoop" on this server. But this is only local. If I ping hadoop locally, it works. But if I ping hadoop or ping hadoop.domain.example.com from another system it doesn't work. From another system I have to ping 192.168.101.8. So, I hope I have made it clear that hadoop.domain.example.com does not exist in our DNS server.

3. domain.example.com is only a dummy example. Of course the actual name is the domain name of our organization.

4. I started hadoop on this server with the commands: bin/hadoop namenode -format; bin/start-all.sh

5. jps showed all the processes started successfully.

6. Here is my hadoop-site.xml configuration:

  <property>
    <name>fs.default.name</name>
    <value>192.168.101.8:9000</value>
    <description></description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.101.8:9001</value>
    <description></description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description></description>
  </property>
</configuration>

7. I am running a few ready examples present in hadoop-0.15.3-examples.jar, especially the wordcount one. I am also putting some files into the DFS from remote systems, such as 192.168.101.100, 192.168.101.101, etc. But these remote systems are not slaves.

8. From a remote system, I try to access http://192.168.101.8:50030/machines.jsp. It showed:

  Name                                                Host                       # running tasks   Failures   Seconds since heartbeat
  tracker_hadoop.domain.example.com:/127.0.0.1:4545   hadoop.domain.example.com  0                 0          9

Now, when I click on the tracker_hadoop.domain.example.com:/127.0.0.1:4545 link it takes me to http://hadoop.domain.example.com:50060/. But it gives an error in the browser because of the reason mentioned in point 2. I don't want it to use the hostname to form those links. I want it to use the IP address, 192.168.101.8, to form the links. Is it possible?

On Feb 9, 2008 7:49 PM, Amar Kamat [EMAIL PROTECTED] wrote:

Ben Kucinich wrote: I have Hadoop running on a master node 192.168.1.8. fs.default.name is 192.168.101.8:9000 and mapred.job.tracker is 192.168.101.8:9001.

Actually the masters are the nodes where the JobTracker and the NameNode are running, i.e. 192.168.101.8 in your case. 192.168.1.8 would be your client node, the node from where the jobs are submitted.

I am accessing its web pages on port 50030 from another machine. I visited http://192.168.101.8:50030/machines.jsp. It showed:

  Name                                                Host                       # running tasks   Failures   Seconds since heartbeat
  tracker_hadoop.domain.example.com:/127.0.0.1:4545   hadoop.domain.example.com  0                 0          9

The tracker name is tracker_<tracker-hostname>:<port>, where the hostname is obtained from the DNS nameserver passed via 'mapred.tasktracker.dns.nameserver' in conf/hadoop-default.xml. So I guess in your case hadoop.domain.example.com is the name obtained from the DNS nameserver for that node.

Can you provide more details on the xml parameters you have changed in the conf directory? Also, can you provide more details on how you are starting your hadoop?

Amar

Now, when I click on the tracker_hadoop.domain.example.com:/127.0.0.1:4545 link it takes me to http://hadoop.domain.example.com:50060/. But there is no DNS entry for hadoop in our DNS server, so I get an error in the browser. "hadoop" is just the locally set name on the master node. From my machine I can't access the master node as hadoop; I have to access it by IP address, 192.168.101.8. So this link fails. Is there a way I can set it so that it doesn't use names but only IP addresses in forming these links?
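One workaround in the meantime (a client-side hack rather than a Hadoop setting - the name and IP below are the ones from this thread) is to teach the machine running the browser how to resolve the cluster's internal hostname:

  # on the machine you browse from, map the cluster's hostname to its IP
  echo "192.168.101.8   hadoop.domain.example.com   hadoop" | sudo tee -a /etc/hosts

After that the generated tracker links resolve, even though the name still isn't in the organisation's DNS.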
RE: Starting up a larger cluster
You can set which nodes are allowed to connect in hadoop-site.xml - it's useful to be able to connect from nodes that aren't in the slaves file, so that you can put input data in directly from another machine that's not part of the cluster, or add extra machines on the fly (just make sure they're routing correctly first!). You can also run jobs directly from your workstation (without having to scp your code, ssh in, etc).

If you look through the shell scripts you should see exactly what the slaves file is used for. It's fairly easy to modify the scripts to start a single rack, so you should be able to bring up machines when you need them.

Tim

On Thu, 2008-02-07 at 12:24 -0800, Jeff Eastman wrote:

Hi Ben,

I've been down this same path recently and I think I understand your issues:

1) Yes, you need the hadoop folder to be in the same location on each node. Only the master node actually uses the slaves file, to start up DataNode and TaskTracker daemons on those nodes.

2) If you did not specify any slave nodes on your master node, then start-all did not create these processes on any nodes other than the master. This node can be accessed, and the dfs written to, from other machines as you are doing, but there is no replication since there is only one DataNode. Try running jps on your other nodes to verify this, and access the NameNode web page to see what slaves you actually have running. By adding your slave nodes to the slaves file on your master and bouncing hadoop, you should see a big difference in the size of your cluster.

Good luck, it's an adventure,
Jeff

-Original Message-
From: Ben Kucinich [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 07, 2008 10:52 AM
To: core-user@hadoop.apache.org
Subject: Starting up a larger cluster

In the Nutch wiki, I was reading this: http://wiki.apache.org/hadoop/GettingStartedWithHadoop

I have problems understanding this section:

== Starting up a larger cluster ==

Ensure that the Hadoop package is accessible from the same path on all nodes that are to be included in the cluster. If you have separated configuration from the install then ensure that the config directory is also accessible the same way. Populate the slaves file with the nodes to be included in the cluster. One node per line.

1) Does the first line mean that I have to place the hadoop folder in exactly the same location on every slave node? For example, if I put the hadoop home directory in /usr/local/ on the master node, should it be present in /usr/local/ on all the slave nodes as well?

2) I ran start-all.sh on one node (192.168.1.2) with fs.default.name as 192.168.1.2:9000 and mapred.job.tracker as 192.168.1.2:9001. So, I believe this will play the role of master node. I did not populate the slaves file with any slave nodes. But on many other systems, 192.168.1.3, 192.168.1.4, etc., I made the same settings in hadoop-site.xml, so I believe these are slave nodes. Now on the slave nodes I ran commands like bin/hadoop dfs -put dir newdir and newdir was created in the DFS. I wonder how the master node allowed the slave nodes to put the files even though I did not populate the slaves file.

Please help me with these queries since I am new to Hadoop.
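For completeness, the "populate the slaves file and bounce hadoop" step Jeff describes looks something like this on the master (a sketch - the hostnames are the ones from Ben's mail):

  # one slave per line in conf/slaves
  printf '%s\n' 192.168.1.3 192.168.1.4 > conf/slaves
  # restart so the DataNode / TaskTracker daemons get started on the slaves over ssh
  bin/stop-all.sh
  bin/start-all.sh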
Re: Namenode fails to replicate file
Doesn't the -setrep command force the replication to be increased immediately?

  ./hadoop dfs -setrep [replication] path

(I may have misunderstood.)

On Thu, 2008-02-07 at 17:05 -0800, Ted Dunning wrote:

Chris Kline reported a problem in early January where a file which had too few replicated blocks did not get replicated until a DFS restart.

I just saw a similar issue. I had a file that had a block with 1 replica (2 required) that did not get replicated. I changed the number of required replicas, but nothing caused any action. Changing the number of required replicas on other files got them to be replicated. I eventually copied the file to temp, deleted the original and moved the copy back to the original place. I was also able to read the entire file, which shows that the problem was not due to slow reporting from a down datanode.

This happened just after I had a node failure, which was why I was messing with replication at all. Since I was in the process of increasing the replication on nearly 10,000 large files, my log files are full of other stuff, but I am pretty sure that there is a bug here. This was on a relatively small cluster with 13 data nodes.

It also brings up a related issue that has come up before: there are times when you may want to increase the number of replicas of a file right NOW. I don't know of any way to force this replication. Is there such a way?
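For reference, this is how I'd use it (hedged - the extra flags may not exist in older releases, so check bin/hadoop dfs -help on your version; the path is a placeholder):

  # ask for 3 replicas of one file
  bin/hadoop dfs -setrep 3 /path/to/file
  # some versions also accept -w (wait until the target replication is reached) and -R (recurse)
  bin/hadoop dfs -setrep -w 3 /path/to/file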