Re: [help]how to stop HDFS
On 30/11/11 04:29, Nitin Khandelwal wrote: Thanks, I missed the sbin directory; I was using the normal bin directory. Thanks, Nitin On 30 November 2011 09:54, Harsh J ha...@cloudera.com wrote: Like I wrote earlier, it's in the $HADOOP_HOME/sbin directory, not the regular bin/ directory. On Wed, Nov 30, 2011 at 9:52 AM, Nitin Khandelwal nitin.khandel...@germinait.com wrote: I am using Hadoop 0.23.0. There is no hadoop-daemon.sh in the bin directory. I found the 0.23 scripts hard to set up and get working: https://issues.apache.org/jira/browse/HADOOP-7838 https://issues.apache.org/jira/browse/MAPREDUCE-3430 https://issues.apache.org/jira/browse/MAPREDUCE-3432 I'd like to see what Bigtop will offer in this area, as their test process will involve installing onto system images and walking through the scripts. The basic Hadoop tars assume your system is well configured and that you know how to do this -and how to debug problems.
Re: How is network distance for nodes calculated
On 22/11/11 21:04, Edmon Begoli wrote: I am reading Hadoop: The Definitive Guide, 2nd Edition, and I am struggling to figure out Hadoop's exact formula for network distance calculation (page 64/65). (I have my guesses, but I would like to know the exact formula.) It's implemented in org.apache.hadoop.net.NetworkTopology. It measures the number of network hops to get from one node to the other, e.g. distance 2 = n1 - switch1 - n2, etc.
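The hop-counting idea can be sketched in a few lines. This is an illustrative model only (the topology paths and node names are invented), not the actual NetworkTopology implementation:

```python
def network_distance(a, b):
    """Hops between two nodes given their topology paths,
    e.g. "/d1/rack1/node1": walk up to the closest common
    ancestor and count the edges on both sides."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    while (common < min(len(pa), len(pb))
           and pa[common] == pb[common]):
        common += 1
    return (len(pa) - common) + (len(pb) - common)

print(network_distance("/d1/rack1/n1", "/d1/rack1/n1"))  # same node: 0
print(network_distance("/d1/rack1/n1", "/d1/rack1/n2"))  # same rack: 2
print(network_distance("/d1/rack1/n1", "/d1/rack2/n3"))  # other rack: 4
```

This reproduces the familiar HDFS values: 0 for the same node, 2 within a rack, 4 across racks in the same data centre.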
Re: Adding a new platform support to Hadoop
On 17/11/11 15:02, Amir Sanjar wrote: Is there any specific development, build, and packaging guideline for adding support for a new hardware platform, in this case PPC64, to Hadoop? Best Regards Amir Sanjar Linux System Management Architect and Lead IBM Senior Software Engineer Phone# 512-286-8393 Fax# 512-838-8858 This is something to take up on the -dev lists, not the user lists, especially common-...@hadoop.apache.org. One problem with any new platform is the native code: nobody but you is going to build or test it. The only JVM currently recommended is the Sun JVM, so again, you will be the one testing there. This means you are going to have to be active in testing releases against your target platform; otherwise it will languish in the "not really meant to be used in production" category of things. The Apache releases are meant to be source distributions anyway (the binary artifacts are just an extra), but you will need to work with the dev team to make sure the native libraries build properly.
Re: Cannot access JobTracker GUI (port 50030) via web browser while running on Amazon EC2
On 24/10/11 23:46, Mark question wrote: Thank you, I'll try it. Mark On Mon, Oct 24, 2011 at 1:50 PM, Sameer Farooqui cassandral...@gmail.com wrote: Mark, We figured it out. It's an issue with RedHat's iptables. You have to open up those ports: vim /etc/sysconfig/iptables Of course, if you open up the cluster ports to everyone, that means everyone with an IPv4 address. SSH tunnelling is a better tactic.
Re: execute hadoop job from remote web application
On 18/10/11 17:56, Harsh J wrote: Oleg, It will pack up the jar that contains the class specified by setJarByClass into its submission jar and send it up. That's the function of that particular API method. So, your deduction is almost right there :) On Tue, Oct 18, 2011 at 10:20 PM, Oleg Ruchovets oruchov...@gmail.com wrote: So you mean that in case I am going to submit a job remotely and my_hadoop_job.jar is on the classpath of my web application, it will submit the job with my_hadoop_job.jar to the remote hadoop machine (cluster)? There's also the problem of waiting for your work to finish. If you want to see something complicated that does everything but the JAR upload, I have some code here that listens for events coming out of the job and so builds up a history of what is happening. It also does better preflight checking of source and dest data directories: http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/mapreduce/submitter/SubmitterImpl.java
Re: automatic node discovery
On 18/10/11 10:48, Petru Dimulescu wrote: Hello, I wonder how you guys see the problem of automatic node discovery: having, for instance, a couple of hadoops, with no configuration explicitly set whatsoever, simply discover each other and work together, like GridGain does: just fire up two instances of the product, on the same machine or on different machines in the same LAN, and they will use multicast or whatever to discover each other. You can use techniques like Bonjour to have Hadoop services register themselves in DNS and locate each other that way, but things only need to discover the NN and JT and report in. ...and to be part of a self-discovered topology. Topology inference is an interesting problem. Something purely for diagnostics could be useful. Of course, if you have special network requirements you should be able to specify undiscoverable nodes by IP or name, but often grids are installed on LANs and it should really be simpler. In a production system I'd have a private switch and isolate things for bandwidth and security; this is why auto-configuration is generally neglected. If it were to be added, it would go via ZooKeeper, leaving only the ZooKeeper discovery problem. You can't rely on DNS or multicast IP here, as they don't always work in virtualised environments. Namenodes are a bit different, they should use safer machines; I'm basically talking about datanodes here, but still I wonder how hard can it be to have self-assigned namenodes, maybe replicated automatically on several machines, unless one specific namenode is explicitly set via XML configuration. I wouldn't touch dynamic namenodes: you really need fixed NNs and 2NNs, and as automatic replication isn't there it's a non-issue. With fixed NN and JT entries in the DNS table, anything can come up in the LAN and talk to them, unless you set up the master nodes with lists of things you trust. Also, the passwordless ssh thing is so awkward.
If you have a network of hadoops that mutually discover each other there is really no need for this passwordless ssh requirement. This is more of a system administrator aspect; if sysadmins want to automatically deploy or start a program on 5000 machines they often have the tools and skills to do that, so it should not be a requirement. It's not a requirement; there are other ways to deploy. Large clusters tend to use cluster management tooling that keeps the OS images consistent, or you can use more devops-centric tooling (including Apache Whirr) to roll things out.
Re: execute hadoop job from remote web application
On 18/10/11 11:40, Oleg Ruchovets wrote: Hi, what is the way to execute a hadoop job on a remote cluster? I want to execute my hadoop job from a remote web application, but I didn't find any hadoop client (remote API) to do it. Please advise. Oleg The Job class lets you build up and submit jobs from any Java process that has RPC access to the JobTracker.
Re: hadoop knowledge gaining
On 07/10/11 15:25, Jignesh Patel wrote: Guys, I am able to deploy the first program, word count, using hadoop. I am interested in exploring more about hadoop and HBase and don't know the best way to grasp both of them. I have Hadoop in Action but it has the older API. Actually, the API covered in the 2nd edition is pretty much the one in widest use. The newer API is better, but is only complete in Hadoop 0.21 and later, which aren't yet in wide use. I do also have the HBase Definitive Guide, which I have not started exploring. Think of a problem, get some data, go through the books. Learning more about statistics and data mining is what you really need, more than just the Hadoop APIs. -steve
Re: FileSystem closed
On 29/09/2011 18:02, Joey Echeverria wrote: Do you close your FileSystem instances at all? IIRC, the FileSystem instance you use is a singleton, and if you close it once, it's closed for everybody. My guess is you close it in your cleanup method and you have JVM reuse turned on. I've hit this in the past. In 0.21+ you can ask for a new instance explicitly. For 0.20.20x, set fs.hdfs.impl.disable.cache to true in the conf, and new instances don't get cached.
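For the 0.20.20x route, the property named above goes into the configuration, e.g. in hdfs-site.xml; a sketch:

```xml
<!-- Stop FileSystem.get() handing back the shared cached instance;
     each call then returns a fresh, independently closeable one. -->
<property>
  <name>fs.hdfs.impl.disable.cache</name>
  <value>true</value>
</property>
```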
Re: Hadoop performance benchmarking with TestDFSIO
On 28/09/11 22:45, Sameer Farooqui wrote: Hi everyone, I'm looking for some recommendations for how to get our Hadoop cluster to do faster I/O. Currently, our lab cluster is 8 worker nodes and 1 master node (with NameNode and JobTracker). Each worker node has:
- 48 GB RAM
- 16 processors (Intel Xeon E5630 @ 2.53 GHz)
- 1 Gb Ethernet connection
Due to company policy, we have to keep the HDFS storage on a disk array. Our SAS-connected array is capable of 6 Gb/s (768 MB/s) for each of the 8 hosts. So, theoretically, we should be able to get a max of 6 GB/s of simultaneous reads across the 8 nodes if we benchmark it. You're missing the point of Hadoop there: you will end up getting the bandwidth of the HDD most likely to fail next, copy replication is overkill, and you will reach limits on scale, both technical (SAN scalability) and financial. Our disk array is presenting each of the 8 nodes with a 21 TB LUN. The LUN is RAID-5 across 12 disks on the array. That LUN is partitioned on the server into 6 different devices like this: The file system type is ext3. Mount with noatime. So, when we run TestDFSIO, here are the results:
*++ Write ++*
hadoop jar /usr/lib/hadoop/hadoop-test-0.20.2-CDH3B4.jar TestDFSIO -write -nrFiles 80 -fileSize 1
11/09/27 18:54:53 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
11/09/27 18:54:53 INFO fs.TestDFSIO:            Date & time: Tue Sep 27 18:54:53 EDT 2011
11/09/27 18:54:53 INFO fs.TestDFSIO:        Number of files: 80
11/09/27 18:54:53 INFO fs.TestDFSIO: Total MBytes processed: 80
11/09/27 18:54:53 INFO fs.TestDFSIO:      Throughput mb/sec: 8.2742240008678
11/09/27 18:54:53 INFO fs.TestDFSIO: Average IO rate mb/sec: 8.288116455078125
11/09/27 18:54:53 INFO fs.TestDFSIO:  IO rate std deviation: 0.3435565217052116
11/09/27 18:54:53 INFO fs.TestDFSIO:     Test exec time sec: 1427.856
So, throughput across all 8 nodes is 8.27 * 80 = 661 MB per second.
*++ Read ++*
hadoop jar /usr/lib/hadoop/hadoop-test-0.20.2-CDH3B4.jar TestDFSIO -read -nrFiles 80 -fileSize 1
11/09/27 19:43:12 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
11/09/27 19:43:12 INFO fs.TestDFSIO:            Date & time: Tue Sep 27 19:43:12 EDT 2011
11/09/27 19:43:12 INFO fs.TestDFSIO:        Number of files: 80
11/09/27 19:43:12 INFO fs.TestDFSIO: Total MBytes processed: 80
11/09/27 19:43:12 INFO fs.TestDFSIO:      Throughput mb/sec: 5.854318503905489
11/09/27 19:43:12 INFO fs.TestDFSIO: Average IO rate mb/sec: 5.96372652053833
11/09/27 19:43:12 INFO fs.TestDFSIO:  IO rate std deviation: 0.9885505979030621
11/09/27 19:43:12 INFO fs.TestDFSIO:     Test exec time sec: 2055.465
So, throughput across all 8 nodes is 5.85 * 80 = 468 MB per second.
*Question 1:* Why are the reads and writes so much slower than expected? Any suggestions about what can be changed? I understand that RAID-5 backed disks are an unorthodox configuration for HDFS, but has anybody successfully done this? If so, what kind of results did you see? Also, we detached the 8 nodes from the disk array and connected each of them to 6 local hard drives for testing (w/ ext4 file system). Then we ran the same read TestDFSIO and saw this:
11/09/26 20:24:09 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
11/09/26 20:24:09 INFO fs.TestDFSIO:            Date & time: Mon Sep 26 20:24:09 EDT 2011
11/09/26 20:24:09 INFO fs.TestDFSIO:        Number of files: 80
11/09/26 20:24:09 INFO fs.TestDFSIO: Total MBytes processed: 80
11/09/26 20:24:09 INFO fs.TestDFSIO:      Throughput mb/sec: 13.065623285187982
11/09/26 20:24:09 INFO fs.TestDFSIO: Average IO rate mb/sec: 15.160531997680664
11/09/26 20:24:09 INFO fs.TestDFSIO:  IO rate std deviation: 8.000530562022949
11/09/26 20:24:09 INFO fs.TestDFSIO:     Test exec time sec: 1123.447
So, with local disks, reads are about 1 GB per second across the 8 nodes. Much faster! Much lower cost per TB too. Orders of magnitude lower.
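The arithmetic used above to turn TestDFSIO's per-task figure into a cluster-wide one is just a multiplication; a quick sanity check using the numbers from the runs above:

```python
def aggregate_mb_per_sec(per_task_mb_s, nr_files):
    # TestDFSIO reports throughput per task; with -nrFiles 80 tasks
    # running concurrently, the cluster-wide figure is roughly the
    # per-task throughput times the task count.
    return per_task_mb_s * nr_files

print(aggregate_mb_per_sec(8.27, 80))   # SAN write throughput
print(aggregate_mb_per_sec(5.85, 80))   # SAN read throughput
print(aggregate_mb_per_sec(13.07, 80))  # local-disk read throughput
```

This gives roughly 661, 468 and 1045 MB/s respectively, matching the figures quoted in the thread.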
With 6 local disks, writes performed the same though:
11/09/26 19:49:58 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
11/09/26 19:49:58 INFO fs.TestDFSIO:            Date & time: Mon Sep 26 19:49:58 EDT 2011
11/09/26 19:49:58 INFO fs.TestDFSIO:        Number of files: 80
11/09/26 19:49:58 INFO fs.TestDFSIO: Total MBytes processed: 80
11/09/26 19:49:58 INFO fs.TestDFSIO:      Throughput mb/sec: 8.573949802610528
11/09/26 19:49:58 INFO fs.TestDFSIO: Average IO rate mb/sec: 8.588902473449707
11/09/26 19:49:58 INFO fs.TestDFSIO:  IO rate std deviation: 0.3639466752546032
11/09/26 19:49:58 INFO fs.TestDFSIO:     Test exec time sec: 1383.734
Write throughput across the cluster was 685 MB per second. Writes get streamed to multiple HDFS nodes for redundancy: you've got the bandwidth plus network overhead, and 3x the data. Options:
-stop using HDFS on the SAN; it's the wrong approach. Mount the SAN directly and use file:// URLs; let the SAN do the networking and redundancy.
-buy some local HDDs, at least for all the temp data: logs, overspill.
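The "3x the data" point explains why local-disk writes didn't improve over the SAN: a back-of-envelope model (the replication factor default and the raw bandwidth figure are assumptions for illustration, not measurements):

```python
def effective_write_mb_s(raw_disk_mb_s, replication=3):
    # Every byte a client writes is streamed through a pipeline to
    # `replication` datanodes, so the disk subsystem absorbs
    # replication-times the client payload; the apparent client
    # write bandwidth drops accordingly (network overhead ignored).
    return raw_disk_mb_s / replication

# ~1000 MB/s of raw bandwidth (roughly what the local-disk read
# test measured) supports only about a third of that in
# replicated client writes:
print(effective_write_mb_s(1000.0))
```

Under this crude model, the ~1 GB/s read cluster would top out near 333 MB/s of replicated writes before counting network costs, so write benchmarks sit far below read benchmarks on the same hardware.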
Re: Is SAN storage is a good option for Hadoop ?
On 29/09/11 13:28, Brian Bockelman wrote: On Sep 29, 2011, at 1:50 AM, praveenesh kumar wrote: Hi, I want to know: can we use SAN storage for a Hadoop cluster setup? If yes, what are the best practices? Is it a good idea, considering that the underlying power of Hadoop is co-locating the processing power (CPU) with the data storage, so that storage must be local to be effective? *But also, is it better to say "local is better" in the situation where I have a single local 5400 RPM IDE drive, which would be dramatically slower than SAN storage striped across many drives spinning at 10k RPM and accessed via fibre channel?* Hi Praveenesh, Two things: 1) If the option is a single 5400 RPM IDE drive (you can still buy those?) versus a high-end SAN, the high-end SAN is going to win. That's often a false comparison: the question is usually "what can I buy for $50k?". In that case (setting aside organizational politics), you can buy more spindles in the traditional Hadoop setup than for the SAN. - Also, if you're latency limited, you're likely working against yourself. The best thing I ever did for my organization was make our software work just as well with 100ms latency as with 1ms latency. 2) As Paul pointed out, you have to ask yourself whether the SAN is shared or dedicated. Many SANs don't have the ability to strongly partition workloads between users. Brian One more: a SAN is a SPOF. [Gray05] includes the impact of a SAN outage on MS TerraServer, while [Jiang08] provides evidence that entry-level FibreChannel storage is less reliable than SATA due to interconnects. Anyone who criticises the NameNode for being a SPOF and relies on a SAN instead is missing something obvious. [Gray05] Empirical Measurements of Disk Failure Rates and Error Rates. [Jiang08] Are disks the dominant contributor for storage failures?
Re: difference between development and production platform???
On 28/09/11 04:19, Hamedani, Masoud wrote: Special thanks for your help Arko, You mean in Hadoop, the NameNode, DataNodes, JobTracker, TaskTrackers and all the clusters should be deployed on Linux machines??? We have lots of data (on Windows OS) and code (written in C#) for data mining; we want to use Hadoop and make a connection between our existing systems and programs and it. As you mentioned, we should move all of our data to Linux systems, execute the existing C# code on Linux, and only use Windows for development, same as before. Am I right? What is really meant is "nobody runs Hadoop at scale on Windows". Specifically: -there's an expectation that there is a Unix API you can exec -some of the operations (e.g. how programs are exec()'d) are optimised for Linux -everyone tests on 50+ node clusters on Linux. Why Linux? Stable, low cost. And you can install it on your laptop/desktop and develop there too. Because everyone uses Linux (or possibly a genuine Unix system like Solaris), problems encountered in real systems get found on Linux and fixed. If you want to run a production Hadoop cluster on Windows, you are free to do so. Just be aware that you may be the first person to do so at scale, so you get to find problems first, you get to file the bugs -and because you are the only person with these problems and the ability to replicate them- you get to fix them. Nobody is going to say "oh, this patch is for Windows-only use, we will reject it" -at least provided it doesn't have adverse effects on Linux/Unix. It's just that nobody else publicly runs Hadoop on Windows. A key first step will be cross-compiling all the native code to Windows, which on 0.23+ also means protocol buffers. Enjoy. Where you will find problems is that even on Win64, Hadoop can't directly load or run C# apps or anything else written to compile against their managed runtime (I forget its name). You will have to bridge via streaming, and take a performance hit.
You could also try running the C# code under Mono on Linux; it may or may not work. Again, you get to find out and fix the problems -this time with the Mono project. -Steve
Re: hadoop question using VMWARE
On 28/09/11 08:37, N Keywal wrote: For example: - It's adding two layers (Windows + Linux) that can both fail, especially under heavy workload (and hadoop is built to use all the resources available). They will need to be managed as well (software upgrades, hardware support...); it's an extra cost. - These two layers will randomly use the different resources (HDD, CPU, network), making issues and performance analysis more complicated. - There will be a real performance impact. It depends on what you do and how Windows/VMware is configured, but on my non-optimized laptop I lose more than 50%. VMware claims 15% max, but that's without Windows (using direct ESX). Where you take a big hit is in disk IO, as what your OS thinks is a disk with sequentially stored files is just a single file in the host OS that may be scattered round the real HDD. Disk IO goes through too many layers. It's often faster to NFS-mount the real HDD. For compute-intensive work the performance hit isn't so bad, at least provided you don't swap. - Last time I checked (a few months ago), vmware was not able to use all the core memory of medium-sized servers. Same with VirtualBox, which I like because it is lighter weight. I use VMs because the infrastructure provides it; things like Elastic MapReduce from AWS also offer it. Your code may be slower, but what you get is the ability to bring up clusters on a pay-per-hour basis, and the ability to vary the number of machines based on the workload/execution plan. If you can compensate for the IO hit by renting four more servers, you may still come out ahead. http://www.slideshare.net/steve_l/farming-hadoop-inthecloud
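The "rent four more servers" trade-off is easy to model. The node counts and per-node throughput below are invented for illustration; the 15% and 50% overheads are the figures quoted in the thread:

```python
def cluster_throughput(nodes, per_node_mb_s, vm_overhead=0.0):
    # Aggregate throughput scales with node count, discounted by the
    # fraction of per-node performance lost to virtualisation.
    return nodes * per_node_mb_s * (1.0 - vm_overhead)

physical = cluster_throughput(8, 100)        # 8 physical nodes
vm_good = cluster_throughput(12, 100, 0.15)  # 12 VMs, VMware's claimed 15% hit
vm_bad = cluster_throughput(12, 100, 0.5)    # 12 VMs, the 50% seen on a laptop
print(physical, vm_good, vm_bad)
```

At a 15% overhead, twelve VMs beat eight physical nodes; at a 50% overhead they lose, so whether renting extra machines pays off depends entirely on the real per-node penalty.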
Re: Environment consideration for a research on scheduling
On 23/09/11 16:09, GOEKE, MATTHEW (AG/1000) wrote: If you are starting from scratch with no prior Hadoop install experience, I would configure stand-alone, migrate to pseudo-distributed, and then to fully distributed, verifying functionality at each step by doing a simple word count run. Also, if you don't mind using the CDH distribution, then SCM / their rpms will greatly simplify both the bin installs and the user creation. Your VM route will most likely work, but I imagine the amount of hiccups during migration from that to the real cluster will not make it worth your time. Matt -Original Message- From: Merto Mertek [mailto:masmer...@gmail.com] Sent: Friday, September 23, 2011 10:00 AM To: common-user@hadoop.apache.org Subject: Environment consideration for a research on scheduling Hi, in the first phase we are planning to establish a small cluster with a few commodity computers (each 1GB, 200GB, ...). The cluster would run Ubuntu Server 10.10 and a Hadoop build from the branch 0.20.204 (I had some issues with version 0.20.203 with missing libraries: http://hadoop-common.472056.n3.nabble.com/Development-enviroment-problems-eclipse-hadoop-0-20-203-td3186022.html#a3188567). Would you suggest any other version? I wouldn't rush to put Ubuntu 10.x on; it makes a good desktop, but RHEL and CentOS are the platforms of choice on the server side. In the second phase we are planning to analyse, test and modify some of the Hadoop schedulers. The main schedulers used by Y! and FB are fairly tuned for their workloads, and not apparently something you'd want to play with. There is at least one other scheduler in the contrib/ dir to play with. The other thing about scheduling is that you may have a faster development cycle if, instead of working on a real cluster, you simulate it at multiples of real time, using stats collected from your own workload by way of the gridmix2 tools. I've never done scheduling work, but think there's some stuff there to do that.
If not, it's a possible contribution. Be aware that the changes in 0.23+ will change resource scheduling; that may be a better place to do development, with a plan to deploy in 2012. Oh, and get on the mapreduce lists, especially the -dev list, to discuss issues. The information contained in this email may be subject to the export control laws and regulations of the United States, potentially including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of Treasury, Office of Foreign Asset Controls (OFAC). As a recipient of this information you are obligated to comply with all applicable U.S. export laws and regulations. I have no idea what that means, but I am not convinced that reading an email forces me to comply with a different country's rules.
Re: Can we replace namenode machine with some other machine ?
On 22/09/11 05:42, praveenesh kumar wrote: Hi all, Can we replace our namenode machine later with some other machine? Actually, I got a new server machine in my cluster and now I want to make this machine my new namenode and jobtracker node. Also, does the NameNode/JobTracker machine's configuration need to be better than the datanodes'/tasktrackers'? 1. I'd give it lots of RAM - it holds data about many files, and you want to avoid swapping, etc. 2. I'd make sure the disks are RAID5, with some NFS-mounted FS that the secondary namenode can talk to. This avoids the risk of loss of the index, which, if it happens, renders your filesystem worthless. If I were really paranoid I'd have twin RAID controllers with separate connections to disk arrays in separate racks, as [Jiang2008] shows that interconnect problems on disk arrays can be more common than HDD failures. 3. If your central switches are at 10 GbE, consider getting a 10 GbE NIC and hooking it up directly -this stops the network being the bottleneck, though it does mean the server can have a lot more packets hitting it, so putting more load on it. 4. Leave space for a second CPU, and time for GC tuning. JTs are less important; they need RAM but use HDFS for storage. If your cluster is small, the NN and JT can run on the same machine. If you do this, set up DNS to have two hostnames point to the same network address. Then, if you ever split them off, everyone whose bookmark says http://jobtracker won't notice. Either way: the NN and the JT are the machines whose availability you care about. The rest is just a source of statistics you can look at later. -Steve [Jiang2008] Are disks the dominant contributor for storage failures?: A comprehensive study of storage subsystem failure characteristics. ACM Transactions on Storage.
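The two-hostnames trick is just a DNS entry (or, for a toy setup, an /etc/hosts sketch); the names and address below are invented for illustration:

```
# Both names resolve to the same box today; if the NN and JT are
# ever split onto separate machines, only this mapping changes and
# nobody's http://jobtracker bookmark breaks.
10.0.0.10   namenode
10.0.0.10   jobtracker
```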
Re: Can we replace namenode machine with some other machine ?
On 22/09/11 17:13, Michael Segel wrote: I agree w Steve except on one thing... RAID 5 bad. RAID 10 (1+0) good. Sorry, this goes back to my RDBMS days, where RAID 5 will kill your performance and worse... Sorry, I should have said RAID !=5. The main thing is you don't want the NN data lost. Ever.
Re: risks of using Hadoop
On 20/09/11 22:52, Michael Segel wrote: PS... There's this junction box in your machine room that has this very large on/off switch. If pulled down, it will cut power to your cluster and you will lose everything. Now would you consider this a risk? Sure. But is it something you should really lose sleep over? Do you understand that there are risks and there are improbable risks? We follow the @devops_borat Ops book and have a post-it note on the switch saying "not a light switch".
Re: risks of using Hadoop
On 21/09/11 11:30, Dieter Plaetinck wrote: On Wed, 21 Sep 2011 11:21:01 +0100 Steve Loughran ste...@apache.org wrote: On 20/09/11 22:52, Michael Segel wrote: PS... There's this junction box in your machine room that has this very large on/off switch. If pulled down, it will cut power to your cluster and you will lose everything. Now would you consider this a risk? Sure. But is it something you should really lose sleep over? Do you understand that there are risks and there are improbable risks? We follow the @devops_borat Ops book and have a post-it note on the switch saying "not a light switch" :D Also we have a backup 4-port 1 GbE Linksys router for when the main switch fails. The biggest issue these days is that since we switched the backplane to Ethernet-over-Powerline, a power outage leads to network partitioning even when the racks have UPSes. See also http://twitter.com/#!/DEVOPS_BORAT
Re: risks of using Hadoop
On 18/09/11 02:32, Tom Deutsch wrote: Not trying to give you a hard time Brian - we just have different users/customers/expectations on us. Tom, I suggest you read "Apache Hadoop Goes Realtime at Facebook" and consider how you could adopt those features -and how to contribute them back to the ASF. Certainly I'd like to see their subcluster placement policy in the codebase. For anyone doing batch work, I'd take the NN outage problem as an intermittent event that happens less often than OS upgrades -it's just something you should expect and test before your system goes live. Make sure your secondary NN is working and you know how to handle a restart. Regarding the original discussion, 10-15 nodes is enough machines that the loss of one or two should be sustainable; with smaller clusters you get less benefit from replication (as each failing server loses a higher percentage of the blocks), but the probability of server failure is much lower. You can fit everything into a single rack with a ToR switch running at 1 gigabit through the rack -10 gigabit if the servers have it on the mainboards and you can afford the difference, as it may mitigate some of the impact of server loss. Do think about expansion here; at least have enough ports for the entire rack, and the option of multiple 10 GbE interconnects to any other racks you may add later. Single-switch clusters don't need any rack topology scripts, so you can skip one bit of setup. As everyone says, you need to worry about namenode failure. You could put the secondary namenode on the same machine as the job tracker, and have them both write to NFS-mounted filesystems. The trick in a small cluster is to use some (more than one) of the workers' disk space as those NFS mount points. Risks:
-security: you may want to isolate the cluster from the rest of your intranet.
-security: if I could run code on your cluster I could probably get at various server ports and read what I wanted.
As all MR jobs are running code in the cluster, you have to trust people coding at the Java layer. If it's Pig or Hive jobs, life is simpler.
-data integrity: get ECC memory, monitor disks aggressively, and take them offline if you think they are playing up. Run SATA VERIFY commands against the machines (in the sg3_utils package).
-DoS to the rest of the intranet: bad code on a medium-to-large cluster can overload the rest of your network simply by making too many DNS requests, let alone lots of remote HTTP operations. This should not be a risk for smaller clusters.
-developers writing code that doesn't scale: you don't have to worry about this in a small cluster, but as you scale you will find that use of choke points (JT counters, shared remote filestores) may cause problems. Even excessive logging can be trouble.
-a new feature for Ops: more monitoring to learn about. While NN uptime matters, the worker nodes are less important. Don't have the team sprint to deal with a lost worker node. That said, for a small cluster I'd keep a couple of replacement disks around, as the loss of a disk has more impact on total capacity.
I've been looking at Hadoop data integrity, and now have a todo list based on my findings: http://www.slideshare.net/steve_l/did-you-reallywantthatdata Because your cluster is small, you won't overload your NN even with small blocks, or the JT with jobs finishing too fast for it to keep up, so you can use smaller blocks, which should improve data integrity. Otherwise, the main risk, as people note, is unrealistic expectations. Hadoop is not a replacement for a database with ACID transaction requirements; even reads are slower than indexed tables. What it is good for is very-low-cost storage of large amounts of low-value data, and as a platform for the layers above. -Steve
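The earlier point about small clusters getting less benefit from replication can be made concrete with a toy model. It assumes replicas are spread uniformly at random, which real HDFS rack-aware placement is not, so treat it as illustration only:

```python
from math import comb

def fraction_of_blocks_lost(nodes, failed, replication=3):
    """Expected fraction of blocks whose every replica sat on one of
    the failed nodes, under uniform random replica placement."""
    if failed < replication:
        return 0.0
    return comb(failed, replication) / comb(nodes, replication)

# With replication 3, one or two failures in a 15-node cluster lose
# no blocks outright (each dead node still takes ~1/15th of the
# replicas and capacity with it), but three simultaneous failures
# put a small fraction of blocks beyond recovery:
print(fraction_of_blocks_lost(15, 1))
print(fraction_of_blocks_lost(15, 3))
```

The same three simultaneous failures in a 100-node cluster would hit a far smaller fraction of blocks, which is the sense in which bigger clusters get more out of replication.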
Re: risks of using Hadoop
On 18/09/11 03:37, Michael Segel wrote: 2) Data loss. You can mitigate this as well. Do I need to go through all of the options and DR/BCP planning? Sure, there's always a chance that you have some luser who does something brain-dead. This is true of all databases and systems. (I know I can probably recount some of IBM's Informix and DB2 data loss issues. But that's a topic for another time. ;-) That raises one more point: once your cluster grows, it's hard to back it up except to other Hadoop clusters. If you want to survive loss-of-site events (power, communications), then you'll need to exchange copies of the high-value data between physically remote clusters. But you may not need to replicate at 3x remotely, because it's only backup data. -steve
Re: Hadoop with Netapp
On 25/08/11 08:20, Sagar Shukla wrote: Hi Hakan, Please find my comments inline in blue: -Original Message- From: Hakan İlter [mailto:hakanil...@gmail.com] Sent: Thursday, August 25, 2011 12:28 PM To: common-user@hadoop.apache.org Subject: Hadoop with Netapp Hi everyone, We are going to create a new Hadoop cluster in our company; I have to get some advice from you: 1. Has anyone stored all their Hadoop data not on local disks but on NetApp or another storage system? Do we have to store data on local disks, and if so, is it because of performance issues? sagar: Yes, we were using SAN LUNs for storing Hadoop data. SAN works faster than NAS in terms of performance while writing the data to the storage. Also, SAN LUNs can be auto-mounted while booting up the system. Silly question: why? SANs are SPOFs (Gray & van Ingen, MS, 2005: the SAN was responsible for 11% of TerraServer downtime). Was it because you had the rack and wanted to run Hadoop, or did you want a more agile cluster? Because it's going to increase your cost of storage dramatically, which means you pay more per TB, or end up with fewer TB of storage. I wouldn't go this way for a dedicated Hadoop cluster. For a multi-use cluster, it's a different story. 2. What do you think about running Hadoop nodes in virtual (VMware) servers? sagar: If high-speed computing is not a requirement for you, then Hadoop nodes in a VM environment could be a good option, but one other slight drawback is that when a VM crashes, the in-memory data is gone. Hadoop takes care of some amount of failover, but there is some amount of risk involved and it requires good HA building capabilities. I do it for dev and test work, and for isolated clusters in a shared environment. -for CPU-bound stuff it actually works quite well, as there's no significant overhead -for HDD access, reading from the FS, writing to the FS, and storing transient spill data, you take a tangible performance hit.
That's OK if you can afford to wait, or to rent a few extra CPUs -and your block size is such that those extra servers can help out- which may be in the map phase more than the reduce phase. Some Hadoop-ish projects -Stratosphere from TU Berlin in particular- are designed for VM infrastructure, so they come up with execution plans to use VMs efficiently. -steve
Re: Turn off all Hadoop logs?
On 29/08/11 20:31, Frank Astier wrote: Is it possible to turn off all the Hadoop logs simultaneously? In my unit tests, I don’t want to see the myriad “INFO” logs spewed out by various Hadoop components. I’m using: ((Log4JLogger) DataNode.LOG).getLogger().setLevel(Level.OFF); ((Log4JLogger) LeaseManager.LOG).getLogger().setLevel(Level.OFF); ((Log4JLogger) FSNamesystem.LOG).getLogger().setLevel(Level.OFF); ((Log4JLogger) DFSClient.LOG).getLogger().setLevel(Level.OFF); ((Log4JLogger) Storage.LOG).getLogger().setLevel(Level.OFF); But I’m still missing some loggers...

You need a log4j.properties file on the classpath that doesn't log so much. I do this by:
-removing /log4j.properties from the Hadoop JARs in our (private) JAR repository
-having custom log4j.properties files in the test/ source trees
You could also start JUnit with the right system property to point it at a custom log4j file. I forget what that property is.
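A minimal sketch of such a quiet log4j.properties, assuming log4j 1.x (the appender details are illustrative, not taken from any Hadoop release):

```properties
# Quiet unit-test logging: turning off the root logger silences every
# Hadoop component at once, replacing the per-class setLevel() calls.
log4j.rootLogger=OFF

# Alternatively, keep genuine problems visible by logging WARN+ to stderr:
# log4j.rootLogger=WARN, stderr
# log4j.appender.stderr=org.apache.log4j.ConsoleAppender
# log4j.appender.stderr.Target=System.err
# log4j.appender.stderr.layout=org.apache.log4j.PatternLayout
# log4j.appender.stderr.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
```

If memory serves, the JUnit-side switch is the log4j.configuration system property, e.g. -Dlog4j.configuration=file:/path/to/quiet-log4j.properties, but check the log4j manual before relying on that.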
Re: Namenode Scalability
On 17/08/11 08:48, Dieter Plaetinck wrote: Hi, On Wed, 10 Aug 2011 13:26:18 -0500 Michel Segel michael_se...@hotmail.com wrote: This sounds more like a homework assignment than a real world problem. Why? just wondering.

The question proposed a data rate comparable with Yahoo!, Google and Facebook -yet it was ingress rather than egress, which was even more unusual. You'd have to be doing a web-scale search engine to need that data rate -and if you were doing that, you'd need to know a lot more about how Hadoop works (i.e. the limited role of the NN). You'd also have to have addressed the entire network infrastructure, the cost of the work on your external systems, DNS load, power budget. Oh, and the fact that unless you were processing and discarding those PB/day at the rate of ingress, you'd need to add a new Hadoop cluster at a rate of 1 cluster/month, which is not only expensive -I don't think datacentre construction rates could handle it, even if your server vendor had set up a construction/test pipeline to ship down an assembled and tested containerised cluster every few weeks (which we can do, incidentally :)

I guess people don't race cars against trains or have two trains traveling in different directions anymore... :-) huh? Different homework questions.
Re: hadoop cluster mode not starting up
On 16/08/11 11:02, A Df wrote: Hello All: I used a combination of tutorials to set up hadoop, but most seem to use either an old version of hadoop or only 2 machines for the cluster, which isn't really a cluster. Does anyone know of a good tutorial which sets up multiple nodes for a cluster? I already looked at the Apache website but it does not give sample values for the conf files. Also, each set of tutorials seems to have a different set of parameters which they indicate should be changed, so now it's a bit confusing. For example, my configuration sets a dedicated namenode, secondary namenode and 8 slave nodes, but when I run the start command it gives an error. Should I install hadoop in my user directory or on the root? I have it in my directory, but all the nodes have a central file system as opposed to a distributed one, so whatever I do on one node in my user folder affects all the others; how do I set the paths to ensure that it uses a distributed system? For the errors below, I checked the directories and the files are there. I am not sure what went wrong and how to set the conf to not have a central file system. Thank you.
Error message:
CODE
w1153435@n51:~/hadoop-0.20.2_cluster bin/start-dfs.sh
bin/start-dfs.sh: line 28: /w1153435/hadoop-0.20.2_cluster/bin/hadoop-config.sh: No such file or directory
bin/start-dfs.sh: line 50: /w1153435/hadoop-0.20.2_cluster/bin/hadoop-daemon.sh: No such file or directory
bin/start-dfs.sh: line 51: /w1153435/hadoop-0.20.2_cluster/bin/hadoop-daemons.sh: No such file or directory
bin/start-dfs.sh: line 52: /w1153435/hadoop-0.20.2_cluster/bin/hadoop-daemons.sh: No such file or directory
CODE

there's No such file or directory as /w1153435/hadoop-0.20.2_cluster/bin/hadoop-daemons.sh

I had tried running this command below earlier but also got problems:
CODE
w1153435@ngs:~/hadoop-0.20.2_cluster export HADOOP_CONF_DIR=${HADOOP_HOME}/conf
w1153435@ngs:~/hadoop-0.20.2_cluster export HADOOP_SLAVES=${HADOOP_CONF_DIR}/slaves
w1153435@ngs:~/hadoop-0.20.2_cluster ${HADOOP_HOME}/bin/slaves.sh mkdir -p /home/w1153435/hadoop-0.20.2_cluster/tmp/hadoop
-bash: /bin/slaves.sh: No such file or directory
w1153435@ngs:~/hadoop-0.20.2_cluster export HADOOP_HOME=/home/w1153435/hadoop-0.20.2_cluster
w1153435@ngs:~/hadoop-0.20.2_cluster ${HADOOP_HOME}/bin/slaves.sh mkdir -p /home/w1153435/hadoop-0.20.2_cluster/tmp/hadoop
cat: /conf/slaves: No such file or directory
CODE

there's No such file or directory as /conf/slaves because you set HADOOP_HOME after setting the other env variables, which are expanded at set-time, not run-time.
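The set-time expansion point can be seen in any shell, independent of Hadoop. A sketch (the path is the one from the transcript above):

```shell
# Shell variables expand when assigned, not when used.
# Deriving HADOOP_CONF_DIR before HADOOP_HOME is set bakes in an empty string:
unset HADOOP_HOME
HADOOP_CONF_DIR=${HADOOP_HOME}/conf      # expands to "/conf" -- wrong
echo "wrong: $HADOOP_CONF_DIR"

# Set HADOOP_HOME first, then derive the others from it:
export HADOOP_HOME=/home/w1153435/hadoop-0.20.2_cluster
export HADOOP_CONF_DIR=${HADOOP_HOME}/conf
export HADOOP_SLAVES=${HADOOP_CONF_DIR}/slaves
echo "right: $HADOOP_CONF_DIR"
```

With the exports in that order, ${HADOOP_HOME}/bin/slaves.sh finds both the script and the conf/slaves file.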
Re: Help on DFSClient
On 06/08/2011 20:41, jagaran das wrote: I am keeping a stream open and writing through it using a multithreaded application. The application is in a different box and I am connecting to the NN remotely. I was using FileSystem and getting the same error, and now I am trying DFSClient and getting the same error. When I run it via a simple standalone class, it doesn't throw any error, but when I put it in my application, it throws this error.

That's just logging, at DEBUG level, where the LeaseChecker was created. Your app is set up to log at DEBUG, and you are getting more diagnostics than normal.

06Aug2011 12:29:24,345 DEBUG [listenerContainer-1] (DFSClient.java:1115) - Wait for lease checker to terminate
06Aug2011 12:29:24,346 DEBUG [LeaseChecker@DFSClient[clientName=DFSClient_280246853, ugi=jagarandas]: java.lang.Throwable: for testing
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.toString(DFSClient.java:1181)
at org.apache.hadoop.util.Daemon.<init>(Daemon.java:38)
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.put(DFSClient.java:1094)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:547)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:513)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:497)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:442)
at com.apple.ireporter.common.persistence.ConnectionManager.createConnection(ConnectionManager.java:74)
at com.apple.ireporter.common.persistence.HDPPersistor.writeToHDP(HDPPersistor.java:95)
at com.apple.ireporter.datatransformer.translator.HDFSTranslator.persistData(HDFSTranslator.java:41)
at com.apple.ireporter.datatransformer.adapter.TranslatorAdapter.processData(TranslatorAdapter.java:61)
at com.apple.ireporter.datatransformer.DefaultMessageListener.persistValidatedData(DefaultMessageListener.java:276)
at com.apple.ireporter.datatransformer.DefaultMessageListener.onMessage(DefaultMessageListener.java:93)
at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:506)
at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:463)
at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:435)
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:322)
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:260)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:944)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:868)
at java.lang.Thread.run(Thread.java:680)
Re: Namenode Scalability
On 10/08/2011 08:58, jagaran das wrote: In my current project we are planning to stream data to the Namenode (20 node cluster). Data volume would be around 1 PB per day. But there are applications which can publish data at 1 GBPS.

Is that Gigabytes/s or Gigabits/s?

Few queries: 1. Can a single Namenode handle such high-speed writes? Or does it become unresponsive when a GC cycle kicks in? see below 2. Can we have multiple federated Namenodes sharing the same slaves, so that we can distribute the writes accordingly? that won't solve your problem 3. Can multiple region servers of HBase help us? no Please suggest how we can design the streaming part to handle such scale of data.

Data is written to datanodes, not namenodes. The NN is used to set up the write chain and then just tracks node health -the data does not go through it. This changes your problem to one of:
-can the NN set up write chains at the speed you want, or do you need to throttle back the file creation rate by writing bigger files
-can the NN handle the (file x block count) volumes you expect
-what is the network traffic of the data ingress
-what is the total bandwidth of the replication traffic combined with the data ingress traffic?
-do you have enough disks for the data
-do your HDDs have enough bandwidth?
-do you want to do any work with the data, and what CPU/HDD/net load does this generate?
-what impact will datanode replication traffic have?
-how much of the backbone will you have to allocate to the rebalancer?

At 1 PB/day, ignoring all network issues, you will reach the current documented HDFS limits within four weeks. What are you going to do then, or will you have processed it down? I could imagine some experiments you could conduct against a namenode to see what its limits are, but there are a lot of datacentre bandwidth and computation details you have to worry about above and beyond datanode performance issues.
Like Michael says, 1 PB/day sounds like a homework project, especially if you haven't used hadoop at smaller scale. If it is homework, once you've done the work (and submitted it), it'd be nice to see the final paper. If it is something you plan to take live, well, there are lots of issues to address, of which the NN is just one -and one you can test in advance. Ramping up the cluster with different loads will teach you more about the bottlenecks. Otherwise: there are people who know how to run Hadoop at scale, who, in exchange for money, will help you. -steve
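The namespace arithmetic behind "four weeks" can be sketched as a back-of-envelope calculation. The 128 MB block size and the ~100-200M-objects-per-namenode ceiling are both assumptions here, not figures from this thread:

```shell
# Back-of-envelope namenode load at 1 PB/day (all figures are assumptions).
PB=$((1024*1024*1024*1024*1024))   # bytes in a petabyte
BLOCK=$((128*1024*1024))           # assume 128 MB blocks
BLOCKS_PER_DAY=$((PB / BLOCK))
echo "blocks/day: $BLOCKS_PER_DAY"
echo "blocks after 4 weeks: $((28 * BLOCKS_PER_DAY))"
# Against a commonly cited ceiling of roughly 100-200M namespace objects
# per namenode, the namespace fills within the month -- before raw disk
# capacity even enters into it.
```

Writing bigger files (fewer, larger blocks) is the main lever: it divides the object count without touching the ingress byte rate.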
Re: Invalid link http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar during ivy download whiling mumak build.
On 30/07/11 06:30, arun k wrote: Hi all! I have added the following code to build.xml and tried to build: $ ant package. I have also tried removing the entire ivy2 (~/.ivy2/*) directory and rebuilding, but couldn't succeed. <setproxy proxyhost="192.168.0.90" proxyport="8080" proxyuser="ranam" proxypassword="passwd" nonproxyhosts="xyz.svn.com"/> I get the error UNRESOLVED DEPENDENCIES. I have attached the log file.

The artifact is there, so it's a proxy problem.

export ANT_OPTS="-Dhttp.proxyHost=proxy -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy -Dhttps.proxyPort=8080"

These don't set Ant properties, they set JVM options, and do work for Hadoop builds.
Re: The best architecture for EC2/Hadoop interface?
On 02/08/11 05:09, Mark Kerzner wrote: Hi, I want to give my users a GUI that would allow them to start Hadoop clusters and run applications that I will provide on the AMIs. What would be a good approach to make it simple for the user? Should I write a Java Swing app that will wrap around the EC2 commands? Should I use some more direct EC2 API? Or should I use a web browser interface? My idea was to give the user a Java Swing GUI, so that he gives his Amazon credentials to it, and it would be secure because the application is not exposed to the outside. Does this approach make sense?

1. I'm not sure that a Java Swing GUI makes sense for anything anymore -if it ever did.
2. Have a look at what other people have done first before writing your own. Amazon provide something for their derivative of Hadoop, Elastic MR; I suspect KarmaSphere and others may provide UIs in front of it too.

The other thing is that most big jobs are more than one operation, so you are in workflow territory. Things like Cascading, Pig and Oozie help here, and if you can bring them up in-cluster, you can get a web UI.
Re: Submitting and running hadoop jobs Programmatically
On 27/07/11 05:55, madhu phatak wrote: Hi, I am submitting the job as follows: java -cp Nectar-analytics-0.0.1-SNAPSHOT.jar:/home/hadoop/hadoop-for-nectar/hadoop-0.21.0/conf/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_COMMON_HOME/* com.zinnia.nectar.regression.hadoop.primitive.jobs.SigmaJob input/book.csv kkk11fffrrw 1

My code to submit jobs (via a declarative configuration) is up online http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/operations/components/submitter/SubmitterImpl.java?revision=8590&view=markup It's LGPL, but ask nicely and I'll change the header to Apache. That code doesn't set up the classpath by pushing out more JARs (I'm planning to push out .groovy scripts instead), but it can also poll for job completion, take a timeout (useful in small test runs), and do other things. I currently mainly use it for testing.
Re: Job progress not showing in Hadoop Tasktracker web interface
On 20/07/11 06:11, Teng, James wrote: You can't run a hadoop job in eclipse; you have to set up an environment on a linux system. Maybe you can try to install it on a VMware linux system and run the job in pseudo-distributed mode.

Actually, you can bring up a MiniMRCluster in your JUnit test run (if hadoop-core-test is on the classpath) and run simple jobs against that. This is the standard way that Hadoop tests itself. It's not that high-performing, doesn't scale out and can leak threads, but it's ideal for basic testing.
Re: error of loading logging class
On 20/07/11 07:16, Juwei Shi wrote: Hi, We faced a problem of loading a logging class when starting the name node. It seems that hadoop can not find commons-logging-*.jar. We have tried other commons-logging-1.0.4.jar and commons-logging-api-1.0.4.jar. It does not work! The following are error logs from the starting console:

I'd drop the -api file as it isn't needed, and, as you say, avoid duplicate versions. Make sure that log4j is at the same point in the class hierarchy too (e.g. in hadoop/lib).

To debug commons-logging, tell it to log to stderr -it's useful in emergencies: -Dorg.apache.commons.logging.diagnostics.dest=STDERR
Re: Which release to use?
On 19/07/11 12:44, Rita wrote: Arun, I second Joe's comment. Thanks for giving us a heads up. I will wait patiently until 0.23 is considered stable.

API-wise, 0.21 is better. I know that as I'm working with 0.20.203 right now, and it is a step backwards. Regarding future releases, the best way to get them stable is to participate in release testing in your own infrastructure. Nothing else will find the problems unique to your setup of hardware, network and software.
Re: Which release to use?
On 16/07/2011 16:53, Rita wrote: I am curious about the IBM product BigInsights. Where can we download it? It seems we have to register to download it? I think you have to pay to use it.
Re: Cluster Tuning
On 08/07/2011 16:25, Juan P. wrote: Here's another thought. I realized that the reduce operation in my map/reduce jobs is a flash, but it goes really slowly until the mappers end. Is there a way to configure the cluster to make the reduce wait for the map operations to complete, especially considering my hardware constraints?

Take a look to see if it's usually the same machine that's taking too long; test your HDDs to see if there are any signs of problems in the SMART messages. Then turn on speculation. The problem with a slow mapper could be caused by disk problems or an overloaded server.
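For reference, the knobs involved in a 0.20-era mapred-site.xml would look something like this; the 0.80 threshold is purely illustrative, and you should check the property names against your release's mapred-default.xml:

```xml
<!-- mapred-site.xml fragment: illustrative values, 0.20-era property names -->
<!-- hold reducers back until (here) 80% of the maps have finished -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>
<!-- speculation: re-run straggling tasks on other nodes -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
</property>
```

The slowstart setting addresses the "make the reduce wait" question directly; speculation addresses the straggler machine.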
Re: Which release to use?
On 15/07/2011 15:58, Michael Segel wrote: Unfortunately the picture is a bit more confusing. Yahoo! is now HortonWorks. Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release. So those selling commercial support are:
*Cloudera
*HortonWorks
*MapRTech
*EMC (reselling MapRTech, but had announced their own)
*IBM (not sure what they are selling exactly... still seems like smoke and mirrors...)
*DataStax
+ Amazon, indirectly, that do their own derivative work of some release of Hadoop (which version is it based on?)

I've used 0.21, which was the first with the new APIs and, with MRUnit, has the best test framework. For my small-cluster uses, it worked well. (oh, and I didn't care about security)
Re: Which release to use?
On 15/07/2011 18:06, Arun C Murthy wrote: Apache Hadoop is a volunteer driven, open-source project. The contributors to Apache Hadoop, both individuals and folks across a diverse set of organizations, are committed to driving the project forward and making timely releases - see discussion on hadoop-0.23 with a raft of newer features such as HDFS Federation, NextGen MapReduce and plans for HA NameNode etc. As with most successful projects there are several options for commercial support to Hadoop or its derivatives. However, Apache Hadoop has thrived before there was any commercial support (I've personally been involved in over 20 releases of Apache Hadoop and deployed them while at Yahoo) and I'm sure it will in this new world order. We, the Apache Hadoop community, are committed to keeping Apache Hadoop 'free', providing support to our users and to moving it forward at a rapid rate.

Arun makes a good point, which is that the Apache project depends on contributions from the community to thrive. That includes:
-bug reports
-patches to fix problems
-more tests
-documentation improvements: more examples, more on getting started, troubleshooting, etc.
If there's something lacking in the codebase, and you think you can fix it, please do so. Helping with the documentation is a good start, as it can always be improved, and you aren't going to break anything. Once you get into changing the code, you'll end up working with the head of whichever branch you are targeting.

The other area everyone can contribute on is testing. Yes, Y! and FB can test at scale; yes, other people can test large clusters too -but nobody has a network that looks like yours but you. And Hadoop does care about network configurations. Testing beta and release-candidate builds in your infrastructure helps verify that the final release will work on your site, and you don't end up getting all the phone calls about something not working.
Re: Hadoop cluster hardware details for big data
On 06/07/11 11:43, Karthik Kumar wrote: Hi, Has anyone here used hadoop to process more than 3TB of data? If so, we would like to know how many machines you used in your cluster and about the hardware configuration. The objective is to know how to handle huge data in a Hadoop cluster.

This is too vague a question. What do you mean by "process"? Scan through some logs looking for values? You could do that on a single machine if you weren't in a rush and you had enough disks; you'd just be very IO-bound -and to be honest, HDFS needs a minimum number of machines to become fault tolerant. Do complex matrix operations that use lots of RAM and CPU? You'll need more machines.

If your cluster has a blocksize of 512MB then a 3TB file fits into (3*1024*1024)/512 = 6144 blocks, so you can't have more than 6144 machines working on it anyway -that's your theoretical maximum, even if your name is Facebook or Yahoo! What you are looking for is something in between 10 and 6144, the exact number driven by:
-how much compute you need to do, and how fast you want it done (controls # of CPUs, RAM)
-how much total HDD storage you anticipate needing
-whether you want to do leading-edge GPU work (good performance on some tasks, but limited work per machine)

You can use benchmarking tools like gridmix3 to get some more data on the characteristics of your workload, which you can then take to your server supplier to say "this is what we need, what can you offer?" Otherwise everyone is just guessing. Remember also that you can add more racks later, but you will need to plan ahead on datacentre space, power and -very importantly- how you are going to expand the networking. Life is simplest if everything fits into one rack, but if you plan to expand you need a roadmap of how to connect that rack to some new ones, which means adding fast interconnect between the different top-of-rack switches. You also need to worry about how to get data in and out fast. -Steve
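The block-count ceiling from the paragraph above, as arithmetic (the 512 MB block size is the assumption stated in the text):

```shell
# One map task per block is the usual upper bound on useful parallelism,
# so the block count caps the number of machines that can help.
TB_IN_MB=$((1024*1024))            # MB in a terabyte
BLOCKS=$((3 * TB_IN_MB / 512))     # 3 TB of data, 512 MB blocks
echo "max useful machines: $BLOCKS"
```

A smaller block size raises the ceiling but also multiplies the namenode's object count; the right trade-off depends on the workload.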
Re: Hadoop cluster hardware details for big data
On 06/07/11 11:43, Karthik Kumar wrote: Hi, Has anyone here used hadoop to process more than 3TB of data? If so we would like to know how many machines you used in your cluster and about the hardware configuration. The objective is to know how to handle huge data in Hadoop cluster. Actually, I've just thought of simpler answer. 40. It's completely random, but if said with confidence it's as valid as any other answer to your current question.
Re: Hadoop cluster hardware details for big data
On 06/07/11 13:18, Michel Segel wrote: Wasn't the answer 42? ;-P

42 = 40 + NN + 2ary NN, assuming the JT runs on the 2ary or on one of the worker nodes.

Looking at your calc... You forgot to factor in the number of slots per node, so the number is only a fraction. Assume 10 slots per node (10 because it makes the math easier.)

I thought something was wrong. Then I thought of the server revenue and decided not to look that hard.
Re: error in reduce task
On 24/06/11 18:16, Niels Boldt wrote: Hi, I'm running nutch in a pseudo cluster, i.e. all daemons are running on the same server. I'm writing to the hadoop list, as it looks like a problem related to hadoop. Some of my jobs partially fail, and in the error log I get output like:

2011-06-24 08:45:05,765 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201106231520_0190_r_00_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
2011-06-24 08:45:05,771 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201106231520_0190_r_00_0 copy failed: attempt_201106231520_0190_m_00_0 from worker1
2011-06-24 08:45:05,772 WARN org.apache.hadoop.mapred.ReduceTask: java.net.UnknownHostException: worker1

The above basically says that my worker is unknown, but I can't really make any sense of it. Other jobs running before, at the same time or after complete fine without any error messages and without any changes on the server. Also, other reduce tasks in the same run have succeeded. So it looks like my worker sometimes 'disappears' and can't be reached.

If the worker had disappeared off the net, you'd be more likely to see a NoRouteToHost.

My current theory is that it only happens when there are a couple of jobs running at the same time. Is that a plausible explanation? Would anybody have some suggestions how I could get more information from the system, or point me in a direction where I should look? (I'm also quite new to hadoop)

I'd assume that one machine in the cluster doesn't have an /etc/hosts entry for worker1, or that the DNS server is suffering under load. If you can, put the host lists into the /etc/hosts table instead of relying on DNS. If you do it on all machines, it avoids having to work out which one is playing up. That said, some better logging of which host is trying to make the connection would be nice.
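The hosts-table fix being suggested is an entry for every node on every node; something like this, where the addresses are invented for illustration:

```
# /etc/hosts -- identical on every machine in the cluster
# (IPs below are made up; use your own, and keep the file in sync everywhere)
127.0.0.1     localhost
192.168.1.10  master
192.168.1.11  worker1
192.168.1.12  worker2
```

With the same file on all machines, a transient DNS failure under load can no longer surface as an UnknownHostException mid-shuffle.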
Re: Upgrading namenode/secondary node hardware
On 16/06/11 14:19, MilleBii wrote: But if my filesystem is up and running fine... do I have to worry at all, or will the copy (ftp transfer) of hdfs be enough?

I'm not going to make any predictions about if/when things go wrong:
-you do need to shut down the FS before the move
-you ought to get the edit logs replayed before the move
-you may want to try experimenting with copying the namenode data and bringing up the namenode (without any datanodes connected to it, so it comes up in safe mode), to make sure everything works.

I'd also worry that if you aren't familiar with the edit log, you may need to spend some time learning the subtle details of namenode journalling, replaying, backup and restoration, and what the secondary namenode does. It's easy to bring up a cluster and get overconfident that it works, right up to the moment it stops working. Experiment with your cluster's and team's failure handling before you really need it.

2011/6/16 Steve Loughran ste...@apache.org On 15/06/11 15:54, MilleBii wrote: Thx. #1 don't understand the edit logs remark. well, that's something you need to work on as it's the key to keeping your cluster working. The edit log is the journal of changes made to a namenode, which gets streamed to HDD and your secondary namenode. After a NN restart, it has to replay all changes since the last checkpoint to get its directory structure up to date. Lose the edit log and you may as well reformat the disks.
Re: Upgrading namenode/secondary node hardware
On 15/06/11 15:54, MilleBii wrote: Thx. #1 don't understand the edit logs remark.

well, that's something you need to work on as it's the key to keeping your cluster working. The edit log is the journal of changes made to a namenode, which gets streamed to HDD and your secondary namenode. After a NN restart, it has to replay all changes since the last checkpoint to get its directory structure up to date. Lose the edit log and you may as well reformat the disks.
Re: Upgrading namenode/secondary node hardware
On 14/06/11 22:01, MilleBii wrote: I want/need to upgrade my namenode/secondary node hardware. It actually also acts as one of the datanodes. I could not find any how-to guides. So what is the process to switch from one piece of hardware to the next?

1. For HDFS data: is it just a matter of copying all the hdfs data from the old server to the new server? yes, put it in the same place on your HA storage and you may not even need to reconfigure it. If you didn't shut down the filesystem cleanly, you'll need to replay the edit logs.

2. What about the decommissioning procedure of the data node; is it necessary in that case? You shouldn't need to. This is no different from handling failover of a namenode, which you ought to practise from time to time anyway, with two common tactics:
-have ready-to-go replacement servers with the same hostname/IP and shared storage
-have ready-to-go replacement servers with different hostnames, then use your cluster management tools to bounce the workers into a new configuration.

3. For MapRed: need to change the master in cluster configuration files? I'd give the new boxes the same hostnames and IP addresses as before, and nothing else will notice. And I recommend having good cluster management tooling anyway, of course.
Re: Hadoop on windows with bat and ant scripts
On 13/06/11 15:27, Bible, Landy wrote: On 06/13/2011 07:52 AM, Loughran, Steve wrote: On 06/10/2011 03:23 PM, Bible, Landy wrote: I'm currently running HDFS on Windows 7 desktops. I had to create a hadoop.bat that provided the same functionality as the shell scripts, and some Java Service Wrapper configs to run the DataNodes and NameNode as Windows services. Once I get my system more functional I plan to do a write-up about how I did it, but it wasn't too difficult. I'd also like to see Hadoop become less platform dependent. why? Do you plan to bring up a real Windows server datacenter to test it on? Not a datacenter, but a large-ish cluster of desktops, yes. Whether you like it or not, all the big Hadoop clusters run on Linux I realize that; I use Linux wherever possible, much to the annoyance of my Windows-only co-workers. However, for my current project, I'm using all the Windows 7 and Vista desktops at my site as a storage cluster. The first idea was to run Hadoop on Linux in a VM in the background on each desktop, but that seemed like overkill. The point here is to use the resources we have but aren't using, rather than buy new resources. Academia is funny like that.

I understand. One trick my local university has done is to buy a set of servers with HDDs for their HDFS filestore, but also hook them up to their grid scheduler (Condor? Torque?) so the existing grid jobs see a set of machines for their work, while the JobTracker sees a farm of worker nodes with local data. Some more work there on reporting busy-state to each job scheduler would be nice, so that the TaskTrackers would say "busy" when running grid jobs, and vice versa.

So far, I've been unable to make MapReduce work correctly. The services run, but things don't work; however, I suspect that this is due to DNS not working correctly in my environment. yes, that's part of the "anywhere" you have to fix. Edit the host tables so that DNS and reverse DNS appear to work. That's c:\windows\system32\drivers\etc\hosts, unless on a win64 box it moves.

Why does Hadoop even care about DNS? Every node checks in with the NameNode and JobTrackers, so they know where they are; why not just go pure-IP and forget DNS? Managing the hosts file is a pain... even when you automate it, it just seems unneeded.

there have been some fixes in 0.21 and 0.22, but there may still be a tendency to look things up.
https://issues.apache.org/jira/browse/HADOOP-3426
https://issues.apache.org/jira/browse/HADOOP-7104
Hadoop doesn't like coming up on multi-homed servers or having separate in-cluster and long-haul hostnames. Yes, this all needs fixing. I think the reason it hasn't been fixed is that the big datacentres do have well-configured networks, caching DNS servers in every worker node, etc, and all is well. It's the home networks and the less consistently set up ones (mine, and perhaps yours) where the trouble shows up. We get to file the bugs and fix the problems.
Re: Hadoop on windows with bat and ant scripts
On 06/10/2011 03:23 PM, Bible, Landy wrote: Hi Raja, I'm currently running HDFS on Windows 7 desktops. I had to create a hadoop.bat that provided the same functionality as the shell scripts, and some Java Service Wrapper configs to run the DataNodes and NameNode as Windows services. Once I get my system more functional I plan to do a write-up about how I did it, but it wasn't too difficult. I'd also like to see Hadoop become less platform dependent.

why? Do you plan to bring up a real Windows server datacenter to test it on?

Java is supposed to be Write Once - Run Anywhere, but a lot of Java projects seem to forget that.

Java can be x-platform, but you have to consider the problems of testing on hundreds of machines, the fact that even Runtime.exec() behaves differently on different systems, that the networking setup and behaviour of Windows is very different from Unix, etc. Whether you like it or not, all the big Hadoop clusters run on Linux -not just for the licensing costs, but because it is what Hadoop is tested on at those scales, so it becomes self-reinforcing. Same for the JVM: Sun's standard JVM, not JRockit or anything else. Again, in a large datacenter you will find all the corner cases where that "runs anywhere" claim changes to "crashes one task tracker every hour". OS/X and Windows support is very much there for development, though even there I'd recommend switching to a Linux laptop to reduce the surprises when you go to the real cluster. Allen W will note that Solaris works too, but even then differences between Linux and SunOS caused problems.

Having a de-facto agreement to focus on Linux as the back end lets the developers:
* have a single platform to dev and test on
* worry about RPM and deb installers, not Windows install/uninstall quirks
* share ready-to-use Linux VM images (as Cloudera do) for people to play with
* use the large cluster management tooling that exists for managing big Linux clusters (Kickstart, etc).

I think it's important for the client-side code to work on Windows, and for job submission to be x-platform, but getting server-side code to work well on Windows is a lot harder than people expect. The OS wasn't really written for it, the Java Service Wrappers have their own issues (both the Apache one, which is derived from Tomcat's, and the other one), and it's not something I'd recommend going near unless you really have no choice in the matter. I speak from experience. Sorry.

So far, I've been unable to make MapReduce work correctly. The services run, but things don't work; however, I suspect that this is due to DNS not working correctly in my environment. yes, that's part of the "anywhere" you have to fix. Edit the host tables so that DNS and reverse DNS appear to work. That's c:\windows\system32\drivers\etc\hosts, unless on a win64 box it moves.
Re: NameNode heapsize
On 06/10/2011 05:31 PM, si...@ugcv.com wrote: I would add more RAM for sure but there's a hardware limitation. What if the motherboard couldn't support more than ... say 128GB? seems I can't keep adding RAM to resolve it. compressed pointers, do u mean turning on jvm compressed reference ? I didn't try that out before, how's your experience ? JVMs top out at 64GB, I think, while compressed pointers only work on Sun VMs when the heap is under 32GB. JRockit has better heap management, but as I was the only person to admit to using Hadoop on JRockit, I know you'd be on your own if you found problems there. Unless your cluster is bigger than Facebook's, you have too many small files
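Compressed pointers are a JVM option rather than a Hadoop one; a sketch of how you might turn them on for the namenode via hadoop-env.sh, assuming a Sun JDK recent enough to support the flag (the heap size here is purely an example, and compressed oops only help while the heap stays under ~32GB):

```shell
# conf/hadoop-env.sh -- illustrative settings, tune the heap to your hardware
export HADOOP_NAMENODE_OPTS="-XX:+UseCompressedOops -Xmx24g $HADOOP_NAMENODE_OPTS"
```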
Re: Hadoop on windows with bat and ant scripts
On 06/12/2011 03:01 AM, Raja Nagendra Kumar wrote: Hi, I see hadoop would need unix (on windows with Cygwin) to run. It would be much nicer if Hadoop got away from the shell scripts, through appropriate ant scripts or a java Admin Console kind of model. Then it becomes lighter for development. The overhead of executing ant vs shell scripts would make things worse on Linux clusters. The cost of installing cygwin on developer desktops isn't that high. There have been discussions of a better native library for Hadoop operations, but it would be biased towards Unix (POSIX file permissions, paths, etc).
Re: Why inter-rack communication in mapreduce slow?
On 06/06/2011 02:40 PM, John Armstrong wrote: On Mon, 06 Jun 2011 09:34:56 -0400, dar...@ontrenet.com wrote: Yeah, that's a good point. In fact, it almost makes me wonder if an ideal setup is not only to have each of the main control daemons on their own nodes, but to put THOSE nodes on their own rack and keep all the data elsewhere. I'd give them 10Gbps connection to the main network fabric, as with any ingress/egress nodes whose aim in life is to get data into and out of the cluster. There's a lot to be said for fast nodes within the datacentre but not hosting datanodes, as that way their writes get scattered everywhere -which is what you need when loading data into HDFS. You don't need separate racks for this, just more complicated wiring. -steve (disclaimer, my network knowledge generally stops at Connection Refused and No Route to Host messages)
Re: Hadoop Cluster Multi-datacenter
On 06/07/2011 06:07 AM, sanjeev.ta...@us.pwc.com wrote: Hello, I wanted to know if anyone has any tips or tutorials on how to install the hadoop cluster on multiple datacenters Nobody has come out and said they've built a single HDFS filesystem from multiple sites, primarily because the inter-site bandwidth/latency will be awful and there isn't any support for this in the topology model of Hadoop (there are some placeholders though). You could set up an HDFS filesystem in each datacentre, and use symbolic links (or the forthcoming federation) to pull data in. There's no reason why you can't start up a job on Datacentre-1 that starts reading some of its data from DC-2, after which all the work will be datacentre-local. Do you need ssh connectivity between the nodes across these data centers? Depends on how you deploy Hadoop. You only need SSH if you use the built-in tooling; if you use large-scale cluster management tools then it's a non-issue.
Re: NameNode is starting with exceptions whenever its trying to start datanodes
On 06/07/2011 10:50 AM, praveenesh kumar wrote: The logs say "The ratio of reported blocks 0.9091 has not reached the threshold 0.9990. Safe mode will be turned off automatically." Not enough datanodes have reported in, or they are missing data
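A couple of admin commands that help tell those two cases apart; these need a running cluster, so treat them as a sketch rather than something to paste blindly:

```shell
# See how many datanodes have reported in and what they hold
bin/hadoop dfsadmin -report

# Check whether the namenode is still in safe mode
bin/hadoop dfsadmin -safemode get

# Only force it out if you understand why blocks are missing
bin/hadoop dfsadmin -safemode leave
```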
Re: Why inter-rack communication in mapreduce slow?
On 06/06/11 08:22, elton sky wrote: hello everyone, As I don't have experience with big scale cluster, I cannot figure out why the inter-rack communication in a mapreduce job is significantly slower than intra-rack. I saw cisco catalyst 4900 series switch can reach up to 320Gbps forwarding capacity. Connected with 48 nodes with 1Gbps ethernet each, there should not be much contention at the switch, should there? I don't know enough about these switches; I do hear stories about buffering and the like, and I also hear that a lot of switches don't always expect all the ports to light up simultaneously. Outside hadoop, try setting up some simple bandwidth tests to measure inter-rack bandwidth: have every node on one rack try and talk to one on another at full rate. Set up every node talking to every other node at least once, to make sure there aren't odd problems between two nodes, which can happen if one of the NICs is playing up. Once you are happy that the basic bandwidth between servers is OK, then it's time to start worrying about adding Hadoop to the mix -steve
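One way to run those bandwidth tests, assuming iperf (or any similar tool) is installed on the nodes; the hostname here is a placeholder:

```shell
# On a node in rack 1: start a receiver
iperf -s

# On a node in rack 2: send for 30 seconds and report achieved bandwidth
iperf -c rack1-node1 -t 30
```

Run the client on several nodes at once to see how the inter-rack link behaves under concurrent load, then rotate the pairings so every node gets exercised at least once.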
Re: Starting a Hadoop job outside the cluster
My Job submit code is http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/operations/components/submitter/ something to run tool classes http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/operations/components/submitter/ToolRunnerComponentImpl.java?revision=8590&view=markup something to integrate job submission with some pre-run sanity checks, and to optionally wait for the work to finish http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/operations/components/submitter/SubmitterImpl.java?revision=8590&view=markup works remotely for short-lived jobs; if you submit something that may run over a weekend you don't normally want to block for it
Re: Identifying why a task is taking long on a given hadoop node
On 03/06/2011 12:24, Mayuresh wrote: Hi, I am really having a hard time debugging this. I have a hadoop cluster and one of the maps is taking time. I checked the datanode logs and can see no activity for around 10 minutes! The usual cause here is imminent disk failure, as reads start to take longer and longer. Look at your SMART disk logs, and do some performance tests of all the drives
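A sketch of how to pull those SMART logs and run a crude read test, assuming smartmontools and hdparm are installed and /dev/sda is the suspect drive:

```shell
# Dump SMART attributes; watch for reallocated/pending sector counts
smartctl -a /dev/sda

# Kick off the drive's own long self-test (results appear in the SMART log later)
smartctl -t long /dev/sda

# Crude sequential-read benchmark of the raw device
hdparm -t /dev/sda
```

Compare the hdparm numbers across all the drives in the node; one drive reading far slower than its siblings is a classic sign of a disk on the way out.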
Re: java.lang.NoClassDefFoundError: com.sun.security.auth.UnixPrincipal
On 05/26/2011 07:45 PM, subhransu wrote: Hello Geeks, I am a newbie to hadoop and have currently installed hadoop-0.20.203.0 I am running the sample programs that are part of this package but getting this error Any pointer to fix this ??? ~/Hadoop/hadoop-0.20.203.0 788 bin/hadoop jar hadoop-examples-0.20.203.0.jar sort
java.lang.NoClassDefFoundError: com.sun.security.auth.UnixPrincipal
 at org.apache.hadoop.security.UserGroupInformation.<clinit>(UserGroupInformation.java:246)
 at java.lang.J9VMInternals.initializeImpl(Native Method)
 at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
 at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:449)
 at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:437)
 at org.apache.hadoop.examples.Sort.run(Sort.java:82)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.examples.Sort.main(Sort.java:187)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
you're running the IBM JVM. https://issues.apache.org/jira/browse/HADOOP-7211 Go to the IBM web site and download their slightly-modified version of Hadoop that works with their JVM, or switch to the Sun JVM, which is the only one that Hadoop is rigorously tested on. Sorry. -steve
Re: Hadoop and WikiLeaks
On 23/05/11 01:10, Edward Capriolo wrote: Correct. But it is a place to discuss changing the content of http://hadoop.apache.org which is what I am advocating. Todd's going to fix it. I just copied and pasted in the newspaper quote: it's not that I wanted to make any statement whatsoever, I just slapped in what the paper said. That's one of the English papers that isn't allowed to say which soccer player has been playing away from home, or reprint any part of the twitter query http://twitter.com/#search?q=%23superinjunction . When Britain has secrets, see, they're shallow and meaningless. For anyone who wants to see what the award looks like, here's Owen holding it http://www.flickr.com/photos/steve_l/5560919533/in/set-72157626356732562 apparently people kept coming up to talk to Jakob when he was walking round the cube, though once he started to explain about distributed file systems and resilient application execution through idempotent operations executed near the data they always ran off. -steve
Re: Exception in thread AWT-EventQueue-0 java.lang.NullPointerException
On 16/05/11 21:12, Lạc Trung wrote: I'm using Hadoop-0.21. --- hut.edu.vn At the top, it's your code, so you get to fix it. The good thing about open source is you can go all the way in. This is what I would do in the same situation:
-Grab the 0.21 source JAR
-add it to your IDE
-have a look at what line it's NPEing on
-work out which variable's being null would trigger that. Usually it's whatever object has just had a method called on it.
-work out why that's null (usually some config/startup thing was missing).
-If it's a repeatable problem, try adding some checks in advance, dump state to the console, etc.
-fix what is probably a setup problem
If it's some other problem (race condition etc), life is harder -steve
Re: Cluster hard drive ratios
On 05/05/11 19:14, Matthew Foley wrote: a node (or rack) is going down, don't replicate == DataNode Decommissioning. This feature is available. The current usage is to add the hosts to be decommissioned to the exclusion file named in dfs.hosts.exclude, then use DFSAdmin to invoke -refreshNodes. (Search for decommission in DFSAdmin source code.) NN will stop using these servers as replication targets, and will re-replicate all their replicas to other hosts that are still in service. The count of nodes that are in the process of being decommissioned is reported in the NN status web page. I'm thinking more of "don't overreact to 50 machines going offline": don't re-replicate all the copies whose replication count has just dropped by 1 until the rack has been offline for 30 minutes.
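The decommissioning procedure Matthew describes boils down to one property and one command; the file path here is an example:

```xml
<!-- hdfs-site.xml: point the namenode at an exclusion file -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>
```

Add the hostnames to be decommissioned to that file (one per line), run `bin/hadoop dfsadmin -refreshNodes`, and watch the decommissioning count on the NN status page until the re-replication completes.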
Re: Cluster hard drive ratios
On 04/05/11 19:59, Matt Goeke wrote: Mike, Thanks for the response. It looks like this discussion forked on the CDH list so I have two different conversations now. Also, you're dead on that one of the presentations I was referencing was Ravi's. With your setup I agree that it would have made no sense to go the 2.5" drive route, given it would have forced you into the 500-750GB SATA drives and all it would allow is more spindles but less capacity at a higher cost. The servers we have been considering are actually the R710's, so dual hexacore with 12 spindles of actual capacity is more of a 1:1 in terms of cores to spindles vs the 2:1 I have been reviewing. My original issue attempted to focus more on at what point you actually see a plateau in write performance of cores:spindles, but since we are headed that direction anyway it looks like it was more to sate curiosity than to drive specifications. some people are using this as it gives the best storage density. You can also go for single hexacore servers, as in a big cluster the savings there translate into even more storage. It all depends on the application. As to your point, I forgot to include the issue of rebalancing in the original email but you are absolutely right. That was another major concern, especially as we would get closer to filling the capacity of a 24TB box. I think the original plan was bonded GBe but I think our infrastructure team has told us 10GBe would be standard. 1. If you want to play with bonded GBe then I have some notes I can send you -it's harder than you think. 2. I don't know anyone who is running 10GBe + Hadoop, though I see hints that StumbleUpon are doing this with Arista switches. You'd have to have a very chatty app or 10GBe on the mainboard to justify it. 3. I do know of installations with 24TB HDD and GBe; yes, the overhead of a node failure is higher, but with fewer nodes, P(failure) may be lower.
The big fear is loss-of-rack, which can come from ToR switch failure or from network config errors. Hadoop isn't partition-aware: it will treat a rack outage as the loss of 40+ servers and try to replicate all that data, and that's when you're in trouble (look at the AWS EBS outage for an example of a cascading failure). 4. There are JIRA issues for better handling of drive failure, including hotswapping and rebalancing data within a single machine. 5. I'd like support for the ability to say "a node is going down, don't replicate", and the same for a rack, to ease maintenance. -Steve
Re: Applications creates bigger output than input?
On 30/04/2011 05:31, elton sky wrote: Thank you for suggestions: Weblog analysis, market basket analysis and generating search index. I guess for these applications we need more reduces than maps, for handling large intermediate output, don't we? Besides, the input split for map should be smaller than usual, because the workload for spill and merge on map's local disk is heavy. any form of rendering can generate very large images see: http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf
Re: Execution time.
On 26/04/11 14:16, real great.. wrote: Thanks a lot. I have managed to do it. And my final year project is on power aware Hadoop. i do realise it's against ethics to get the code that way..:) Good. What do you mean by power aware? -awareness of the topology of UPS sources inside a datacentre? -awareness of CPU voltage level/power drain, so as to schedule work where CPUs are capable of being most efficiently used, rather than scheduling work on a CPU that will have to ramp up to its full voltage and so be least efficient? Either would be interesting. You could use the existing rack topology scripts for UPS topology, but really there should be two topologies, as it's block placement where you need the UPS topology
Re: Cluster hardware question
On 26/04/11 14:55, Xiaobo Gu wrote: Hi, People say a balanced server configuration is as follows: 2 4-core CPUs, 24G RAM, 4 1TB SATA disks. But we have been used to using storage servers with 24 1TB SATA disks, and we are wondering whether Hadoop will be CPU bound if this kind of server is used. Does anybody have experience with hadoop running on servers with so many disks? Some of the new clusters are running one or two 6-core CPUs with 12*2TB 3.5" HDDs for storage, as this gives maximum storage density (it fits in a 1U). The exact ratio of CPU:RAM:disk depends on the application. What you get with the big servers is -more probability of local access -great IO bandwidth, especially if you set up the mapred.temp.dir value to include all the drives -fewer servers, which means fewer network ports on the switches, so you can save some money in the network fabric, and in time/effort cabling everything up. What do you lose? -in a small cluster, loss of a single machine matters -in a large cluster, loss of a single machine can generate up to 24TB of replication traffic (more once 3TB HDDs become affordable) -in a large cluster, loss of a rack (or switch) can generate a very large amount of traffic. If you were building a large (multi-PB) cluster, this design is good for storage density -you could get a petabyte in a couple of racks, though the replication costs of a Top of Rack switch failure might push you towards 2xToR switches and bonded NICs, which introduce a whole new set of problems. For smaller installations? I don't know. -Steve
Re: Unsplittable files on HDFS
On 27/04/11 10:48, Niels Basjes wrote: Hi, I did the following with a 1.6GB file hadoop fs -Ddfs.block.size=2147483648 -put /home/nbasjes/access-2010-11-29.log.gz /user/nbasjes and I got Total number of blocks: 1 4189183682512190568:10.10.138.61:50010 10.10.138.62:50010 Yes, that does the trick. Thank you. Niels 2011/4/27 Harsh Jha...@cloudera.com: Hey Niels, The block size is a per-file property. Would putting/creating these gzip files on the DFS with a very high block size (such that it doesn't split across for such files) be a valid solution to your problem here? Don't set a block size > 2GB; not all the bits of code that use signed 32-bit integers have been eliminated yet.
Re: Seeking Advice on Upgrading a Cluster
On 21/04/11 18:33, Geoffry Roberts wrote: What will give me the most bang for my buck? - Should I bring all machines up to 8G of memory? or is 4G good enough? (8 is the max.) depends on whether your code is running out of memory - Should I double up the NICs and use LACP? I would only recommend this for increasing availability, at the expense of time spent getting it all to work. - Should I double up the disks and attempt to flow my I/O from one disk to another, on the theory that this will minimize contention? if your app is bandwidth bound (iotop should tell you this) then yes, this will help. - Should I get another switch? (I have a 10/100, 24 port Dlink and it's about 5 years old.) a gigabit switch is low cost now; I'd do that as one of my first actions. Why not do some experiments: build a smaller cluster, double up its RAM and HDDs with those from your remaining machines, and see which change benefits your code the most?
Re: Fixing a bad HD
On 26/04/11 05:20, Bharath Mundlapudi wrote: Right, if you have a hardware which supports hot-swappable disk, this might be easiest one. But still you will need to restart the datanode to detect this new disk. There is an open Jira on this. -Bharath That'll be HDFS-664 https://issues.apache.org/jira/browse/HDFS-664 Nobody is working on this, all contributions welcome
Re: Fixing a bad HD
On 26/04/11 05:20, Bharath Mundlapudi wrote: Right, if you have a hardware which supports hot-swappable disk, this might be easiest one. But still you will need to restart the datanode to detect this new disk. There is an open Jira on this. -Bharath Correction, there is a patch up there now. If you want to get involved in the coding of Hadoop to meet your specific needs, this might be the place to start
Re: HOD exception: java.io.IOException: No valid local directories in property: mapred.local.dir
On 11/04/2011 16:48, Boyu Zhang wrote: Exception in thread main org.apache.hadoop.ipc.RemoteException: java.io.IOException: No valid local directories in property: mapred.local.dir The job tracker can't find any of the local filesystem directories listed in the mapred.local.dir property; either the conf file is wrong or the machine is misconfigured
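A sketch of what a working entry looks like in mapred-site.xml; the paths are examples, and every directory listed must exist on the local disks and be writable by the user running the MapReduce daemons:

```xml
<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local</value>
</property>
```

Listing one directory per physical drive also spreads the spill/shuffle I/O across spindles.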
Re: Reg HDFS checksum
On 12/04/2011 07:06, Josh Patterson wrote: If you take a look at: https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/ExternalHDFSChecksumGenerator.java you'll see a single-process version of what HDFS does under the hood, albeit in a highly distributed fashion. What's going on here is that for every 512 bytes a CRC32 is calc'd and saved at each local datanode for that block. When the checksum is requested, these CRC32's are pulled together and MD5 hashed, which is sent to the client process. The client process then MD5 hashes all of these hashes together to produce a final hash. For some context: Our purpose on the openPDC project for this was we had some legacy software writing to HDFS through a FTP proxy bridge: https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/HdfsBridge/ Since the openPDC data was ultra critical in that we could not lose *any* data, and the team wanted to use a simple FTP client lib to write to HDFS (least amount of work for them, standard libs), we needed a way to make sure that no corruption occurred during the hop through the FTP bridge (it acted as intermediary to DFSClient; something could fail, and the file might be slightly truncated, yet it would be hard to detect this). In the FTP bridge we allowed a custom FTP command to call the now exposed hdfs-checksum command, and the sending agent could then compute the hash locally (in the case of the openPDC it was done in C#), and make sure the file made it there intact. This system has been in production for over a year now storing and maintaining smart grid data and has been highly reliable. I say all of this to say: After having dug through HDFS's checksumming code I am pretty confident that it's Good Stuff, although I don't proclaim to be a filesystem expert by any means. It may be just some simple error or oversight in your process, possibly?
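A single-process sketch (not the HDFS implementation itself) of the scheme Josh describes: a CRC32 per 512-byte chunk, an MD5 over each block's concatenated CRCs, then an MD5 over those per-block digests. The class name, block sizes, and data are toy values for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.zip.CRC32;

public class ChecksumSketch {
    static final int BYTES_PER_CRC = 512;

    /** CRC32 of each 512-byte chunk, concatenated as 4-byte big-endian values. */
    static byte[] chunkCrcs(byte[] block) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int off = 0; off < block.length; off += BYTES_PER_CRC) {
            CRC32 crc = new CRC32();
            crc.update(block, off, Math.min(BYTES_PER_CRC, block.length - off));
            long v = crc.getValue();
            out.write((int) (v >>> 24));
            out.write((int) (v >>> 16));
            out.write((int) (v >>> 8));
            out.write((int) v);
        }
        return out.toByteArray();
    }

    static byte[] md5(byte[] data) throws Exception {
        return MessageDigest.getInstance("MD5").digest(data);
    }

    /** Final checksum: MD5 of the per-block MD5s of the per-chunk CRC32s. */
    static byte[] fileChecksum(byte[][] blocks) throws Exception {
        ByteArrayOutputStream blockDigests = new ByteArrayOutputStream();
        for (byte[] block : blocks) {
            blockDigests.write(md5(chunkCrcs(block)));
        }
        return md5(blockDigests.toByteArray());
    }

    public static void main(String[] args) throws Exception {
        // Toy "blocks" -- real HDFS blocks are tens of megabytes.
        byte[] b1 = new byte[1300];
        byte[] b2 = new byte[700];
        Arrays.fill(b1, (byte) 'a');
        Arrays.fill(b2, (byte) 'b');
        byte[] sum = fileChecksum(new byte[][] { b1, b2 });
        System.out.println("checksum is " + sum.length + " bytes");
    }
}
```

The same file written intact on both sides of an FTP hop yields the same digest; a truncated copy changes the last block's CRC run and hence the final MD5, which is exactly the property the openPDC bridge relied on.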
Assuming it came down over HTTP, it's perfectly conceivable that something went wrong on the way, especially if a proxy server gets involved. All HTTP checks is that the (optional) content length is consistent with what arrived; it relies on TCP checksums, which verify that the network links work, but not the other bits of the system in the way (like any proxy server)
Re: Hadoop Pipes Error
On 31/03/11 07:53, Adarsh Sharma wrote: Thanks Amareshwari, here is the posting : The nopipe example needs more documentation. It assumes that it is run with the InputFormat from src/test/org/apache/hadoop/mapred/pipes/WordCountInputFormat.java, which has a very specific input split format. By running with a TextInputFormat, it will send binary bytes as the input split and won't work right. The nopipe example should probably be recoded to use libhdfs too, but that is more complicated to get running as a unit test. Also note that since the C++ example is using local file reads, it will only work on a cluster if you have nfs or something working across the cluster. Please correct me if I'm wrong. I need to run it with TextInputFormat. If possible, please explain the above post more clearly. Here goes. 1. The nopipe example needs more documentation. It assumes that it is run with the InputFormat from src/test/org/apache/hadoop/mapred/pipes/WordCountInputFormat.java, which has a very specific input split format. By running with a TextInputFormat, it will send binary bytes as the input split and won't work right. The input for the pipe is the content generated by src/test/org/apache/hadoop/mapred/pipes/WordCountInputFormat.java This is covered here. http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Example%3A+WordCount+v1.0 I would recommend following that tutorial, or either of the books Hadoop: The Definitive Guide or Hadoop in Action. Both authors earn their money by explaining how to use Hadoop, which is why both books are good explanations of it. 2. The nopipe example should probably be recoded to use libhdfs too, but that is more complicated to get running as a unit test. Ignore that -it's irrelevant for your problem, as Owen is discussing automated testing. 3. Also note that since the C++ example is using local file reads, it will only work on a cluster if you have nfs or something working across the cluster.
unless your cluster has a shared filesystem at the OS level it won't work. Either have a shared filesystem like NFS, or run it on a single machine. -Steve
Re: does counters go the performance down seriously?
On 28/03/11 23:34, JunYoung Kim wrote: hi, this link is about good practices for hadoop usage: http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/ by Arun C Murthy if I want to use about 50,000 counters for a job, does it cause a serious performance drop? Yes, you will use up lots of JT memory and so put limits on the overall size of your cluster. If you have a small cluster and can crank up the memory settings on the JT to 48 GB this isn't going to be an issue, but as Y! are topping out at these numbers anyway, lots of counters just overload them.
Re: ant version problem
On 27/03/11 21:02, Daniel McEnnis wrote: Steve, Here it is: user@ubuntu:~/src/trunk$ ant -diagnostics --- Ant diagnostics report --- Apache Ant version 1.8.0 compiled on May 9 2010 --- Implementation Version --- core tasks : 1.8.0 in file:/usr/share/ant/lib/ant.jar optional tasks : 1.8.0 in file:/usr/share/ant/lib/ant-nodeps.jar OK, that's a linux distro install without any extra jars. There's enough in a JVM these days that the basic operations will all work. compile-rcc-compiler: [javac] /home/user/src/trunk/build.xml:333: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds BUILD FAILED /home/user/src/trunk/build.xml:338: taskdef A class needed by class org.apache.hadoop.record.compiler.ant.RccTask cannot be found: Task using the classloader Looking at that class, http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20/src/core/org/apache/hadoop/record/compiler/ant/RccTask.java?view=markup I don't see any obvious dependencies on non-standard things, 20 import java.io.File; 21 import java.util.ArrayList; 22 import org.apache.hadoop.record.compiler.generated.Rcc; 23 import org.apache.tools.ant.BuildException; 24 import org.apache.tools.ant.DirectoryScanner; 25 import org.apache.tools.ant.Project; 26 import org.apache.tools.ant.Task; 27 import org.apache.tools.ant.types.FileSet; The only odd one is the generated Rcc task. That taskdef message from ant is saying that something imported is missing. Try running ant in debug mode (ant -debug) to see if it gives a clue, but there's very little that anyone else can do here.
Re: observe the effect of changes to Hadoop
On 25/03/2011 14:10, bikash sharma wrote: Hi, For my research project, I need to add a couple of functions in the JobTracker.java source file to include additional information about TaskTrackers' resource usage through heartbeat messages. I made those changes to the JobTracker.java file. However, I am not very clear how to see these effects. I mean, what are the next steps in terms of building the entire Hadoop code base, using the built distribution and installing it again in the cluster, etc? If you are working with the Job Tracker you only need to rebuild the mapreduce JARs, push the new JAR out to the Job Tracker server, and restart that process. For more safety, put the same JAR on all the task trackers and shut down HDFS before the updates, but that's potentially overkill. Any elaborate updates on these will be very useful since I do not have much experience in making modifications to a huge code base like Hadoop and observing the effects of those changes. I'd recommend getting everything working on a local machine in a single VM (the MiniMRCluster class helps), then moving to multiple VMs and finally, if the code looks good, a real cluster with data you don't value. -steve
Re: ant version problem
On 27/03/2011 02:01, Daniel McEnnis wrote: Dear Hadoop, Which version of ant do I need to keep the hadoop build from failing? Netbeans ant works, as does eclipse ant. However, ant 1.8.2 does not, nor does the default ant from Ubuntu 10.10. Snippet from failure to follow: 1.8.2 will work, but when you install via a linux distro the classpath is trickier to set up, as you need to get all dependent jars on the classpath. what does ant -diagnostics say? compile-rcc-compiler: [javac] /home/user/src/trunk/build.xml:333: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds BUILD FAILED /home/user/src/trunk/build.xml:338: taskdef A class needed by class org.apache.hadoop.record.compiler.ant.RccTask cannot be found: Task using the classloader AntClassLoader[/home/user/src/trunk/build/classes:/home/user/src/trunk/conf:/home/user/.ivy2/cache/commons-logging/commons-logging/jars/commons-logging-1.1.1.jar:/home/user/.ivy2/cache/log4j/log4j/jars/log4j-1.2.15.jar:/home/user/.ivy2/cache/commons-httpclient/commons-httpclient/jars/commons-httpclient-3.1.jar:/home/user/.ivy2/cache/commons-codec/commons-codec/jars/commons-codec-1.4.jar:/home/user/.ivy2/cache/commons-cli/commons-cli/jars/commons-cli-1.2.jar:/home/user/.ivy2/cache/xmlenc/xmlenc/jars/xmlenc-0.52.jar:/home/user/.ivy2/cache/net.java.dev.jets3t/jets3t/jars/jets3t-0.7.1.jar:/home/user/.ivy2/cache/commons-net/commons-net/jars/commons-net-1.4.1.jar:/home/user/.ivy2/cache/org.mortbay.jetty/servlet-api-2.5/jars/servlet-api-2.5-6.1.14.jar:/home/user/.ivy2/cache/net.sf.kosmosfs/kfs/jars/kfs-0.3.jar:/home/user/.ivy2/cache/org.mortbay.jetty/jetty/jars/jetty-6.1.14.jar:/home/user/.ivy2/cache/org.mortbay.jetty/jetty-util/jars/jetty-util-6.1.14.jar:/home/user/.ivy2/cache/tomcat/jasper-runtime/jars/jasper-runtime-5.5.12.jar:/home/user/.ivy2/cache/tomcat/jasper-compiler/jars/jasper-compiler-5.5.12.jar:/home/user/.ivy2/cache/org.mortbay.jetty/jsp-api-2.1/jars/jsp-api-2.1-6.1.14.jar:/home/user/.ivy2/cache/org.mortbay.jetty/jsp-2.1/jars/jsp-2.1-6.1.14.jar:/home/user/.ivy2/cache/commons-el/commons-el/jars/commons-el-1.0.jar:/home/user/.ivy2/cache/oro/oro/jars/oro-2.0.8.jar:/home/user/.ivy2/cache/jdiff/jdiff/jars/jdiff-1.0.9.jar:/home/user/.ivy2/cache/junit/junit/jars/junit-4.8.1.jar:/home/user/.ivy2/cache/hsqldb/hsqldb/jars/hsqldb-1.8.0.10.jar:/home/user/.ivy2/cache/commons-logging/commons-logging-api/jars/commons-logging-api-1.1.jar:/home/user/.ivy2/cache/org.slf4j/slf4j-api/jars/slf4j-api-1.5.11.jar:/home/user/.ivy2/cache/org.eclipse.jdt/core/jars/core-3.1.1.jar:/home/user/.ivy2/cache/org.slf4j/slf4j-log4j12/jars/slf4j-log4j12-1.5.11.jar:/home/user/.ivy2/cache/org.apache.hadoop/avro/jars/avro-1.3.2.jar:/home/user/.ivy2/cache/org.codehaus.jackson/jackson-mapper-asl/jars/jackson-mapper-asl-1.4.2.jar:/home/user/.ivy2/cache/org.codehaus.jackson/jackson-core-asl/jars/jackson-core-asl-1.4.2.jar:/home/user/.ivy2/cache/com.thoughtworks.paranamer/paranamer/jars/paranamer-2.2.jar:/home/user/.ivy2/cache/com.thoughtworks.paranamer/paranamer-ant/jars/paranamer-ant-2.2.jar:/home/user/.ivy2/cache/com.thoughtworks.paranamer/paranamer-generator/jars/paranamer-generator-2.2.jar:/home/user/.ivy2/cache/com.thoughtworks.qdox/qdox/jars/qdox-1.10.1.jar:/home/user/.ivy2/cache/asm/asm/jars/asm-3.2.jar:/home/user/.ivy2/cache/commons-lang/commons-lang/jars/commons-lang-2.5.jar:/home/user/.ivy2/cache/org.aspectj/aspectjrt/jars/aspectjrt-1.6.5.jar:/home/user/.ivy2/cache/org.aspectj/aspectjtools/jars/aspectjtools-1.6.5.jar:/home/user/.ivy2/cache/org.mockito/mockito-all/jars/mockito-all-1.8.2.jar:/home/user/.ivy2/cache/com.jcraft/jsch/jars/jsch-0.1.42.jar] Total time: 6 seconds Sincerely, Daniel McEnnis.
Re: CDH and Hadoop
On 23/03/11 15:32, Michael Segel wrote: Rita, It sounds like you're only using Hadoop and have no intentions to really get into the internals. I'm like most admins/developers/IT guys and I'm pretty lazy. I find it easier to set up the yum repository and then issue the yum install hadoop command. The thing about Cloudera is that they do backport patches, so while their release is 'heavily patched', it is usually in some sort of sync with the Apache release. Since you're only working with HDFS and it's pretty stable, I'd say go with the Cloudera release. to be fair, the Y! version of 0.20.x has all the backports to do with scale; on a large cluster I'd pick up that one, with the understanding that if you have support problems, you can't pay Cloudera to hold your hand. If you have any plans to get involved in the Hadoop (and friends) code, to move from a user to contributor, you should get with the official releases. Similarly, if you have some problem and want to file a bug, you should get the latest official release and test with that, as -the first question on the bug report will be "is it still there?" -you'll need to help debug it. Going forward, there are plans to do RPM and ideally deb artifacts of 0.22 and later versions of Hadoop, making them easier to install. This still leaves the question of who supports it, the answers being you, or anyone you pay to, that being the way open source works -steve
Re: Creating bundled jar files for running under hadoop
On 22/03/11 13:34, Andy Doddington wrote: I am trying to create a bundled jar file for running using the hadoop ‘jar’ command. However, when I try to do this it fails to find the jar files and other resources that I have placed into the jar (pointed at by the Class-Path property in the MANIFEST.MF). I have unpacked the jar file into a temporary directory and run it manually using the java -jar command (after manually listing the various hadoop jar files that are required in the classpath) and it works fine in this mode. I have read in some places that the workaround is to put my jar files and other resources into the hadoop ‘lib’ directory, but I really don’t want to do this, since it feels horribly kludgy. Putting them in lib/ is bad, as you need to restart the cluster to get changes out, and you can't have jobs with different versions in them. Use the -libjars option instead. I don’t mind creating a ‘lib’ directory in my jar file, since this will be hidden, but I would appreciate more documentation as to why/how I need to do this. because the class-path manifest thing is designed for classloaders in the local filesystem, where relative paths are used to find JARs and everything runs locally, and is only handled by the normal Java main entry point and the applet loader. Can anybody advise as to how I can get ‘hadoop jar’ to honour the Class-Path entry in the MANIFEST.MF? You'd probably have to write the code to -parse the entry at job submission -find the JARs or fail -somehow include all these JARs in the list of dependencies for the job submission via a transform to the -libjars command -add the tests for this I'm not sure it's worth the effort.
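A sketch of the -libjars route; the jar names and main class here are examples, and it assumes your main class goes through ToolRunner/GenericOptionsParser so the generic options get picked up:

```shell
# Ship dependency JARs with the job instead of editing the cluster's lib/ dir
bin/hadoop jar myjob.jar com.example.MyJob \
  -libjars lib/commons-lang.jar,lib/guava.jar \
  input output
```

The listed JARs are copied into the job's distributed cache and appear on the task classpath, so different jobs can ship different versions without touching the cluster.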
Re: decommissioning node woes
On 19/03/11 16:00, Ted Dunning wrote: Unfortunately this doesn't help much because it is hard to get the ports to balance the load. On Fri, Mar 18, 2011 at 8:30 PM, Michael Segel michael_se...@hotmail.com wrote: With a 1GbE port, you could go 100Mb/s for the bandwidth limit. If you bond your ports, you could go higher. Port bonding is possible, it's just harder to -set up all the cabling -be sure both ports are fully utilised It's less expensive than 10G ether because those switches cost a lot more, and with 2x1 you can have separate ToR switches for more redundancy. For decommissioning, why not boost the rebalance bandwidth before you trigger the decommission, then drop it afterwards? -steve
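On these 0.20-era releases the rebalance bandwidth is a datanode-side setting, so raising it means pushing out config and restarting the datanodes; the 10MB/s value below is illustrative, not a recommendation:

```xml
<!-- hdfs-site.xml: bytes/second each datanode may spend on block balancing -->
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value>
</property>
```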
Re: Installing Hadoop on Debian Squeeze
On 21/03/11 09:00, Dieter Plaetinck wrote: On Thu, 17 Mar 2011 19:33:02 +0100 Thomas Koch tho...@koch.ro wrote: Currently my advice is to use the Debian packages from Cloudera. That's the problem, it appears there are none. Like I said in my earlier mail, Debian is not in Cloudera's list of supported distros, and they do not have a repository for Debian packages. (I tried the Ubuntu repository but that didn't work.) I have now installed it by just downloading and extracting the tarball; it seems that's basically all that is needed. Dieter There's an open JIRA on having Apache release its own Hadoop RPMs; pushing out Debian packages would go alongside this, but that requires someone else to volunteer the work...
Public Talks from Yahoo! and LinkedIn in Bristol, England, Friday Mar 25
This isn't relevant for people who don't live in or near South England or Wales, but for those that do, I'm pleased to announce that Owen O'Malley and Sanjay Radia of Yahoo! and Jakob Homan of LinkedIn will all be giving public talks on Hadoop on Friday March 25 at HP Laboratories, in Bristol. http://hphadoop.eventbrite.com/ If you can come along, great! If not, they are giving the same talks in London on the Wednesday -at a talk that is already booked up-, where Yahoo! will be videoing the talks for them to go up online afterwards. You'll all be able to catch up from the luxury of your own laptop. Looking forward to meeting other Hadoop developers and users next week, Steve
Re: Hadoop code base splits
On 17/03/11 07:05, Matthew John wrote: Hi, Can someone provide me some pointers on the following details of Hadoop code base: 1) breakdown of HDFS code base (approximate lines of code) into following modules: - HDFS at the Datanodes - Namenode - Zookeeper - MapReduce based - Any other relevant split 2) breakdown of Hbase code into following modules: - HMaster - RegionServers - MapReduce - Any other relevant split You are free to check out the source code and do whatever analysis you want. You can also look at the entire SVN history and do some really interesting analysis, especially if you have any data mining tooling to hand, like a small hadoop cluster.
Re: Load testing in hadoop
On 15/03/11 04:59, Kannu wrote: Please tell me how to use synthetic load generator in hadoop or suggest me any other way of load testing in hadoop cluster. thanks, kannu terasort is the one most people use, as it generates its own datasets. Otherwise you need a few TB of data and some custom code to run on it. Gridmix/gridmix2 can also simulate load. All this stuff is in the hadoop source tree, with documentation
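A sketch of the usual terasort run; the paths and the examples-jar name vary by release, so treat these as placeholders:

```shell
# teragen writes N 100-byte rows, so 10 million rows is roughly 1GB of input
hadoop jar hadoop-examples.jar teragen 10000000 /benchmarks/tera-in
hadoop jar hadoop-examples.jar terasort /benchmarks/tera-in /benchmarks/tera-out
hadoop jar hadoop-examples.jar teravalidate /benchmarks/tera-out /benchmarks/tera-report
```

teravalidate confirms the output is globally sorted, which makes the run a correctness check as well as a load test.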
Re: Speculative execution
On 02/03/11 21:01, Keith Wiley wrote: I realize that the intended purpose of speculative execution is to overcome individual slow tasks...and I have read that it explicitly is *not* intended to start copies of a task simultaneously and to then race them, but rather to start copies of tasks that seem slow after running for a while. ...but aside from merely being slow, sometimes tasks arbitrarily fail, and not in data-driven or otherwise deterministic ways. A task may fail and then succeed on a subsequent attempt...but the total job time is extended by the time wasted during the initial failed task attempt. Yes, but the problem is determining which one will fail. Ideally you should find the root cause, which is often some race condition or hardware fault. If it's the same server every time, turn it off. It would be super-swell to run copies of a task simultaneously from the starting line and simply kill the copies after the winner finishes. While this is wasteful in some sense (that is the argument offered for not running speculative execution this way to begin with), it would be more precise to say that different users may have different priorities under various use-case scenarios. The wasting of duplicate tasks on extra cores may be an acceptable cost toward the higher priority of minimizing job times for a given application. Is there any notion of this in Hadoop? You can play with the specex parameters, maybe change when they get kicked off. The assumption in the code is that the slowness is caused by H/W problems (especially HDD issues) and it tries to avoid duplicate work. If every Map were duplicated, you'd be doubling the effective cost of each query, and annoying everyone else in the cluster. Plus the increased disk and network IO might slow things down. Look at the options, have a play and see. If it doesn't have the feature, you can always try coding it in -if the scheduler API lets you do it, you won't be breaking anyone else's code. -steve
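For reference, the 0.20-era specex knobs are just on/off switches per task type (the old mapred.* key names below); there is no stock option for racing duplicates from the start:

```xml
<!-- mapred-site.xml: enable/disable speculative attempts per task type -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
</property>
```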
Re: recommendation on HDDs
On 10/02/11 22:25, Michael Segel wrote: Shrinivas, Assuming you're in the US, I'd recommend the following: Go with 2TB 7200 SATA hard drives. (Not sure what type of hardware you have) What we've found is that in the data nodes, there's an optimal configuration that balances price versus performance. While your chassis may hold 8 drives, how many open SATA ports are on the motherboard? Since you're using JBOD, you don't want the additional expense of having to purchase a separate controller card for the additional drives. I'm not going to disagree about cost, but I will note that a single controller can become a bottleneck once you add a lot of disks to it; it generates lots of interrupts that go to the same core, which then ends up at 100% CPU and overloaded. With two controllers the work can get spread over two CPUs, moving the bottlenecks back into the IO channels. For that reason I'd limit the #of disks for a single controller to around 4-6. Remember that as well as storage capacity, you need disk space for logs, spill space, temp dirs, etc. This is why 2TB HDDs are looking appealing these days. Speed? 10K RPM has a faster seek time and possibly bandwidth, but you pay in capital and power. If the HDFS blocks are laid out well, seek time isn't so important, so consider saving the money and putting it elsewhere. The other big question with Hadoop is RAM and CPU, and the answer there is it depends. RAM depends on the algorithm, as can the CPU:spindle ratio ... I recommend 1 core to 1 spindle as a good starting point. In a large cluster, the extra capital cost of a second CPU compared to the amount of extra servers and storage that you could get for the same money speaks in favour of more servers, but in smaller clusters the spreadsheets say different things. -Steve (disclaimer, I work for a server vendor :)
Re: recommendation on HDDs
On 12/02/11 16:26, Michael Segel wrote: All, I'd like to clarify some things... First, the concept is to build out a cluster of commodity hardware, so when you do your shopping you want to get the most bang for your buck. That is the 'sweet spot' that I'm talking about. When you look at the E5500 or E5600 chip sets, you will want to go with 4 cores per CPU, dual CPU, and a clock speed around 2.53GHz or so. (Faster chips are more expensive and the performance edge falls off, so you end up paying a premium.) Interesting choice; the 7 core in a single CPU option is something else to consider. Remember also this is a moving target; what anyone says is valid now (Feb 2011) will be seen as quaint in two years' time. Even a few months from now, what is the best value for a cluster will have moved on. Looking at your disks, you start with the on-board SATA controller. Why? Because it means you don't have to pay for a controller card. If you are building a cluster for general purpose computing... Assuming 1U boxes, you have room for 4 3.5" SATA drives, which still give you the best performance for your buck. Can you go with 2.5"? Yes, but you are going to be paying a premium. Price-wise, a 2TB SATA II 7200 RPM drive is going to be your best deal. You could go with SATA III drives if your motherboard supports SATA III ports, but you're still paying a slight premium. The OP felt that all he would need was 1TB of disk and was considering 4 250GB drives. (More spindles...yada yada yada...) My suggestion is to forget that nonsense and go with one 2TB drive, because it's a better deal, and if you want to add more disk to the node, you can. (It's easier to add disk than it is to replace it.) Now do you need to create a spare OS drive? No. Some people who have an internal 3.5" space sometimes do. That's ok, and you can put your hadoop logging there. (Just make sure you have a lot of disk space...)
One advantage of a specific drive for OS and logs (in a separate partition) is that you can re-image it without losing data you care about, and swap in a replacement fast. If you have a small cluster set up for hotswap, that reduces the time a node is down -just have a spare OS HDD ready to put in. OS disks are the ones you care about when they fail; for the others, the failure rate is something to be mildly concerned about rather than something to page you over. The truth is that there really isn't any single *right* answer. There are a lot of options and budget constraints, as well as physical constraints like power, space, and location of the hardware. +1. Don't forget weight either. Also you may be building out a cluster whose main purpose is to be a backup location for your main cluster. So your production cluster has lots of nodes; your backup cluster has lots of disks per node, because your main focus is as much storage per node as possible. So here you may end up buying a 4U rack box, loading it up with 3.5" drives and a couple of SATA controller cards. You care less about performance and more about storage space. Here you may say 3TB SATA drives, with 12 or more per box. (I don't know how many you can fit into a 4U chassis these days.) So you have 10 DNs backing up a 100+ DN cluster in your main data center. But that's another story. You can get 12 HDDs in a 1U if you ask nicely. But in a small cluster there's a cost: that server can be a big chunk of your filesystem, and if it goes down there's up to 24TB worth of replication going to take place over the rest of the network, so you'll need at least 24TB of spare capacity on the other machines, ignoring bandwidth issues. I think the main takeaway you should have is that if you look at the price point... your best price per GB is on a 2TB drive, until the prices drop on 3TB drives. Since the OP believes that their requirement is 1TB per node... a single 2TB would be the best choice.
It allows for additional space and you really shouldn't be too worried about disk i/o being your bottleneck. One less thing to worry about is good.
Re: CUDA on Hadoop
On 09/02/11 17:31, He Chen wrote: Hi Sharma, I shared our slides about CUDA performance on Hadoop clusters. Feel free to modify it, but please mention the copyright! This is nice. If you stick it up online you should link to it from the Hadoop wiki pages -maybe start a hadoop+cuda page and refer to it
Re: hadoop infrastructure questions (production environment)
On 08/02/11 15:45, Oleg Ruchovets wrote: Hi, we are going to production and have some questions to ask: We are using the 0.20_append version (as I understand it, this is an hbase 0.90 requirement). 1) Currently we have to process 50GB of text files per day; it can grow to 150GB -- what is the best hadoop file size for our load, and is there a suggested disk block size for that? Depends on the #of machines and their performance. The smaller the blocks, the more maps that can be assigned blocks, but it puts more load on the namenode and job tracker. -- We worked using gz and I saw that for every file 1 map task was assigned. What is the best practice: to work with gz files and save disk space, or work without archiving? Hadoop sequence files can be compressed on a per-block basis. It's not as efficient as gz, but reduces your storage and network load. Let's say we want to get performance benefits and disk space is less critical. 2) Currently, when adding an additional machine to the grid, we need to manually maintain all files and configurations. Is it possible to auto-deploy hadoop servers without the need to manually define each one on all nodes? That's the only way people do it in production clusters: you use Configuration Management (CM) tools. Which one you use is your choice, but do use one. 3) Can we change masters without reinstalling the entire grid? -if you can push out a new configuration and restart the workers, you can move the master nodes to any machine in the cluster after a failure. -if you want to leave the nn and JT hostnames the same but change IP addresses, you need to restart all the workers, and make sure the DNS entries of the master nodes are set to expire rapidly so the OS doesn't cache them for long. -if you have machines set up with the same hostname and IP addresses, then you can bring them up as the masters; just have the namenode recover the edit log.
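As a sketch of the per-block compression route, the 0.20-era job config for block-compressed SequenceFile output looks roughly like this (the codec choice is just an example; BLOCK compression keeps the files splittable, unlike whole-file gz):

```xml
<!-- mapred-site.xml or per-job config: block-compressed SequenceFile output -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
```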
Re: CUDA on Hadoop
On 09/02/11 13:58, Harsh J wrote: You can check-out this project which did some work for Hama+CUDA: http://code.google.com/p/mrcl/ Amazon let you bring up a Hadoop cluster on machines with GPUs you can code against, but I haven't heard of anyone using it. The big issue is bandwidth; it just doesn't make sense for a classic scan through the logs kind of problem as the disk:GPU bandwidth ratio is even worse than disk:CPU. That said, if you were doing something that involved a lot of compute on a block of data (e.g. rendering tiles in a map), this could work.
Re: How to speed up of Map/Reduce job?
On 01/02/11 08:19, Igor Bubkin wrote: Hello everybody I have a problem. I installed Hadoop on 2-nodes cluster and run Wordcount example. It takes about 20 sec for processing of 1,5MB text file. We want to use Map/Reduce in real time (interactive: by user's requests). User can't wait for his request 20 sec. This is too long. Is it possible to reduce time of Map/Reduce job? Or may be I misunderstand something? 1. I'd expect a minimum 30s query time due to the way work gets queued and dispatched, JVM startup costs etc. There is no way to eliminate this in Hadoop's current architecture. 2. 1.5M is a very small file size; I'm currently recommending a block size of 512M in new clusters for various reasons. This size of data is just too small to bother with distribution. Load it up into memory; analyse it locally. Things like Apache CouchDB also support MapReduce. Hadoop is not designed for clusters of less than about 10 machines (not enough redundancy of storage), or for small datasets. If your problems aren't big enough, use different tools, because Hadoop contains design decisions and overheads that only make sense once your data is measured in GB and your filesystem in tens to thousands of Terabytes.
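To put the "load it up into memory; analyse it locally" advice in perspective, here is a sketch of what replaces the whole cluster for a 1.5MB input (the file name is hypothetical):

```python
from collections import Counter

def word_count(text):
    # Plain in-memory wordcount: for megabytes of text this finishes
    # in well under the ~30s floor of a Hadoop job submission.
    return Counter(text.split())

# counts = word_count(open("input.txt").read())  # "input.txt": your 1.5MB file
counts = word_count("the quick brown fox jumps over the lazy dog the")
print(counts.most_common(2))
```

No scheduling, no JVM startup, no data shuffling: the cluster only pays for itself once the data no longer fits on one machine.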
Re: Hadoop is for whom? Data architect or Java Architect or All
On 27/01/11 07:28, Manuel Meßner wrote: Hi, you may want to take a look into the streaming API, which allows users to write their map-reduce jobs in any language that is capable of writing to stdout and reading from stdin. http://hadoop.apache.org/mapreduce/docs/current/streaming.html Furthermore, Pig and Hive are Hadoop-related projects and may be of interest to non-Java people: http://pig.apache.org/ http://hive.apache.org/ So finally my answer: no it isn't ;) It helps if your ops team have some experience in running Java app servers or similar, as well as large Linux clusters
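To illustrate the streaming contract (read lines on stdin, write tab-separated key/value lines on stdout; reducer input arrives sorted by key), here is a hypothetical pure-Python wordcount pair:

```python
from itertools import groupby

def map_lines(lines):
    """Streaming mapper: emit "word<TAB>1" for every word seen."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reduce_lines(sorted_lines):
    """Streaming reducer: lines with the same key are adjacent, so groupby works."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))
```

Wrapped in two small scripts that pipe sys.stdin through the matching function and print the results, these would be submitted with something like `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input ... -output ...` (file names hypothetical).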
Re: Best way to limit the number of concurrent tasks per job on hadoop 0.20.2
On 27/01/11 10:51, Renaud Delbru wrote: Hi Koji, thanks for sharing the information. Is the 0.20-security branch planned to be an official release at some point? Cheers If you play with the beta you can see whether it works for you and, if not, get bugs fixed during the beta cycle http://people.apache.org/~acmurthy/hadoop-0.20.100-rc0/
Re: Why Hadoop is slow in Cloud
On 20/01/11 23:24, Marc Farnum Rendino wrote: On Wed, Jan 19, 2011 at 2:50 PM, Edward Capriolo edlinuxg...@gmail.com wrote: As for virtualization,paravirtualization,emulation.(whatever ulization) Wow; that's a really big category. There are always a lot of variables, but the net result is always less. It may be 2% 10% or 15%, but it is always less. If it's less of something I don't care about, it's not a factor (for me). On the other hand, if I'm paying less and getting more of what I DO care about, I'd rather go with that. It's about the cost/benefit *ratio*. There's also perf vs storage. On a big cluster, you could add a second Nehalem CPU and maybe get 10-15% boost on throughput, or for the same capex and opex add 10% new servers, which at scale means many more TB of storage and the compute to go with it. The decision rests with the team and their problems.
Re: Why Hadoop is slow in Cloud
On 21/01/11 09:20, Evert Lammerts wrote: Even with the performance hit, there are still benefits to running Hadoop this way -as you only consume/pay for CPU time you use, if you are only running batch jobs, it's lower cost than having a hadoop cluster that is under-used. -if your data is stored in the cloud infrastructure, then you need to data mine it in VMs, unless you want to take the time and money hit of moving it out, and have somewhere to store it. -if the infrastructure lets you, you can lock down the cluster so it is secure. Where a physical cluster is good is that it is a very low cost way of storing data, provided you can analyse it with Hadoop, and provided you can keep that cluster busy most of the time, either with Hadoop work or other scheduled work. If your cluster is idle for computation, you are still paying the capital and (reduced) electricity costs, so the cost of storage and what compute you do effectively increases. Agreed, but this has little to do with Hadoop as a middleware and more to do with the benefits of virtualized vs physical infrastructure. I agree that it is convenient to use HDFS as a DFS to keep your data local to your VMs, but you could choose other DFS's as well. We don't use HDFS, we bring up VMs close to where the data persists. http://www.slideshare.net/steve_l/high-availability-hadoop The major benefit of Hadoop is its data-locality principle, and this is what you give up when you move to the cloud. Regardless of whether you store your data within your VM or on a NAS, it *will* have to travel over a line. As soon as that happens you lose the benefit of data-locality and are left with MapReduce as a way for parallel computing. And in that case you could use less restrictive software, like maybe PBS. You could even install HOD on your virtual cluster, if you'd like the possibility of MapReduce.
We don't suffer locality hits so much, but you do pay for the extra infrastructure costs of a more agile datacentre, and if you go for redundancy in hardware over replication, you have fewer places to run your code. Even on EC2, which doesn't let you tell it what datasets you want to play with for its VM placer to use in its decisions, once data is in the datanodes you do get locality. Adarsh, there are probably results around of more generic benchmark tools (Phoronix, POV-Ray, ...) for I/O and CPU performance in a VM. Those should give you a better idea of the penalties of virtualization. (Our experience with a number of technologies on our OpenNebula cloud is, like Steve points out, that you mainly pay for disk I/O performance.) -would be interesting to see anything you can publish there... I think a decision to go with either cloud or physical infrastructure should be based on the frequency, intensity and types of computation you expect in the short term (that should include operations dealing with data), and the way you think these parameters will develop in the mid-to-long term. And then compare the prices of a physical cluster that meets those demands (make sure to include power and operations) and the investment you would otherwise need to make in Cloud. +1
Re: namenode format error during setting up hadoop using eclipse in windows 7
On 20/01/11 10:26, arunk786 wrote: Arun K@sairam ~/hadoop-0.19.2 $ bin/hadoop namenode -format cygwin warning: MS-DOS style path detected: C:\cygwin\home\ARUNK~1\HADOOP~1.2\/build/native Preferred POSIX equivalent is: /home/ARUNK~1/HADOOP~1.2/build/native CYGWIN environment variable option nodosfilewarning turns off this warning. Consult the user's guide for more details about POSIX paths: http://cygwin.com/cygwin-ug-net/using.html#using-pathnames bin/hadoop: line 243: /cygdrive/c/Program: No such file or directory 11/01/19 03:53:13 INFO namenode.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = sairam/10.0.1.105 STARTUP_MSG: args = [-format] STARTUP_MSG: version = 0.19.2 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/b ranch-0.19 -r 789657; compiled by 'root' on Tue Jun 30 12:40:50 EDT 2009 / 11/01/19 03:53:13 ERROR namenode.NameNode: java.io.IOException: javax.security.a uth.login.LoginException: Login failed: Expect one token as the result of whoami : sairam\arun k */ Looks like Hadoop doesn't expect a space in the user name. You could file a bug, though I doubt anyone is going to sit down and fix it other than you, that being the way of open source.
Re: How to replace Jetty-6.1.14 with Jetty 7 in Hadoop?
On 18/01/11 19:58, Koji Noguchi wrote: Try moving up to v 6.1.25, which should be more straightforward. FYI, when we tried 6.1.25, we got hit by a deadlock. http://jira.codehaus.org/browse/JETTY-1264 Koji Interesting. Given that there is now 6.1.26 out, that would be the one to play with. Thanks for the heads up, I will move my code up to the .26 release, -steve
Re: Why Hadoop is slow in Cloud
On 17/01/11 04:11, Adarsh Sharma wrote: Dear all, Yesterday I performed a kind of testing between *Hadoop on Standalone Servers* and *Hadoop in the Cloud*. I established a Hadoop cluster of 4 nodes (standalone machines) in which one node acts as Master (Namenode, Jobtracker) and the remaining nodes act as slaves (Datanodes, Tasktracker). On the other hand, for testing Hadoop in the *Cloud* (Eucalyptus), I made one standalone machine the *Hadoop Master* and the slaves are configured on VMs in the cloud. I am confused about the stats obtained after the testing. What I concluded is that the VMs are giving half the performance compared with the standalone servers. Interesting stats, nothing that massively surprises me, especially as your benchmarks are very much streaming through datasets. If you were doing something more CPU-intensive (graph work, for example), things wouldn't look so bad. I've done stuff in this area. http://www.slideshare.net/steve_l/farming-hadoop-inthecloud I expected some slowdown, but never at this level. Is this genuine, or may there be some configuration problem? I am using 1Gb (10-1000mb/s) LAN in the VM machines and 100mb/s in the standalone servers. Please have a look at the results and, if interested, comment on them. The big killer here is file IO; with today's HDD controllers and virtual filesystems, disk IO in the VM is way underpowered compared to physical disk IO. Networking is reduced (but improving), and CPU can be pretty good, but disk is bad. Why? 1. Every access to a block in the VM is turned into virtual disk controller operations, which are then interpreted by the VDC and turned into reads/writes on the virtual disk drive 2. which are turned into seeks, reads and writes in the physical hardware. Some workarounds -allocate physical disks for the HDFS filesystem, for the duration of the VMs.
-have the local hosts serve up a bit of their filesystem on a fast protocol (like NFS), and have every VM mount the local physical NFS filestore as their hadoop data dirs.
Re: How to replace Jetty-6.1.14 with Jetty 7 in Hadoop?
On 16/01/11 09:41, xiufeng liu wrote: Hi, In my cluster, Hadoop somehow cannot work, and I found that it was due to Jetty-6.1.14, which is not able to start up. However, Jetty 7 does work in my cluster. Does anybody know how to replace Jetty 6.1.14 with Jetty 7? Thanks afancy The switch to Jetty 7 will not be easy, and I wouldn't encourage you to do it unless you want to get into editing the Hadoop source and retesting everything. Try moving up to v6.1.25, which should be more straightforward: replace the JAR, then QA the cluster with some terasorting.
Re: TeraSort question.
On 11/01/11 16:40, Raj V wrote: Ted, Thanks. I have all the graphs I need, including the map-reduce timeline and system activity for all the nodes when the sort was running. I will publish them once I have them in some presentable format. For legal reasons, I really don't want to send the complete job history files. My question is still this: when running terasort, would the CPU, disk and network utilization of all the nodes be more or less similar, or completely different? They can be different. The JT pushes out work to machines when they report in; some may get more work than others, so generate more local data. This will have follow-on consequences. In a live system things are different, as the work tends to follow the data, so machines with (or near) the data you need get the work. It's a really hard thing to say "is the cluster working right"; when bringing a cluster up, everyone is really guessing about expected performance. -Steve
Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?
On 13/01/11 08:34, li ping wrote: That is also my concern: is it efficient for data transmission? It's long-lived TCP connections, reasonably efficient for bulk data xfer, has all the throttling of TCP built in, and comes with some excellently debugged client and server code in the form of jetty and httpclient. In maintenance costs alone, those libraries justify HTTP unless you have a vastly superior option *and are willing to maintain it forever*. FTP's limits are well known (security), NFS limits are well known (security, the UDP version doesn't throttle), and self-developed protocols will have whatever problems you want. There are better protocols for long-haul data transfer over fat pipes, such as GridFTP and PhEDEx ( http://www.gridpp.ac.uk/papers/ah05_phedex.pdf ), which use multiple TCP channels in parallel to reduce the impact of a single lost packet, but within a datacentre you shouldn't have to worry about this. If you do find lots of packets get lost, raise the issue with the networking team. -Steve On Thu, Jan 13, 2011 at 4:27 PM, Nan Zhu zhunans...@gmail.com wrote: Hi, all I have a question about the file transmission between the Map and Reduce stages. In the current implementation, the Reducers get the results generated by Mappers through HTTP GET. I don't understand why HTTP was selected; why not FTP, or a self-developed protocol? Is it just because HTTP is simple? thanks Nan
Re: Hadoop Certification Progamme
On 09/12/10 03:40, Matthew John wrote: Hi all,. Is there any valid Hadoop Certification available ? Something which adds credibility to your Hadoop expertise. Well, there's always providing enough patches to the code to get commit rights :)
Re: Hadoop/Elastic MR on AWS
On 10/12/10 06:14, Amandeep Khurana wrote: Mark, Using EMR makes it very easy to start a cluster and add/reduce capacity as and when required. There are certain optimizations that make EMR an attractive choice as compared to building your own cluster out. Using EMR also ensures you are using a production quality, stable system backed by the EMR engineers. You can always use bootstrap actions to put your own tweaked version of Hadoop in there if you want to do that. Also, you don't have to tear down your cluster after every job. You can set the alive option when you start your cluster and it will stay there even after your Hadoop job completes. If you face any issues with EMR, send me a mail offline and I'll be happy to help. How different is your distro from the apache version?
Re: Question from a Desperate Java Newbie
On 10/12/10 09:08, Edward Choi wrote: I was wrong. It wasn't because of the read once free policy. I tried again with Java first again and this time it didn't work. I looked up google and found the Http Client you mentioned. It is the one provided by apache, right? I guess I will have to try that one now. Thanks! httpclient is good, HtmlUnit has a very good client that can simulate things like a full web browser with cookies, but that may be overkill. NYT's read once policy uses cookies to verify that you are there for the first day not logged in, for later days you get 302'd unless you delete the cookie, so stateful clients are bad. What you may have been hit by is whatever robot trap they have -if you generate too much load and don't follow the robots.txt rules they may detect this and push back
Re: Hadoop/Elastic MR on AWS
On 09/12/10 18:57, Aaron Eng wrote: Pros: - Easier to build out and tear down clusters vs. using physical machines in a lab - Easier to scale up and scale down a cluster as needed Cons: - Reliability. In my experience I've had machines die, had machines fail to start up, had network outages between Amazon instances, etc. These problems have occurred at a far more significant rate than in any physical lab I have ever administered. - Money. You get charged for problems with their system. Need to add storage space to a node? That means renting space from EBS, which you then need to actually spend time formatting to ext3 so you can use it with Hadoop. So every time you want to use storage, you're paying Amazon to format it, because you can't tell EBS that you want an ext3 volume. - Visibility. Amazon loves to report that all their services are working properly on their website; meanwhile, the reality is that they only report issues if they are extremely major. Just yesterday they reported increased latency on their us-east-1 region. In reality, increased latency meant 50% of my Amazon API calls were timing out, I could not create new instances, and for about 2 hours I could not destroy the instances I had already spun up. Hows that for ya? Paying them for machines that they won't let me terminate... That's the harsh reality of all VMs: you need to monitor and stamp on things that misbehave. The nice thing is it's easy to do this: just get HTTP status pages and kill any VM that misbehaves. This is not a fault of EC2: any VM infrastructure has this feature. You can't control where your VMs come up, you are penalised by other CPU-heavy machines on the same server, and Amazon throttles the smaller machines a bit. But you -don't pay for cluster time you don't need -don't pay for ingress/egress for data you generate in the vendor's infrastructure (just storage) -can be very agile with cluster size.
I have a talk on this topic for the curious, discussing a UI that is a bit more agile, but even there we deploy agents to every node to keep an eye on the state of the cluster. http://www.slideshare.net/steve_l/farming-hadoop-inthecloud http://blip.tv/file/3809976 Hadoop is designed to work well in a large-scale static cluster: fixed machines, with the standard reactions to failure -clients spin when a server goes away, servers blacklist misbehaving clients- being the right ones to leave ops in control. In a virtual world you want the clients to see (somehow) if the master nodes have moved, and you want the servers to kill the misbehaving VMs to save money, and then create new ones. -Steve