Re: Avro, Hadoop 0.20.2, Jackson Error
Does it still happen if you configure avro-tools to use:

    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-tools</artifactId>
      <version>1.6.3</version>
      <classifier>nodeps</classifier>
    </dependency>

You have two Hadoops, two Jacksons, and even two avro:avro artifacts on your classpath if you use the Avro bundle jar with the default classifier. The avro-tools jar is not intended for inclusion in a project, as it is a jar with its dependencies inside.

https://cwiki.apache.org/confluence/display/AVRO/Build+Documentation#BuildDocumentation-ProjectStructure

On 3/26/12 7:52 PM, Deepak Nettem deepaknet...@gmail.com wrote:

When I include some Avro code in my Mapper, I get this error:

    Error: org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;

Particularly, from just these two lines of code:

    InputStream in = getClass().getResourceAsStream("schema.avsc");
    Schema schema = Schema.parse(in);

This code works perfectly when run as a standalone application outside of Hadoop. Why do I get this error, and what's the best way to get rid of it?

I am using Hadoop 0.20.2, and writing code against the new API. I found that the Hadoop lib directory contains jackson-core-asl-1.0.1.jar and jackson-mapper-asl-1.0.1.jar. I removed these, but got this error:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/codehaus/jackson/map/JsonMappingException

I am using Maven as a build tool, and my pom.xml has this dependency:

    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>1.5.2</version>
      <scope>compile</scope>
    </dependency>

I added the dependency:

    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
      <version>1.5.2</version>
      <scope>compile</scope>
    </dependency>

But that still gives me the same error:

    Error: org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;

I also tried replacing the earlier dependencies with these:

    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-tools</artifactId>
      <version>1.6.3</version>
    </dependency>
    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
      <version>1.6.3</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>1.8.8</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
      <version>1.8.8</version>
      <scope>compile</scope>
    </dependency>

And this is my app dependency tree:

    [INFO] --- maven-dependency-plugin:2.1:tree (default-cli) @ AvroTest ---
    [INFO] org.avrotest:AvroTest:jar:1.0-SNAPSHOT
    [INFO] +- junit:junit:jar:3.8.1:test (scope not updated to compile)
    [INFO] +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
    [INFO] +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile
    [INFO] +- net.sf.json-lib:json-lib:jar:jdk15:2.3:compile
    [INFO] |  +- commons-beanutils:commons-beanutils:jar:1.8.0:compile
    [INFO] |  +- commons-collections:commons-collections:jar:3.2.1:compile
    [INFO] |  +- commons-lang:commons-lang:jar:2.4:compile
    [INFO] |  +- commons-logging:commons-logging:jar:1.1.1:compile
    [INFO] |  \- net.sf.ezmorph:ezmorph:jar:1.0.6:compile
    [INFO] +- org.apache.avro:avro-tools:jar:1.6.3:compile
    [INFO] |  \- org.slf4j:slf4j-api:jar:1.6.4:compile
    [INFO] +- org.apache.avro:avro:jar:1.6.3:compile
    [INFO] |  +- com.thoughtworks.paranamer:paranamer:jar:2.3:compile
    [INFO] |  \- org.xerial.snappy:snappy-java:jar:1.0.4.1:compile
    [INFO] \- org.apache.hadoop:hadoop-core:jar:0.20.2:compile
    [INFO]    +- commons-cli:commons-cli:jar:1.2:compile
    [INFO]    +- xmlenc:xmlenc:jar:0.52:compile
    [INFO]    +- commons-httpclient:commons-httpclient:jar:3.0.1:compile
    [INFO]    +- commons-codec:commons-codec:jar:1.3:compile
    [INFO]    +- commons-net:commons-net:jar:1.4.1:compile
    [INFO]    +- org.mortbay.jetty:jetty:jar:6.1.14:compile
    [INFO]    +- org.mortbay.jetty:jetty-util:jar:6.1.14:compile
    [INFO]    +- tomcat:jasper-runtime:jar:5.5.12:compile
    [INFO]    +- tomcat:jasper-compiler:jar:5.5.12:compile
    [INFO]    +- org.mortbay.jetty:jsp-api-2.1:jar:6.1.14:compile
    [INFO]    +- org.mortbay.jetty:jsp-2.1:jar:6.1.14:compile
    [INFO]    |  \- ant:ant:jar:1.6.5:compile
    [INFO]    +- commons-el:commons-el:jar:1.0:compile
    [INFO]    +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
    [INFO]    +- org.mortbay.jetty:servlet-api-2.5:jar:6.1.14:compile
    [INFO]    +- net.sf.kosmosfs:kfs:jar:0.3:compile
    [INFO]    +- hsqldb:hsqldb:jar:1.8.0.10:compile
    [INFO]    +- oro:oro:jar:2.0.8:compile
    [INFO]    \- org.eclipse.jdt:core:jar:3.1.1:compile

I still get the same error. Somebody please help me with this. I need to resolve this ASAP!

Best,
Deepak
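When this kind of NoSuchMethodError appears, it helps to find out which jars on the runtime classpath actually bundle the conflicting Jackson class. A hedged sketch (the lib directory path is illustrative; point it at your own Hadoop install):

```shell
# Scan a lib directory for jars that bundle Jackson's JsonFactory.
# /usr/lib/hadoop/lib is an assumed location; adjust for your install.
for j in /usr/lib/hadoop/lib/*.jar; do
  if unzip -l "$j" 2>/dev/null | grep -q 'org/codehaus/jackson/JsonFactory.class'; then
    echo "$j"
  fi
done
```

If more than one jar prints (for example, a jackson-core-asl jar plus the avro-tools bundle jar), whichever the classloader finds first wins, which matches the mixed-version symptom described above.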
Re: HDFS Backup nodes
On 12/13/11 11:00 PM, M. C. Srivas mcsri...@gmail.com wrote:

Suresh, as of today there is no option except to use NFS. And as you yourself mention, the first HA prototype, when it comes out, will require NFS.

How will it 'require' NFS? Won't any 'remote, high-availability storage' work? NFS is unreliable, in my experience, unless:
* it's a NetApp, or
* it's based on Solaris.

(Caveat: I have only used 5 NFS solution types over the last decade, and the issues were not data integrity, but rather availability from a client perspective.)

A solution with a brief 'stall' in service while a SAN mount switches over, or similar with drbd, should be possible and data-safe. If this is being built to truly 'require' NFS, that is no better for me than the current situation, which we manage using OS-level tools for failover that temporarily break clients but restore availability quickly thereafter. Where I would like the most help from Hadoop is in making the failover transparent to clients, not in solving the reliable-storage problem or the failover scenarios that storage and OS vendors already handle.

(a) I wasn't aware that BookKeeper had progressed that far. I wonder whether it would be able to keep up with the data rates required to hold the NN log without falling behind.

(b) I do know Karthik Ranga at FB just started a design to put the NN data in HDFS itself, but that is in very preliminary design stages with no real code there.

The problem is that the HA code written with NFS in mind is very different from the HA code written with HDFS in mind, and both are quite different from the code written with BookKeeper in mind. Essentially the three options will form three different implementations, since the failure modes of each of the back-ends are different. Am I totally off base?

thanks, Srivas.

On Tue, Dec 13, 2011 at 11:00 AM, Suresh Srinivas sur...@hortonworks.com wrote:

Srivas, as you may know already, NFS is just being used in the first prototype for HA. Two options for the editlog store are:
1. Using BookKeeper. Work has already completed on trunk towards this. This will replace the need for NFS to store the editlogs, and is highly available. This solution will also be used for HA.
2. We also have a short-term goal of enabling editlogs going to HDFS itself. The work is in progress.

Regards, Suresh

-- Forwarded message --
From: M. C. Srivas mcsri...@gmail.com
Date: Sun, Dec 11, 2011 at 10:47 PM
Subject: Re: HDFS Backup nodes
To: common-user@hadoop.apache.org

You are out of luck if you don't want to use NFS and yet want redundancy for the NN. Even the new NN HA work being done by the community will require NFS ... and the NFS itself needs to be HA. But if you use a NetApp, then the likelihood of the NetApp crashing is lower than the likelihood of a garbage-collection-of-death happening in the NN. [Disclaimer: I don't work for NetApp, I work for MapR.]

On Wed, Dec 7, 2011 at 4:30 PM, randy randy...@comcast.net wrote:

Thanks Joey. We've had enough problems with NFS (mainly under very high load) that we thought it might be riskier to use it for the NN.

randy

On 12/07/2011 06:46 PM, Joey Echeverria wrote:

Hey Rand,

It will mark that storage directory as failed and ignore it from then on. In order to do this correctly, you need a couple of options enabled on the NFS mount to make sure that it doesn't retry infinitely. I usually run with the tcp,soft,intr,timeo=10,retrans=10 options set.

-Joey

On Wed, Dec 7, 2011 at 12:37 PM, randy...@comcast.net wrote:

What happens then if the NFS server fails or isn't reachable? Does HDFS lock up? Does it gracefully ignore the NFS copy?

Thanks,
randy

- Original Message -
From: Joey Echeverria j...@cloudera.com
To: common-user@hadoop.apache.org
Sent: Wednesday, December 7, 2011 6:07:58 AM
Subject: Re: HDFS Backup nodes

You should also configure the NameNode to use an NFS mount for one of its storage directories. That will give the most up-to-date backup of the metadata in case of total node failure.

-Joey

On Wed, Dec 7, 2011 at 3:17 AM, praveenesh kumar praveen...@gmail.com wrote:

This means we are still relying on the Secondary NameNode ideology for the NameNode's backup. Is OS-mirroring of the NameNode a good alternative to keep it alive all the time?

Thanks,
Praveenesh

On Wed, Dec 7, 2011 at 1:35 PM, Uma Maheswara Rao G mahesw...@huawei.com wrote:

AFAIK, the backup node was introduced from version 0.21 onwards.

From: praveenesh kumar [praveen...@gmail.com]
Sent: Wednesday, December 07, 2011 12:40 PM
To: common-user@hadoop.apache.org
Subject: HDFS Backup nodes

Does hadoop 0.20.205 support configuring HDFS backup nodes?

Thanks,
Praveenesh

-- Joseph
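For reference, the setup Joey describes amounts to a second NameNode storage directory on the NFS mount. A hedged sketch of the hdfs-site.xml piece (paths are illustrative; the property name is the 0.20-era `dfs.name.dir`):

```xml
<!-- hdfs-site.xml: local metadata dir plus an NFS-mounted dir (paths illustrative).
     Mount the NFS export with tcp,soft,intr,timeo=10,retrans=10 so a dead
     NFS server gets marked failed instead of hanging the NameNode. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/mnt/nfs-nn</value>
</property>
```

The NameNode writes its image and edits to every listed directory, so the NFS copy stays as current as the local one.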
Re: More cores Vs More Nodes ?
On 12/14/11 9:05 AM, Michael Segel michael_se...@hotmail.com wrote:

Brian,

I think you missed my point. The moment you go and design a cluster for a specific job, you end up getting fscked, because there's another group who wants to use the shared resource for their job, which could be orthogonal to the original purpose. It happens every day. This is why you have to ask if the cluster is being built for a specific purpose. Meaning: answer the question 'Which of the following best describes your cluster:
a) PoC
b) Development
c) Pre-prod
d) Production
e) Secondary/Backup'

Note that sizing the cluster is a different matter. If you know you need a PB of storage, you're going to design the cluster differently, because once you get to a certain size you have to recognize that your clusters are going to have lots of disk and require 10GbE just for the storage. The number of cores would be less of an issue; however, again, look at pricing. 2-socket 8-core Xeon motherboards are currently at an optimal price point.

Recently, single-socket servers have been 9 to 12 months ahead of the curve on next-generation processor availability. I found a 1-socket quad-core Xeon a better value, because a single-socket 4-core system performs at the CPU level of ~5.5 cores of a dual-socket system, due to the faster GHz and newer-generation processors on the single-socket system -- at least earlier this year. Sandy Bridge is finally moving to dual socket. A single-socket quad-core Xeon at 3.4GHz is much more than half as capable as a dual-socket quad at 2.66GHz. One socket versus two is a moving target.

In our case, we had a $ budget and a low power/rack capacity. We compared what we could get for various designs in terms of:
* aggregate CPU (CPU core count * GHz)
* aggregate memory bandwidth
* aggregate RAM
* aggregate disk capacity
* aggregate network throughput

and chose the single-socket, 1U system based on our constraints and what we could get with a variety of designs (all single-socket or dual-socket, 1U and 2U nodes, 4 to 12 drives per node). We had a range of acceptable storage-to-CPU, CPU-to-RAM, and network-to-storage ratios. With fewer CPUs we had fewer disks and less RAM per machine, but more total servers. This was also influenced by availability concerns -- the more disk per node, the faster your network per node needs to be in order to replicate after a failure. Smaller servers meant a significantly cheaper network, since bonded 1Gb link pairs were good enough. Given various constraints and needs, different organizations will find different sweet spots. And given the hardware available at the time, the sweet spot moves as well.

And again this goes back to the point I was trying to make. You need to look beyond the number of cores as a determining factor. If you go too small, you're going to take a hit because of the price/performance curve. (Remember that you have to consider machine-room real estate: 100 2-core boxes take up much more space than 25 8-core boxes.) If you go to the other extreme, for the price of a 64-core giant SMP box you could (for less money) build out an 8-node cluster. Beyond that, you really, really don't want to build a custom cluster for a specific job unless you know that you're going to be running that specific job or set of jobs 24x7x365. [And yes, I came across such a use case...]

HTH,
-Mike

From: bbock...@cse.unl.edu
Subject: Re: More cores Vs More Nodes ?
Date: Wed, 14 Dec 2011 07:41:25 -0600
To: common-user@hadoop.apache.org

Actually, there are varying degrees here. If you have a successful project, you will find other groups at your door wanting to use the cluster too. Their jobs might be different from the original use case. However, if you don't understand the original use case (CPU-heavy or storage-heavy? is a great beginning question), your original project won't be successful. Then there will be no follow-up users, because you failed. So you want to have a reasonably general-purpose cluster, but make sure it matches well with the type of jobs. As an example, we had one group who required an estimated CPU-millennium per byte of data... they needed a "general purpose" cluster for a certain value of general purpose.

Brian

On Dec 14, 2011, at 7:29 AM, Michael Segel wrote:

Aw Tommy, actually no. You really don't want to do this. If you actually ran a cluster and worked in the real world, you would find that if you purposely build a cluster for one job, there will be a mandate that some other group needs to use the cluster, and that their job has different performance issues, and your cluster is now suboptimal for their jobs... Perhaps you meant that you needed to think about the purpose of the cluster? That is, do you want to minimize the nodes but maximize the disk space per node and use the cluster as your backup cluster? (Assuming that you are considering your DR and BCP in your design.) The problem with your answer is that a job has a specific meaning within
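The single- versus dual-socket comparison earlier in the thread can be sanity-checked with quick aggregate-clock arithmetic, using the example parts from the thread. This is a rough sketch only: cores x GHz ignores memory bandwidth, IPC gains, and everything else the thread weighs.

```shell
# Aggregate clock: 1-socket quad @ 3.4 GHz vs 2-socket dual-quad @ 2.66 GHz
awk 'BEGIN {
  single = 4 * 3.4          # 13.6 aggregate GHz
  dual   = 8 * 2.66         # 21.28 aggregate GHz
  printf "single/dual = %.2f\n", single / dual
}'
# prints: single/dual = 0.64
```

0.64 of the aggregate clock from one socket instead of two supports the "much more than half as capable" claim, before the newer core generation's per-clock gains are even counted.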
Re: Read() block mysteriously when using big BytesPerChecksum size
On Oct 7, 2010, at 2:35 AM, elton sky wrote:

Hello experts,

I was benchmarking sequential write throughput of HDFS. To test the effect of bytesPerChecksum (bpc) size on write performance, I am using different bpc sizes: 2M, 256K, 32K, 4K, 512B. My cluster has 1 name node and 5 data nodes. They are Xen VMs, and each is configured with a 56MB/s duplex ethernet connection. I try to create a 10G file with different bpc. When bpc is 2M, the throughput drops dramatically compared with the others:

    time(ms): 333008  bpc: 2M
    time(ms): 234180  bpc: 256K
    time(ms): 223737  bpc: 32K
    time(ms): 228842  bpc: 4K
    time(ms): 228238  bpc: 512

After digging into the source, I found the problem happens on the data nodes, in org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket():

    private int readNextPacket() throws IOException {
      ...
      while (buf.remaining() < SIZE_OF_INTEGER) {
        if (buf.position() > 0) {
          shiftBufData();
        }
        readToBuf(-1); // this line takes 30ms or more for each packet before it returns
      }
      ...
      while (toRead > 0) { // this loop also takes around 30ms
        toRead -= readToBuf(toRead);
      }
      ...
    }

    private long readToBufTime(int toRead) throws IOException {
      ...
      int nRead = in.read(buf.array(), buf.limit(), toRead); // this is the line that actually causes the delay
      ...
    }

The in.read() takes around 30ms waiting for data before it returns, and when it returns it has read only a few KB of data. The while loop that comes later takes a similar time to finish, reading the rest (2MB minus the few KB read before). I couldn't understand the reason for the pause in in.read(). Why does the data node need to wait? Why is the data not available then?

It is probably waiting on disk or network.

Why does this happen when using a big bpc?

Linux tends to asynchronously 'read-ahead' from disks if sequential access to a file is detected. The default is to read ahead in chunks of up to 128K. You can change this on a per-device level with blockdev --setra (google it). Since Hadoop fetches data in a synchronous loop, it loses the benefit of the OS asynchronous read-ahead past 128K unless you change that setting. I recommend a readahead value of ~2MB for today's SATA drives if you need top sequential access performance from Linux. This would look something like this for 2MB:

    # blockdev --setra 4096 /dev/sda

Any ideas will be appreciated!
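The --setra value is counted in 512-byte sectors, which is where the 4096 in the example above comes from. A small sketch of the conversion (the device name is illustrative, and the blockdev call itself is commented out since it needs root and a real disk):

```shell
# Convert a desired readahead of 2 MB into 512-byte sectors for blockdev --setra.
bytes=$((2 * 1024 * 1024))
sectors=$((bytes / 512))
echo "$sectors"          # prints 4096
# blockdev --setra "$sectors" /dev/sda
```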
Re: Hadoop performance - xfs and ext4
Did you try the XFS 'allocsize' mount parameter (for example, allocsize=8m)? This will reduce fragmentation during concurrent writes. It's more complicated, but using separate partitions for temp space versus HDFS also has an effect; XFS isn't as good with the temp space. In short, a single test with default configurations is useful but doesn't complete the picture. Both file systems have several important tuning knobs.

On Apr 22, 2010, at 1:02 AM, stephen mulcahy wrote:

Hi,

I've been tweaking our cluster roll-out process to refine it. While doing so, I decided to check whether XFS gives any performance benefit over EXT4, as per a comment I read somewhere on the HBase wiki. XFS makes for faster formatting of filesystems: it takes us 5.5 minutes to rebuild a datanode from bare metal to a full Hadoop config on top of Debian Squeeze using XFS, versus 9 minutes for the same bare-metal restore with EXT4.

However, TeraSort performance on a cluster of 45 of these datanodes shows XFS is slower (same configuration settings on both installs other than the changed filesystem). Specifically:

    mkfs.xfs -f -l size=64m DEV    (mounted with noatime,nodiratime,logbufs=8)

gives me a cluster which runs TeraSort in about 23 minutes, while

    mkfs.ext4 -T largefile4 DEV    (mounted with noatime)

gives me a cluster which runs TeraSort in about 18.5 minutes. So I'll be rolling our cluster back to EXT4, but thought the information might be useful/interesting to others.

-stephen

XFS config chosen from notes at http://everything2.com/index.pl?node_id=1479435

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute, NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com
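Combining the mkfs and mount settings from the post with the allocsize suggestion from the reply, one possible fstab line for an XFS data partition would look like this. This is a sketch: the device and mountpoint are illustrative, and allocsize=8m is the reply's suggestion, not something benchmarked in the thread.

```
# /etc/fstab -- XFS HDFS data partition (device and mountpoint illustrative)
/dev/sdb1  /data/1  xfs  noatime,nodiratime,logbufs=8,allocsize=8m  0 0
```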
Re: Hadoop performance - xfs and ext4
Ah, one more thing. With XFS there is an online defragmenter -- it runs every night on my cluster. Performance on a fresh, empty system will not match a used one that has become fragmented.
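The nightly defragmentation mentioned above is typically done with xfs_fsr, the XFS online defragmenter. A sketch of a cron entry (the schedule and runtime cap are illustrative assumptions, not from the post):

```
# /etc/cron.d/xfs-defrag -- run the XFS online defragmenter nightly,
# capped at 2 hours per run (-t takes seconds)
0 3 * * * root /usr/sbin/xfs_fsr -t 7200
```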
Re: Extremely slow HDFS after upgrade
All links check out as full-duplex gigabit with ifconfig and ethtool. ifconfig reports no dropped packets or retransmits. A tcpdump shows no retransmits, but shows significantly smaller packets and fewer outstanding packets between acks. But iperf in UDP mode shows a consistent 1.5% lost-datagram rate in one direction -- I can transmit 900Mbits/sec+ with 1.5% loss by UDP over the flawed links, and 0% loss the other way. So it appears that the Linux TCP flow control is throttling back the window size due to these losses. Time to have someone replace all the network cables.

Thanks for the ideas Todd,
-Scott

On Apr 16, 2010, at 8:25 PM, Todd Lipcon wrote:

Checked link autonegotiation with ethtool? Sometimes GigE will autoneg to 10Mb half duplex if there's a bad cable, NIC, or switch port.

-Todd

On Fri, Apr 16, 2010 at 8:08 PM, Scott Carey sc...@richrelevance.com wrote:

More info -- this is not a Hadoop issue. The network performance issue can be replicated with SSH, only on the links where Hadoop has a problem, and only in the direction with a problem. HDFS is slow to transfer data in certain directions from certain machines. So, for example, copying from node C to D may be slow, but not the other direction, from D to C. Likewise, although only 3 of 8 nodes have this problem, it is not universal. For example, node C might have trouble copying data to 5 of the 7 other nodes, and node G might have trouble with all 7 other nodes. No idea what it is yet, but SSH exhibits the same issue -- only on those specific point-to-point links in one specific direction.

-Scott

On Apr 16, 2010, at 7:10 PM, Scott Carey wrote:

Ok, so here is a ... fun result. I have dfs.replication.min set to 2, so I can't just do

    hadoop fs -Ddfs.replication=1 -put someFile someFile

since that will fail. So here are two results that are fascinating:

    $ time hadoop fs -Ddfs.replication=3 -put test.tar test.tar
    real    1m53.237s
    user    0m1.952s
    sys     0m0.308s

    $ time hadoop fs -Ddfs.replication=2 -put test.tar test.tar
    real    0m1.689s
    user    0m1.763s
    sys     0m0.315s

The file is 77MB and so is two blocks. The test with replication level 3 is slow about 9 out of 10 times; when it is slow it is sometimes 28 seconds, sometimes 2 minutes. It was fast one time... The test with replication level 2 is fast in 40 out of 40 tests. This is a development cluster with 8 nodes. It looks like a replication level of 3 or more causes trouble.

Looking more closely at the logs, it seems that certain datanodes (but not all) cause large delays if they are in the middle of an HDFS write chain. So a write that goes A -> B -> C is fast if B is a good node and C a bad node; if it's A -> C -> B then it's slow. So I can say that some nodes, but not all, are doing something wrong when in the middle of a write chain. If I do a replication=2 write to one of these bad nodes, it's always slow. So the good news is I can identify the bad nodes and decommission them. The bad news is this still doesn't make a lot of sense, and 40% of the nodes have the issue. Worse, on a couple of nodes the behavior in the replication=2 case is not consistent -- sometimes the first block is fast. So it may depend on not just the source, but the source -> target combination in the chain.

At this point, I suspect something completely broken at the network level, perhaps even routing. Why it would show up after an upgrade is yet to be determined, but the upgrade did include some config changes and OS updates.

Thanks Todd!
-Scott

On Apr 16, 2010, at 5:34 PM, Todd Lipcon wrote:

Hey Scott,

This is indeed really strange... if you do a straight hadoop fs -put with dfs.replication set to 1 from one of the DNs, does it upload slowly? That would cut out the network from the equation.

-Todd

On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey sc...@richrelevance.com wrote:

I have two clusters upgraded to CDH2. One is performing fine, and the other is EXTREMELY slow. Some jobs that formerly took 90 seconds take 20 to 50 minutes. It is an HDFS issue from what I can tell. The simple DFS benchmark with one map task shows the problem clearly. I have looked at every difference I can find and am wondering where else to look to track this down.

The disks on all nodes in the cluster check out -- capable of 75MB/sec minimum with a 'dd' write test. top/iostat do not show any significant CPU usage or iowait times on any machine in the cluster during the test. ifconfig does not report any dropped packets or other errors on any machine in the cluster. dmesg has nothing interesting.

The poorly performing cluster is on a slightly newer CentOS version:
Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux (CentOS 5.4, recent patches)
Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux (CentOS 5.3, I think
Extremely slow HDFS after upgrade
I have two clusters upgraded to CDH2. One is performing fine, and the other is EXTREMELY slow. Some jobs that formerly took 90 seconds take 20 to 50 minutes. It is an HDFS issue from what I can tell. The simple DFS benchmark with one map task shows the problem clearly. I have looked at every difference I can find and am wondering where else to look to track this down.

The disks on all nodes in the cluster check out -- capable of 75MB/sec minimum with a 'dd' write test. top/iostat do not show any significant CPU usage or iowait times on any machine in the cluster during the test. ifconfig does not report any dropped packets or other errors on any machine in the cluster. dmesg has nothing interesting.

The poorly performing cluster is on a slightly newer CentOS version:
Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux (CentOS 5.4, recent patches)
Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux (CentOS 5.3, I think)

The performance is always poor, not sporadically poor. It is poor with M/R tasks as well as with non-M/R HDFS clients (i.e. sqoop).

Poor performance cluster (no other jobs active during the test):
---

    $ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write -nrFiles 1 -fileSize 2000
    10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
    10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
    10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 100
    10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000 mega bytes, 1 files
    10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1 files
    10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    10/04/16 12:53:15 INFO mapred.FileInputFormat: Total input paths to process : 1
    10/04/16 12:53:15 INFO mapred.JobClient: Running job: job_201004091928_0391
    10/04/16 12:53:16 INFO mapred.JobClient: map 0% reduce 0%
    10/04/16 13:42:30 INFO mapred.JobClient: map 100% reduce 0%
    10/04/16 13:43:06 INFO mapred.JobClient: map 100% reduce 100%
    10/04/16 13:43:07 INFO mapred.JobClient: Job complete: job_201004091928_0391
    [snip]
    10/04/16 13:43:07 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
    10/04/16 13:43:07 INFO mapred.FileInputFormat:            Date & time: Fri Apr 16 13:43:07 PDT 2010
    10/04/16 13:43:07 INFO mapred.FileInputFormat:        Number of files: 1
    10/04/16 13:43:07 INFO mapred.FileInputFormat: Total MBytes processed: 2000
    10/04/16 13:43:07 INFO mapred.FileInputFormat:      Throughput mb/sec: 0.678296742615553
    10/04/16 13:43:07 INFO mapred.FileInputFormat: Average IO rate mb/sec: 0.6782967448234558
    10/04/16 13:43:07 INFO mapred.FileInputFormat:  IO rate std deviation: 9.568803140552889E-5
    10/04/16 13:43:07 INFO mapred.FileInputFormat:     Test exec time sec: 2992.913

Good performance cluster (other jobs active during the test):
---

    $ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write -nrFiles 1 -fileSize 2000
    10/04/16 12:50:52 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
    TestFDSIO.0.0.4
    10/04/16 12:50:52 INFO mapred.FileInputFormat: nrFiles = 1
    10/04/16 12:50:52 INFO mapred.FileInputFormat: fileSize (MB) = 2000
    10/04/16 12:50:52 INFO mapred.FileInputFormat: bufferSize = 100
    10/04/16 12:50:52 INFO mapred.FileInputFormat: creating control file: 2000 mega bytes, 1 files
    10/04/16 12:50:52 INFO mapred.FileInputFormat: created control files for: 1 files
    10/04/16 12:50:52 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    10/04/16 12:50:53 INFO mapred.FileInputFormat: Total input paths to process : 1
    10/04/16 12:50:54 INFO mapred.JobClient: Running job: job_201003311607_4098
    10/04/16 12:50:55 INFO mapred.JobClient: map 0% reduce 0%
    10/04/16 12:51:22 INFO mapred.JobClient: map 100% reduce 0%
    10/04/16 12:51:32 INFO mapred.JobClient: map 100% reduce 100%
    10/04/16 12:51:32 INFO mapred.JobClient: Job complete: job_201003311607_4098
    [snip]
    10/04/16 12:51:32 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
    10/04/16 12:51:32 INFO mapred.FileInputFormat:            Date & time: Fri Apr 16 12:51:32 PDT 2010
    10/04/16 12:51:32 INFO mapred.FileInputFormat:        Number of files: 1
    10/04/16 12:51:32 INFO mapred.FileInputFormat: Total MBytes processed: 2000
    10/04/16 12:51:32 INFO mapred.FileInputFormat:      Throughput mb/sec: 92.47699634715865
    10/04/16 12:51:32 INFO mapred.FileInputFormat: Average IO rate mb/sec: 92.47699737548828
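To put the two TestDFSIO runs side by side, here is the good cluster's write throughput divided by the poor cluster's, using the numbers from the logs above:

```shell
# Good-cluster vs poor-cluster TestDFSIO write throughput (mb/sec, from the logs)
awk 'BEGIN { printf "%.0fx\n", 92.47699634715865 / 0.678296742615553 }'
# prints: 136x
```

Roughly a 136x gap on the same benchmark, which is why the thread quickly rules out disks and CPU and turns to the network.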
Re: Extremely slow HDFS after upgrade
Ok, so here is a ... fun result. I have dfs.replication.min set to 2, so I can't just do

    hadoop fs -Ddfs.replication=1 -put someFile someFile

since that will fail. So here are two results that are fascinating:

    $ time hadoop fs -Ddfs.replication=3 -put test.tar test.tar
    real    1m53.237s
    user    0m1.952s
    sys     0m0.308s

    $ time hadoop fs -Ddfs.replication=2 -put test.tar test.tar
    real    0m1.689s
    user    0m1.763s
    sys     0m0.315s

The file is 77MB and so is two blocks. The test with replication level 3 is slow about 9 out of 10 times; when it is slow it is sometimes 28 seconds, sometimes 2 minutes. It was fast one time... The test with replication level 2 is fast in 40 out of 40 tests. This is a development cluster with 8 nodes. It looks like a replication level of 3 or more causes trouble.

Looking more closely at the logs, it seems that certain datanodes (but not all) cause large delays if they are in the middle of an HDFS write chain. So a write that goes A -> B -> C is fast if B is a good node and C a bad node; if it's A -> C -> B then it's slow. So I can say that some nodes, but not all, are doing something wrong when in the middle of a write chain. If I do a replication=2 write to one of these bad nodes, it's always slow. So the good news is I can identify the bad nodes and decommission them. The bad news is this still doesn't make a lot of sense, and 40% of the nodes have the issue. Worse, on a couple of nodes the behavior in the replication=2 case is not consistent -- sometimes the first block is fast. So it may depend on not just the source, but the source -> target combination in the chain.

At this point, I suspect something completely broken at the network level, perhaps even routing. Why it would show up after an upgrade is yet to be determined, but the upgrade did include some config changes and OS updates.

Thanks Todd!
-Scott

On Apr 16, 2010, at 5:34 PM, Todd Lipcon wrote:

Hey Scott,

This is indeed really strange... if you do a straight hadoop fs -put with dfs.replication set to 1 from one of the DNs, does it upload slowly? That would cut out the network from the equation.

-Todd

On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey sc...@richrelevance.com wrote:

I have two clusters upgraded to CDH2. One is performing fine, and the other is EXTREMELY slow. Some jobs that formerly took 90 seconds take 20 to 50 minutes. It is an HDFS issue from what I can tell. The simple DFS benchmark with one map task shows the problem clearly. I have looked at every difference I can find and am wondering where else to look to track this down.

The disks on all nodes in the cluster check out -- capable of 75MB/sec minimum with a 'dd' write test. top/iostat do not show any significant CPU usage or iowait times on any machine in the cluster during the test. ifconfig does not report any dropped packets or other errors on any machine in the cluster. dmesg has nothing interesting.

The poorly performing cluster is on a slightly newer CentOS version:
Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux (CentOS 5.4, recent patches)
Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux (CentOS 5.3, I think)

The performance is always poor, not sporadically poor. It is poor with M/R tasks as well as with non-M/R HDFS clients (i.e. sqoop).
Poor performance cluster (no other jobs active during the test):
---
$ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write -nrFiles 1 -fileSize 2000
10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 100
10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000 mega bytes, 1 files
10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1 files
10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/16 12:53:15 INFO mapred.FileInputFormat: Total input paths to process : 1
10/04/16 12:53:15 INFO mapred.JobClient: Running job: job_201004091928_0391
10/04/16 12:53:16 INFO mapred.JobClient: map 0% reduce 0%
10/04/16 13:42:30 INFO mapred.JobClient: map 100% reduce 0%
10/04/16 13:43:06 INFO mapred.JobClient: map 100% reduce 100%
10/04/16 13:43:07 INFO mapred.JobClient: Job complete: job_201004091928_0391
[snip]
10/04/16 13:43:07 INFO mapred.FileInputFormat: - TestDFSIO - : write
10/04/16 13:43:07 INFO mapred.FileInputFormat: Date & time: Fri Apr 16 13:43:07 PDT 2010
10/04/16 13:43:07 INFO mapred.FileInputFormat: Number of files: 1
10/04/16 13:43:07 INFO mapred.FileInputFormat: Total MBytes processed: 2000
10/04/16 13:43:07 INFO mapred.FileInputFormat: Throughput
Re: Extremely slow HDFS after upgrade
More info -- this is not a Hadoop issue. The network performance issue can be replicated with SSH alone on the links where Hadoop has a problem, and only in the direction with a problem. HDFS is slow to transfer data in certain directions from certain machines. So, for example, copying from node C to D may be slow, but not in the other direction from D to C. Likewise, although only 3 of 8 nodes have this problem, it is not uniform. For example, node C might have trouble copying data to 5 of the 7 other nodes, and node G might have trouble with all 7 other nodes. No idea what it is yet, but SSH exhibits the same issue -- only on those specific point-to-point links, in one specific direction.

-Scott

On Apr 16, 2010, at 7:10 PM, Scott Carey wrote:
[snip]
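The per-link, per-direction slowness described above can be probed outside of Hadoop entirely. A minimal sketch of such a probe (the node names and the 64MB probe size are illustrative assumptions, not from the thread) that times a one-way push over ssh, the same way the SSH reproduction was done:

```shell
#!/bin/bash
# Integer MB/s for a transfer of $1 bytes in $2 seconds.
rate_mb_s() {
  awk -v b="$1" -v s="$2" 'BEGIN { printf "%d", b / s / 1000000 }'
}

# Push 64MB of zeros to $1 over ssh and report the one-way rate.
# Run from each node against each other node to map out bad links;
# reverse source and destination to test the other direction.
probe_link() {
  dest="$1"
  start=$(date +%s)
  dd if=/dev/zero bs=1M count=64 2>/dev/null | ssh "$dest" 'cat > /dev/null'
  elapsed=$(( $(date +%s) - start ))
  [ "$elapsed" -lt 1 ] && elapsed=1
  echo "$dest: $(rate_mb_s 67108864 "$elapsed") MB/s"
}

# Example (hypothetical node names):
# for n in node-a node-b node-c; do probe_link "$n"; done
```

For scale: the fast 77MB put above works out to rate_mb_s 77000000 1.689, about 45 MB/s, versus well under 1 MB/s for the 1m53s run.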
Re: swapping on hadoop
On Apr 1, 2010, at 5:04 PM, Vasilis Liaskovitis wrote:

ok. Now, considering a map-side space buffer and a sort buffer, do both account for tenured space for both map and reduce JVMs? I'd think the map-side buffer gets used and tenured for map tasks, and the sort space gets used and tenured for the reduce task during the sort/merge phase. Would both spaces really be used in both kinds of tasks?

It is my understanding that a JVM used for a map won't also be used for a reduce. JVM reuse runs multiple maps or reduces in one process, but not across both. The mapper does the majority of the sorting; the reducer mostly merges pre-sorted data. Each kind of task tends to have a different memory footprint, dependent on the job and data.

The maximum number of map and reduce tasks per node applies no matter how many jobs are running.

Right. But depending on your job scheduler, isn't it possible that you may be swapping the different jobs' JVM space in and out of physical memory while scheduling all the parallel jobs? Especially if nodes don't have huge amounts of memory, this scenario sounds likely.

To be more precise, the max number of map and reduce tasks corresponds to the maximum number of active JVMs of each type at the same time. When a job finishes all of its tasks, the JVMs for it are killed. A new job gets new JVMs. Running concurrent jobs means that each job has some fraction of these JVM slots occupied. So, there should be no swapping of different jobs' JVMs in and out of RAM. The same number of active JVMs exists for one large job as for 4 concurrent jobs.

Back to a single job running and assuming all heap space is being used: what percentage of a node's memory would you leave for other functions, esp. disk cache? I currently only have 25% of memory (~4GB) for non-heap JVM data; I guess there should be a sweet spot, probably dependent on the job's I/O characteristics.

It will depend on the job, its I/O, and the OS tuning.
But 25% to 33% of memory for the system file cache has worked for me (remember, the nodes aren't just for tasks, but also for HDFS). A small amount of swap-out isn't bad, since the task JVMs exit and the swapped pages never need to come back in.

- Vasilis
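The 25% to 33% guideline above is just arithmetic over the slot counts and per-task heap. A sketch with assumed example numbers (the 16GB node and slot counts are illustrative, not from the thread):

```shell
#!/bin/bash
# Percent of node RAM left for the OS file cache after all task JVM
# heaps are allocated. Args: total_ram_mb map_slots reduce_slots heap_mb
cache_pct_left() {
  local total=$1 maps=$2 reds=$3 heap=$4
  echo $(( (total - (maps + reds) * heap) * 100 / total ))
}

# e.g. a 16GB node with 8 map slots, 4 reduce slots, -Xmx1024m per task:
cache_pct_left 16384 8 4 1024   # prints 25 -- 25% left for the file cache
```

Note this ignores the JVMs' non-heap overhead and the datanode/tasktracker daemons themselves, so the real cache share is a bit smaller.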
Re: OutOfMemoryError: Cannot create GC thread. Out of system resources
The default size of Java's young GC generation is 1/3 of the heap (-XX:NewRatio defaults to 2). You have told it to use 100MB for the in-memory file system. There is a default setting of 64MB of sort space. If -Xmx is 128M, then the above sums to over 200MB and won't fit. Turning down any of the three above could help, or increasing -Xmx. Additionally, when a thread can't be allocated, it could potentially be due to an OS-side limit on threads or file handles per process or per user.

On Mar 31, 2010, at 11:48 AM, Edson Ramiro wrote:

Hi all,

When I run the pi Hadoop sample I get this error:

10/03/31 15:46:13 WARN mapred.JobClient: Error reading task output http://h04.ctinfra.ufpr.br:50060/tasklog?plaintext=true&taskid=attempt_201003311545_0001_r_02_0&filter=stdout
10/03/31 15:46:13 WARN mapred.JobClient: Error reading task output http://h04.ctinfra.ufpr.br:50060/tasklog?plaintext=true&taskid=attempt_201003311545_0001_r_02_0&filter=stderr
10/03/31 15:46:20 INFO mapred.JobClient: Task Id : attempt_201003311545_0001_m_06_1, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 134.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

Maybe it's because the datanode can't create more threads.

ram...@lcpad:~/hadoop-0.20.2$ cat logs/userlogs/attempt_201003311457_0001_r_01_2/stdout
#
# A fatal error has been detected by the Java Runtime Environment:
#
# java.lang.OutOfMemoryError: Cannot create GC thread. Out of system resources.
#
# Internal Error (gcTaskThread.cpp:38), pid=28840, tid=140010745776400
# Error: Cannot create GC thread. Out of system resources.
#
# JRE version: 6.0_17-b04
# Java VM: Java HotSpot(TM) 64-Bit Server VM (14.3-b01 mixed mode linux-amd64)
# An error report file with more information is saved as:
# /var-host/tmp/hadoop-ramiro/mapred/local/taskTracker/jobcache/job_201003311457_0001/attempt_201003311457_0001_r_01_2/work/hs_err_pid28840.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
#

I configured the limits below, but I'm still getting the same error.

<property>
  <name>fs.inmemory.size.mb</name>
  <value>100</value>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx128M</value>
</property>

Do you know what limit I should configure to fix it?

Thanks in Advance

Edson Ramiro
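The sizing argument in the reply above can be checked mechanically. A sketch (the 1/3 young-generation rule and the 64MB sort default come from the reply; the helper itself is illustrative):

```shell
#!/bin/bash
# Does young gen (~xmx/3) plus fs.inmemory.size.mb plus the sort buffer
# fit inside -Xmx? Prints "over" or "ok". Args: xmx_mb inmem_mb sort_mb
heap_fits() {
  local xmx=$1 inmem=$2 sort=$3
  if [ $(( xmx / 3 + inmem + sort )) -gt "$xmx" ]; then
    echo over
  else
    echo ok
  fi
}

heap_fits 128 100 64   # the configuration in the question: prints "over"
heap_fits 512 100 64   # a roomier -Xmx512M: prints "ok"
```

If the heap math checks out and the error persists, also look at the per-user process/thread ceiling (`ulimit -u`), since "Cannot create GC thread" can come from that OS-side limit as well.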
Re: swapping on hadoop
On Apr 1, 2010, at 8:38 AM, Vasilis Liaskovitis wrote:

In this example, what hadoop config parameters do the above 2 buffers refer to? io.sort.mb=250, but which parameter does the map-side join 100MB refer to? Are you referring to the split size of the input data handled by a single map task? Apart from that question, the example is clear to me and useful, thanks.

A map-side join is just an example of one of many possible use cases where a particular map implementation may hold on to some semi-permanent data for the whole task. It could be anything that takes 100MB of heap and holds the data across individual calls to map().

Quoting Allen: Java takes more RAM than just the heap size. Sometimes 2-3x as much.

Is there a clear indication that Java memory usage extends so far beyond its allocated heap? E.g. would Java thread stacks really account for such a big increase, 2x to 3x? Tasks seem to be heavily threaded. What are the relevant config options to control the number of threads within a task?

Java typically uses 5MB to 60MB for classloader data (statics, classes) plus some space for threads, etc. The default thread stack on most OS's is about 1MB, and the number of threads in a task process is on the order of a dozen. Getting 2-3x the space in a Java process outside the heap would require either a huge thread count, a large native library loaded, or perhaps a non-Java hadoop job using pipes. It would be rather obvious in 'top' if you sort by memory (shift-M on Linux), or in vmstat, etc. To get the current size of the heap of a process, you can use jstat, or 'kill -3' to create a stack dump and heap summary.

With this new setup, I don't normally get swapping for a single job, e.g. a terasort or hive job. However, the problem in general is exacerbated if one spawns multiple independent hadoop jobs simultaneously.
I've noticed that JVMs are not re-used across jobs, in an earlier post: http://www.mail-archive.com/common-...@hadoop.apache.org/msg01174.html This implies that Java memory usage would blow up when submitting multiple independent jobs. So this multiple-job scenario sounds more susceptible to swapping.

The maximum number of map and reduce tasks per node applies no matter how many jobs are running.

A relevant question is: in production environments, do people run jobs in parallel? Or is the majority of jobs a serial pipeline / cascade of jobs run back to back?

Jobs are absolutely run in parallel. I recommend using the fair scheduler with no config parameters other than 'assignmultiple = true' as the 'baseline' scheduler, and adjusting from there accordingly. The Capacity Scheduler has more tuning knobs for dealing with memory constraints if jobs have drastically different memory needs. The out-of-the-box FIFO scheduler tends to have a hard time keeping cluster utilization high when there are multiple jobs to run.

thanks,
- Vasilis
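The off-heap estimate earlier in this thread (classloader data plus roughly 1MB of stack per thread) can be ballparked and then compared against what 'top' reports. A sketch with assumed example numbers:

```shell
#!/bin/bash
# Estimated JVM memory outside the heap:
# threads * stack_mb + classloader_mb. Args: threads stack_mb class_mb
offheap_mb() {
  local threads=$1 stack_mb=$2 class_mb=$3
  echo $(( threads * stack_mb + class_mb ))
}

offheap_mb 12 1 50   # a dozen 1MB stacks + ~50MB of class data: prints 62

# Compare against the resident size (KB) of a real process -- here this
# shell itself -- the same number 'top' shows in its RES column:
ps -o rss= -p $$ 2>/dev/null
```

If RES minus -Xmx is far larger than an estimate like this, suspect a native library or an unusually large thread count, per the reply above.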
Re: swapping on hadoop
On Linux, check out the 'swappiness' OS tunable -- you can turn this down from the default to reduce swapping at the expense of some system file cache. However, you want a decent chunk of RAM left for the system to cache files -- if it is all allocated and used by Hadoop, there will be extra I/O.

For Java GC, if your -Xmx is above 600MB or so, try either changing -XX:NewRatio to a smaller number (the default is 2 for Sun JDK 6) or setting the -XX:MaxNewSize parameter to around 150MB to 250MB.

An example of Hadoop memory use scaling as -Xmx grows: Let's say you have a Hadoop job with a 100MB map-side join, and 250MB of hadoop sort space. Both of these chunks of data will eventually get pushed to the tenured generation. So, the actual heap required will end up close to: (size of young generation) + 100MB + 250MB + misc. The default size of the young generation is 1/3 of the heap. So, at -Xmx750M this job will probably use a minimum of 600MB of Java heap, plus about 50MB non-heap if this is a pure Java job.

Now, perhaps due to some other jobs, you want to set -Xmx1200M. The above job will end up using about 150MB more now, because the new space has grown, although the footprint is the same. A larger new space can improve performance, but with most typical hadoop jobs it won't. Making sure it does not grow larger just because -Xmx is larger can save a lot of memory. Additionally, a job that would have failed with an OOME at -Xmx1200M might pass at -Xmx1000M if the young generation takes 150MB instead of 400MB of the space.

If you are using a 64-bit JRE, you can also save space with the -XX:+UseCompressedOops option -- sometimes quite a bit of space.

On Mar 30, 2010, at 10:15 AM, Vasilis Liaskovitis wrote:

Hi all,

I've noticed swapping for a single terasort job on a small 8-node cluster using hadoop-0.20.1. The swapping doesn't happen repeatably; I can have back-to-back runs of the same job from the same hdfs input data and get swapping on only 1 out of 4 identical runs.
I've noticed this swapping behaviour on both terasort jobs and hive query jobs.

- Focusing on a single job config: is there a rule of thumb about how much node memory should be left for use outside of child JVMs? I make sure that per node there is free memory:

(#maxMapTasksPerTaskTracker + #maxReduceTasksPerTaskTracker) * JVMHeapSize < PhysicalMemoryOnNode

The total JVM heap size per node per job from the above equation currently accounts for 65%-75% of the node's memory. (I've tried allocating a riskier 90% of the node's memory, with similar swapping observations).

- Could there be an issue with HDFS data or metadata taking up memory? I am not cleaning output or intermediate outputs from HDFS between runs. Is this possible?

- Do people use any specific Java flags (particularly garbage collection flags) for production environments where one job runs (or possibly more jobs run simultaneously)?

- What are the memory requirements for the jobtracker/namenode and tasktracker/datanode JVMs?

- I am setting io.sort.mb to about half of the JVM heap size (half of -Xmx in javaopts). Should this be set to a different ratio? (This setting doesn't sound like it should be causing swapping in the first place.)

- The buffer cache is cleaned before each run (flush and echo 3 > /proc/sys/vm/drop_caches)

Any empirical advice and suggestions to solve this are appreciated.

thanks,
- Vasilis
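The young-generation arithmetic from the reply above can be captured in a helper; per that reply, with Sun JDK 6 defaults the young generation is heap / (NewRatio + 1). A small sketch:

```shell
#!/bin/bash
# Young generation size implied by -Xmx and -XX:NewRatio.
# Sun JDK 6 default NewRatio=2, i.e. young = 1/3 of the heap.
young_mb() {
  local xmx=$1 newratio=$2
  echo $(( xmx / (newratio + 1) ))
}

young_mb 750 2    # 250MB young; + 100MB join + 250MB sort ~= 600MB minimum
young_mb 1200 2   # 400MB young at -Xmx1200M, the growth described above
```

Capping -XX:MaxNewSize at 150MB to 250MB, as suggested in the reply, keeps that first term from scaling up just because -Xmx does.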
Re: where does jobtracker get the IP and port of namenode?
On Mar 8, 2010, at 11:38 PM, jiang licht wrote:

I guess my confusion is this: I point fs.default.name to hdfs://A:50001 in core-site.xml (A is an IP address). I assume when the tasktracker starts, it should use A:50001 to contact the namenode. But actually, the tasktracker log shows that it uses B, which is the IP address of another network interface on the namenode box, and because the tasktracker box cannot reach address B, the tasktracker simply retries the connection and finally fails to start. I read some source code in org.apache.hadoop.hdfs.DistributedFileSystem.initialize, and it seems to me the namenode address is passed in earlier from what is specified in fs.default.name. Is it correct that the namenode address used here by the tasktracker comes from fs.default.name in core-site.xml, or is there another step in which this value is changed? Could someone elaborate on how the tasktracker resolves the namenode and contacts it? Thanks!

Hadoop is rather annoyingly strict on how DNS and reverse DNS are aligned. I'm not sure if it applies to your specific problem, but: even if configured to talk to A, if A is an IP address, in some places it will reverse-DNS that IP, then DNS-resolve the resulting name. So if IP A maps by reverse DNS (via DNS or a hosts file or whatever) to name FOO, and FOO resolves to IP address B, then that is likely your problem. Datanodes/namenodes with multiple IP addresses often have problems like this. I wish that if you configured it to 'talk to IP address A' all it did was try to talk to IP address A, but that's not how it works. I'm used to seeing this as a datanode network configuration problem, not a namenode problem. But you mention that the server has more than one network interface, so it may be related.

Thanks,
Michael

--- On Tue, 3/9/10, jiang licht licht_ji...@yahoo.com wrote:

From: jiang licht licht_ji...@yahoo.com
Subject: Re: where does jobtracker get the IP and port of namenode?
To: common-user@hadoop.apache.org
Date: Tuesday, March 9, 2010, 12:20 AM

Sorry, that was a typo in my first post. I did use 'fs.default.name' in core-site.xml. BTW, the following is the list of error messages when the tasktracker was started; it shows that the tasktracker failed to connect to namenode A:50001.

/
STARTUP_MSG: Starting TaskTracker
STARTUP_MSG: host = HOSTNAME/127.0.0.1
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.1+169.56
STARTUP_MSG: build = -r 8e662cb065be1c4bc61c55e6bff161e09c1d36f3; compiled by 'root' on Tue Feb 9 13:40:08 EST 2010
/
2010-03-09 00:08:50,199 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2010-03-09 00:08:50,341 INFO org.apache.hadoop.http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 50060
2010-03-09 00:08:50,350 INFO org.apache.hadoop.http.HttpServer: listener.getLocalPort() returned 50060 webServer.getConnectors()[0].getLocalPort() returned 50060
2010-03-09 00:08:50,350 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50060
2010-03-09 00:08:50,350 INFO org.mortbay.log: jetty-6.1.14
2010-03-09 00:08:50,707 INFO org.mortbay.log: Started selectchannelconnec...@0.0.0.0:50060
2010-03-09 00:08:50,734 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=TaskTracker, sessionId=
2010-03-09 00:08:50,749 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=TaskTracker, port=52550
2010-03-09 00:08:50,799 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2010-03-09 00:08:50,800 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 52550: starting
2010-03-09 00:08:50,800 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 52550: starting
2010-03-09 00:08:50,800 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 52550: starting
2010-03-09 00:08:50,801 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 52550: starting
2010-03-09 00:08:50,801 INFO org.apache.hadoop.mapred.TaskTracker: TaskTracker up at: HOSTNAME/127.0.0.1:52550
2010-03-09 00:08:50,801 INFO org.apache.hadoop.mapred.TaskTracker: Starting tracker tracker_HOSTNAME:HOSTNAME/127.0.0.1:52550
2010-03-09 00:08:50,802 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 52550: starting
2010-03-09 00:08:50,854 INFO org.apache.hadoop.mapred.TaskTracker: Using MemoryCalculatorPlugin : org.apache.hadoop.util.linuxmemorycalculatorplu...@27b4c1d7
2010-03-09 00:08:50,856 INFO org.apache.hadoop.mapred.TaskTracker: Starting thread: Map-events fetcher for all reduce tasks on tracker_HOSTNAME:HOSTNAME/127.0.0.1:52550
2010-03-09 00:08:50,858 WARN org.apache.hadoop.mapred.TaskTracker: TaskTracker's
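The reverse-then-forward resolution trap described above can be checked directly on any box. A sketch using getent (so /etc/hosts and DNS are consulted under the usual glibc/NSS rules; this approximates, but is not, Hadoop's actual code path):

```shell
#!/bin/bash
# Reverse-resolve an IP to a name, forward-resolve that name, and flag
# a mismatch -- the multi-NIC failure mode described above.
check_roundtrip() {
  local ip=$1 name back
  name=$(getent hosts "$ip" | awk '{print $2; exit}')
  if [ -z "$name" ]; then echo "no-reverse"; return; fi
  back=$(getent hosts "$name" | awk '{print $1; exit}')
  if [ "$back" = "$ip" ]; then
    echo "ok"
  else
    echo "mismatch: $ip -> $name -> $back"
  fi
}

# Run this on the tasktracker for the namenode's configured IP:
check_roundtrip 127.0.0.1
```

A "mismatch" result for the namenode's address A, resolving to some name that comes back as B, would match the failure in the log above.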
Re: Pipelining data from map to reduce
Interesting article. It claims to have the same fault tolerance, but I don't see any explanation of how that can be. If a single mapper fails part-way through a task when it has transmitted partial results to a reducer, the whole job is corrupted. With the current barrier between map and reduce, a job can recover from partially completed tasks and speculatively execute. I would imagine that small, low-latency tasks can benefit greatly from such an approach, but larger tasks need the barrier or they will not be very fault tolerant. However, there are still a lot of optimizations to do in Hadoop for low-latency tasks while maintaining the barrier.

On Mar 4, 2010, at 2:18 PM, Jeff Hammerbacher wrote:

Also see Breaking the MapReduce Stage Barrier from UIUC: http://www.ideals.illinois.edu/bitstream/handle/2142/14819/breaking.pdf

On Thu, Mar 4, 2010 at 11:41 AM, Ashutosh Chauhan ashutosh.chau...@gmail.com wrote:

Bharath, This idea is kicking around in academia.. not made into apache yet.. https://issues.apache.org/jira/browse/MAPREDUCE-1211 You can get a working prototype from: http://code.google.com/p/hop/

Ashutosh

On Thu, Mar 4, 2010 at 09:06, E. Sammer e...@lifeless.net wrote:

On 3/4/10 12:00 PM, bharath v wrote:

Hi, Can we pipeline the map output directly into the reduce phase without storing it in the local filesystem (avoiding disk I/O)? If yes, how do we do that?

Bharath: No, there's no way to avoid going to disk after the mappers.
--
Eric Sammer
e...@lifeless.net
http://esammer.blogspot.com
Re: Sun JVM 1.6.0u18
On Mar 1, 2010, at 10:46 AM, Allen Wittenauer wrote:

On 3/1/10 7:24 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

u14 added support for the 64-bit compressed memory pointers, which seemed important due to the fact that hadoop can be memory hungry. u15 has been stable in our deployments. Not saying you should not go newer, but I would not go older than u14.

How are the compressed memory pointers working for you? I've been debating turning them on here, so real-world experience would be useful from those that have taken the plunge.

Been using it since it came out, both for Hadoop where needed and in many other applications. Performance gains and memory reduction in most places -- sometimes rather significant (25%). GC times are significantly lower for any heap that is reference-heavy. Heaps are still a little larger than a 32-bit one, but the benefits of native 64-bit code on x86 include improved computational performance as well. 6u18 introduces some performance enhancements to the feature that we might be able to use if 6u19 fixes the other bugs. The next Hotspot version will make it the default setting, whenever that gets integrated and tested into the JDK6 line. 6u14 and 6u18 are the last two JDK releases with updated Hotspot versions.
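For reference, turning the flag on cluster-wide is a one-line change. A sketch for hadoop-env.sh (HADOOP_OPTS is the stock hook for daemon JVM flags; task JVMs would instead pick it up from mapred.child.java.opts, and the flag only does anything on a 64-bit 6u14+ JRE):

```shell
# hadoop-env.sh sketch: enable compressed oops for the Hadoop daemons.
export HADOOP_OPTS="$HADOOP_OPTS -XX:+UseCompressedOops"
```

Roll it out to one node first and compare heap usage and GC times before and after, per the experiences described above.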
Re: Sun JVM 1.6.0u18
On Feb 15, 2010, at 9:54 PM, Todd Lipcon wrote:

Hey all,

Just a note that you should avoid upgrading your clusters to 1.6.0u18. We've seen a lot of segfaults or bus errors on the DN when running with this JVM - Stack found the same thing on one of his clusters as well.

Have you seen this for 32-bit, 64-bit, or both? If 64-bit, was it with -XX:+UseCompressedOops? Any idea if there are Sun bugs open for the crashes? I have found some notes suggesting that -XX:-ReduceInitialCardMarks will work around some known crash problems with 6u18, but that may be unrelated. Lastly, I assume that Java 6u17 should work the same as 6u16, since it is a minor patch over 6u16, whereas 6u18 includes a new version of Hotspot. Can anyone confirm that?

We've found 1.6.0u16 to be very stable.

-Todd
Re: hadoop idle time on terasort
On 12/8/09 1:24 PM, Vasilis Liaskovitis vlias...@gmail.com wrote:

Hi Scott, thanks for the extra tips, these are very helpful. I think the slots are being highly utilized, but I seem to have forgotten which option in the web UI allows you to look at the slot allocations during runtime on each tasktracker. Is this available in the job details from the jobtracker's UI or somewhere else?

I usually just look at the jobtracker's main page, which lists the total number of slots for the cluster and how many are in use: Maps, Reduces, Map Task Capacity, Reduce Task Capacity. I usually don't drill down to a per-tasktracker basis. I also look at the running jobs on the same main page and see the allocation of tasks between them. Clicking down to an individual job lets you see the tasks of each job and view their logs. That is usually sufficient (combined with top, mpstat, iostat, etc.) to see how the scheduling is working and correlate tasks or phases of hadoop with certain system behavior.

Running the fair scheduler -- or any scheduler -- that can be configured to schedule more than one task per heartbeat can dramatically increase your slot utilization if it is low.

I've been running a single job - would the scheduler benefits show up with multiple jobs only, by definition? I am now trying multiple simultaneous sort jobs with smaller disjoint datasets launched in parallel by the same user. Do I need to set up any fairscheduler parameters other than the below?

It can help with only one job as well, due to the 'assignmultiple' feature. Often, the default scheduler will assign one task per tasktracker ping. This is not enough to keep up, or to ramp up quickly at the start of a job. This is something you can see in the UI. If tasks are completing faster than the scheduler can assign new ones, the slot utilization will be low.
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.assignmultiple</name>
  <value>true</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/home/user2/hadoop-0.20.1/conf/fair-scheduler.xml</value>
</property>

fair-scheduler.xml is:

<?xml version="1.0"?>
<allocations>
  <pool name="user2">
    <minMaps>12</minMaps>
    <minReduces>12</minReduces>
    <maxRunningJobs>16</maxRunningJobs>
    <weight>4</weight>
  </pool>
  <userMaxJobsDefault>16</userMaxJobsDefault>
</allocations>

With 4 parallel sort jobs, I am noticing that maps execute in parallel across all jobs. But reduce tasks are only allocated/executed from a single job at a time, until that job finishes. Is that expected, or am I missing something in my fairscheduler (or other) settings?

The config seems fine. The reduce tasks all belonging to one job seems a bit odd to me; usually reduce tasks are distributed amongst jobs as well as map tasks. However, there is another parameter that defines how far along the map phase of a job has to be before it begins scheduling reduces. That, combined with how many total reduce slots you have on your cluster and how many each job is trying to create, might create the behavior you are seeing.

I don't think I set the minMaps or minReduces and don't know the defaults. Maybe that is affecting it?

See this ticket: http://issues.apache.org/jira/browse/MAPREDUCE-318 and my comment from June 10 2009.

For a single big sort job, I have ~2300 maps and 84 reduces on a 7-node cluster with 12-core nodes. The thread dumps for my unpatched version also show sleeping threads at fetchOutputs() - I don't know how often you've seen it in your own task dumps. From what I understand, what we are looking for in the reduce logs, in terms of a shuffle idle bottleneck, is the time elapsed between the shuffling lines and the in-memory merge complete lines; does that sound right?
With the one-line change in fetchOutputs(), I've sometimes seen the average shuffle time across tasks go down by ~5-10%, and so has the execution time. But the results are variable across runs; I need to look at the reduce logs and repeat the experiment.

For something like sort, which does not reduce the size of the data prior to a reduce in a combiner or other filter, I would not expect the same massive gains that I saw in that ticket. I built a close to worst-case example, where each reducer had to fetch 20+ map fragments from each node and the fetch time for each chunk is a few milliseconds. In addition, the remainder of the reduce work there is trivial; in your case there is a lot of merging and spilling to do. This will make the logs harder to interpret -- there's more going on. In your 7-node case, with 2300 maps, the default behavior would cause a delay of a couple seconds * (2300/7). Is that roughly the 10% you see?

From your system description in the ticket-318 notes, I see you have configured your cluster to do 10 concurrent shuffle copies. Does this refer to
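The back-of-envelope delay above -- "a couple seconds * (2300/7)" -- generalizes to a simple floor. A sketch (the 2-second fetch interval is the default behavior described in MAPREDUCE-318; the helper itself is illustrative):

```shell
#!/bin/bash
# Lower bound on shuffle time per reducer when it fetches at most one
# map output from a given node every $3 seconds.
# Args: total_maps nodes fetch_interval_secs
shuffle_floor_secs() {
  local maps=$1 nodes=$2 interval=$3
  echo $(( maps / nodes * interval ))
}

shuffle_floor_secs 2300 7 2   # prints 656 -- ~11 minutes of built-in delay
```

If the measured shuffle phase is in that neighborhood, the artificial throttle, rather than the network, is the bottleneck.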
Re: hadoop idle time on terasort
On 12/2/09 12:22 PM, Vasilis Liaskovitis vlias...@gmail.com wrote:

Hi, I am using hadoop-0.20.1 to run terasort and randsort benchmarking tests on a small 8-node linux cluster. Most runs consist of usually low (50%) core utilization in the map and reduce phases, as well as heavy I/O phases. There is usually a large fraction of runtime during which cores are idling and disk I/O traffic is not heavy. On average for the duration of a terasort run I get 20-30% cpu utilization, 10-30% iowait times, and the remaining 40-70% is idle time. This is data collected with mpstat for the duration of the run across the cores of a specific node. This utilization behaviour holds symmetrically for all tasktracker/data nodes. (The namenode cores and I/O are mostly idle, so there doesn't seem to be a bottleneck in the namenode.)

Look at individual task logs for map and reduce through the UI. Also, look at the cluster utilization during a run -- are most map and reduce slots full most of the time, or is the slot utilization low? Running the fair scheduler -- or any scheduler -- that can be configured to schedule more than one task per heartbeat can dramatically increase your slot utilization if it is low.

Next, if you find that your delay times correspond with the shuffle phase (look in the reduce logs), there are fixes in 0.21 for that on the way, but there is a quick win: a one-line change that cuts shuffle times down a lot on clusters that have a large ratio of map tasks per node, if the map output is not too large. For a pure sort test, the map outputs are medium sized (the same size as the input), so this might not help. But the indicators of the issue are in the reduce task logs. See this ticket: http://issues.apache.org/jira/browse/MAPREDUCE-318 and my comment from June 10 2009.

My summary is that the hadoop scheduling process has so far not been tuned for servers that can run more than 6 or so tasks at once. A server capable of running 12 maps is especially prone to running under-utilized.
Many changes in the 0.21 timeframe address some of this.

thanks,
- Vasilis
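The fair-scheduler suggestion above would look something like the following in hadoop-site.xml. This is a sketch: the property names are from the 0.20-era Fair Scheduler contrib module and should be checked against the documentation for your specific version.

```xml
<!-- Sketch only: 0.20-era Fair Scheduler configuration. Verify the
     property names against your distribution's docs before relying
     on them. -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <!-- Allow the scheduler to assign more than one task per heartbeat,
       which is the behavior recommended in the thread above. -->
  <name>mapred.fairscheduler.assignmultiple</name>
  <value>true</value>
</property>
```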
Re: Optimization of cpu and i/o usage / other bottlenecks?
What Hadoop version? On a cluster this size there are two things to check right away:

1. In the Hadoop UI, during the job, are the reduce and map slots close to being filled up most of the time, or are tasks completing faster than the scheduler can keep up, so that there are often many empty slots? For 0.19.x and 0.20.x on a small cluster like this, use the Fair Scheduler and make sure the configuration parameter that allows it to schedule more than one task per heartbeat is on (at least one map and one reduce per heartbeat, which is supported in 0.19.x). This alone will cut times down if the number of map and reduce tasks is at least 2x the number of nodes.

2. If your CPU, disk, and network aren't saturated, take a look at the logs of the reduce tasks and look for long delays in the shuffle. Utilization is throttled by a bug in the reducer shuffle phase, not fixed until 0.21. Simply put, a single reduce task won't fetch more than one map output from another node every 2 seconds (though it can fetch from multiple nodes at once). Fix this by commenting out one line in 0.18.x, 0.19.x or 0.20.x -- see my comment from June 10 2009 here: http://issues.apache.org/jira/browse/MAPREDUCE-318 I saw shuffle times on small clusters with a large map-tasks-per-node ratio decrease by a factor of 30 from that one-line fix. It was the only way to get the network to ever be close to saturation on any node.

The delays for low-latency jobs on smaller clusters are predominantly artificial, due to the nature of most RPC being ping-response, and to most design and testing being done for large clusters of machines that only run a couple of maps or reduces per TaskTracker. Do the above, and you won't be nearly as sensitive to the size of data per task for low-latency jobs as out-of-the-box Hadoop. Your overall utilization will go up quite a bit.

-Scott

On 10/14/09 7:31 AM, Chris Seline ch...@searchles.com wrote:

No, there doesn't seem to be all that much network traffic.
Most of the time traffic (measured with nethogs) is about 15-30K/s on the master and slaves during map; sometimes it bursts up to 5-10 MB/s on a slave for maybe 5-10 seconds on a query that takes 10 minutes, but that is still less than what I see in scp transfers on EC2, which is typically about 30 MB/s.

thanks
Chris

Jason Venner wrote:

Are your network interface or the namenode/jobtracker/datanodes saturated?

On Tue, Oct 13, 2009 at 9:05 AM, Chris Seline ch...@searchles.com wrote:

I am using the 0.3 Cloudera scripts to start a Hadoop cluster on EC2 of 11 c1.xlarge instances (1 master, 10 slaves); that is the biggest instance available, with 20 compute units and 4x 400GB disks. I wrote some scripts to test many (100's) of configurations running a particular Hive query to try to make it as fast as possible, but no matter what, I don't seem to be able to get above roughly 45% CPU utilization on the slaves, and not more than about 1.5% wait state. I have also measured network traffic and there don't seem to be bottlenecks there at all.

Here are some typical CPU utilization lines from top on a slave when running a query:

Cpu(s): 33.9%us, 7.4%sy, 0.0%ni, 56.8%id, 0.6%wa, 0.0%hi, 0.5%si, 0.7%st
Cpu(s): 33.6%us, 5.9%sy, 0.0%ni, 58.7%id, 0.9%wa, 0.0%hi, 0.4%si, 0.5%st
Cpu(s): 33.9%us, 7.2%sy, 0.0%ni, 56.8%id, 0.5%wa, 0.0%hi, 0.6%si, 1.0%st
Cpu(s): 38.6%us, 8.7%sy, 0.0%ni, 50.8%id, 0.5%wa, 0.0%hi, 0.7%si, 0.7%st
Cpu(s): 36.8%us, 7.4%sy, 0.0%ni, 53.6%id, 0.4%wa, 0.0%hi, 0.5%si, 1.3%st

It seems like if tuned properly, I should be able to max out my CPU (or my disk) and get roughly twice the performance I am seeing now. None of the parameters I am tuning seems to be able to achieve this. Adjusting mapred.map.tasks and mapred.reduce.tasks does help somewhat, and setting io.file.buffer.size to 4096 does better than the default, but the rest of the values I am testing seem to have little positive effect.
These are the parameters I am testing, and the values tried:

io.sort.factor=2,3,4,5,10,15,20,25,30,50,100
mapred.job.shuffle.merge.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
io.bytes.per.checksum=256,512,1024,2048,4192
mapred.output.compress=true,false
hive.exec.compress.intermediate=true,false
hive.map.aggr.hash.min.reduction=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
mapred.map.tasks=1,2,3,4,5,6,8,10,12,15,20,25,30,40,50,60,75,100,150,200
mapred.child.java.opts=-Xmx400m,-Xmx500m,-Xmx600m,-Xmx700m,-Xmx800m,-Xmx900m,-Xmx1000m,-Xmx1200m,-Xmx1400m,-Xmx1600m,-Xmx2000m
mapred.reduce.tasks=5,10,15,20,25,30,35,40,50,60,70,80,100,125,150,200
mapred.merge.recordsBeforeProgress=5000,1,2,3
mapred.job.shuffle.input.buffer.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.99
io.sort.spill.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.99
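A sweep like the one above can be generated mechanically. A minimal, hypothetical sketch (not the poster's actual scripts) that varies one parameter at a time and emits a -D override for each run:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConfigSweep {
    // One-at-a-time sweep: each run overrides a single parameter while
    // the rest stay at their defaults. Each returned string is meant to
    // be passed to a Hadoop job as a generic-options override.
    public static List<String> sweep(Map<String, String[]> grid) {
        List<String> runs = new ArrayList<>();
        for (Map.Entry<String, String[]> e : grid.entrySet()) {
            for (String v : e.getValue()) {
                runs.add("-D " + e.getKey() + "=" + v);
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // A small slice of the grid from the message above.
        Map<String, String[]> grid = new LinkedHashMap<>();
        grid.put("io.sort.factor", new String[]{"2", "10", "100"});
        grid.put("mapred.reduce.tasks", new String[]{"5", "40", "200"});
        for (String run : sweep(grid)) {
            System.out.println(run); // one override per benchmark run
        }
    }
}
```

Each emitted line can then be appended to the job invocation, with timings recorded per run.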
Re: Faster alternative to FSDataInputStream
If it always takes a very long time to start transferring data, get a few stack dumps (jstack, or kill -3 to trigger a thread dump) during this period to see what it is doing during this time. Most likely, the client is doing nothing but waiting on the remote side.

On 8/20/09 8:02 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote:

It's not really 1 MBps so much as it takes 2 minutes to start doing the reads.
Ananth T Sarathy

On Wed, Aug 19, 2009 at 4:30 PM, Scott Carey sc...@richrelevance.com wrote:

On 8/19/09 10:58 AM, Raghu Angadi rang...@yahoo-inc.com wrote:

Edward Capriolo wrote:

On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

It would be as fast as the underlying filesystem goes.

I would not agree with that statement. There is overhead.

You might be misinterpreting my comment. There is of course some overhead (at the least the procedure calls). Depending on your underlying filesystem, there could be extra buffer copies and CRC overhead. But none of that explains a transfer as slow as 1 MBps (if my interpretation of the results is correct).

Raghu.

Yes, there is nothing about distributing work for parallel execution that is going to make a single 20MB file transfer faster. That is very slow, and should be on the order of a second or so, not multiple minutes. Something else is wrong.
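For completeness, the same thread-dump information that jstack or kill -3 produces externally is also reachable from inside the JVM via the standard Thread.getAllStackTraces() API; a small sketch:

```java
import java.util.Map;

public class StackSnapshot {
    // Print a stack dump of every live thread, similar in spirit to what
    // jstack or kill -3 (SIGQUIT) produces externally. Useful for seeing
    // what a client is blocked on while "nothing" appears to be happening.
    public static void dumpAll() {
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            System.out.println("Thread: " + e.getKey().getName());
            for (StackTraceElement frame : e.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }

    public static void main(String[] args) {
        dumpAll();
    }
}
```

Calling dumpAll() periodically (or from a timer thread) during the slow startup window would show whether the client is parked waiting on a socket.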
Re: Faster alternative to FSDataInputStream
On 8/20/09 9:48 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote:

OK... it seems that's the case. That seems kind of self-defeating though.
Ananth T Sarathy

Then something is wrong with S3. It may be misconfigured, or just performing poorly. I have no experience with S3, but 20 seconds to connect (authenticate?) and open a file seems very slow for any file system.

On Thu, Aug 20, 2009 at 12:31 PM, Scott Carey sc...@richrelevance.com wrote:

If it always takes a very long time to start transferring data, get a few stack dumps (jstack, or kill -3 to trigger a thread dump) during this period to see what it is doing during this time. Most likely, the client is doing nothing but waiting on the remote side.

On 8/20/09 8:02 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote:

It's not really 1 MBps so much as it takes 2 minutes to start doing the reads.
Ananth T Sarathy

On Wed, Aug 19, 2009 at 4:30 PM, Scott Carey sc...@richrelevance.com wrote:

On 8/19/09 10:58 AM, Raghu Angadi rang...@yahoo-inc.com wrote:

Edward Capriolo wrote:

On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

It would be as fast as the underlying filesystem goes.

I would not agree with that statement. There is overhead.

You might be misinterpreting my comment. There is of course some overhead (at the least the procedure calls). Depending on your underlying filesystem, there could be extra buffer copies and CRC overhead. But none of that explains a transfer as slow as 1 MBps (if my interpretation of the results is correct).

Raghu.

Yes, there is nothing about distributing work for parallel execution that is going to make a single 20MB file transfer faster. That is very slow, and should be on the order of a second or so, not multiple minutes. Something else is wrong.
Re: NN memory consumption on 0.20/0.21 with compressed pointers/
On 8/20/09 3:40 AM, Steve Loughran ste...@apache.org wrote:

Does anyone have any up-to-date data on the memory consumption per block/file on the NN on a 64-bit JVM with compressed pointers?

The best documentation on consumption is http://issues.apache.org/jira/browse/HADOOP-1687

- I'm just wondering if anyone has looked at the memory footprint on the latest Hadoop releases, on those latest JVMs?
- And which JVM the numbers from HADOOP-1687 came from?

Those compressed pointers (which BEA JRockit had for a while) save RAM when the pointer references are within a couple of GB of the other refs, and they are discussed in some papers:
http://rappist.elis.ugent.be/~leeckhou/papers/cgo06.pdf
http://www.elis.ugent.be/~kvenster/papers/VenstermansKris_ORA.pdf
Sun's commentary is up here: http://wikis.sun.com/display/HotSpotInternals/CompressedOops

I'm just not sure what it means for the NameNode, and as there is no sizeof() operator in Java, it is something that will take a bit of effort to work out. From what I read of the Sun wiki, when you go compressed, while your heap is under 3-4GB there is no decompress operation; once you go above that there is a shift and an add, which is probably faster than fetching another 32 bits from L2 or main RAM. The result could be -- could be -- that your NN takes up much less space on 64-bit JVMs than it did before, but is no slower.

The implementation in JRE 6u14 uses a shift for all heap sizes; the optimization to remove that for heaps less than 4GB is not in the HotSpot version there (but will be later). The size advantage is there either way. I have not tested an app myself that was not faster using -XX:+UseCompressedOops on a 64-bit JVM. The extra bit shifting is overshadowed by how much faster and less frequent GC is with a smaller dataset.

Has anyone worked out the numbers yet?
-steve

Every Java reference is 4 bytes instead of 8, and for several types -- arrays in particular -- the object is also 4 bytes smaller.
Given that the NN data structures have plenty of references, a 30% reduction in memory used would not be a surprise. Collection classes in particular are nearly half the size.
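The discussion above lends itself to a back-of-envelope calculation. The sketch below uses illustrative per-object byte counts (placeholders, not measured values; see HADOOP-1687 for real numbers) to show how the estimate scales with namespace size:

```java
public class NameNodeHeapEstimate {
    // Back-of-envelope NameNode heap estimate. The per-object byte counts
    // passed in are ASSUMED placeholders for illustration only; measured
    // values come from profiling (see HADOOP-1687). The thread suggests
    // compressed oops could shrink the total by roughly 30%.
    static long estimateBytes(long files, long blocks,
                              long bytesPerFile, long bytesPerBlock) {
        return files * bytesPerFile + blocks * bytesPerBlock;
    }

    public static void main(String[] args) {
        long files = 10_000_000L, blocks = 12_000_000L;
        long full64 = estimateBytes(files, blocks, 150, 150); // 150 B assumed
        long compressed = (long) (full64 * 0.7); // ~30% saving, per the thread
        System.out.printf("64-bit: ~%,d MB; with compressed oops: ~%,d MB%n",
                full64 >> 20, compressed >> 20);
    }
}
```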
Re: Faster alternative to FSDataInputStream
On 8/19/09 10:58 AM, Raghu Angadi rang...@yahoo-inc.com wrote:

Edward Capriolo wrote:

On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

It would be as fast as the underlying filesystem goes.

I would not agree with that statement. There is overhead.

You might be misinterpreting my comment. There is of course some overhead (at the least the procedure calls). Depending on your underlying filesystem, there could be extra buffer copies and CRC overhead. But none of that explains a transfer as slow as 1 MBps (if my interpretation of the results is correct).

Raghu.

Yes, there is nothing about distributing work for parallel execution that is going to make a single 20MB file transfer faster. That is very slow, and should be on the order of a second or so, not multiple minutes. Something else is wrong.
Re: What OS?
On 8/14/09 6:27 AM, Brian Bockelman bbock...@cse.unl.edu wrote:

On Aug 13, 2009, at 10:27 PM, Jason Venner wrote:

Anyone have any performance numbers for Solaris or ZFS based datanodes? The directory and inode cache sizes are a limiting factor for linux for large and busy datanodes.

I haven't run into this at all, and we have quite large and busy datanodes. However, I would recommend making sure you pick an OS you are comfortable administrating. It doesn't do you any good to run Solaris for speed (whatever the performance may be, better or worse) if it takes you twice as long to get basic admin tasks done. I haven't benchmarked our Solaris nodes vs Linux nodes. However, anecdotally, HDFS on Solaris/ZFS consumes significantly more CPU than HDFS on Linux/ext3.

Brian

I wonder if the extra CPU has anything to do with the ZFS checksums. Perhaps it is lower with ZFS checksums off? Since HDFS is already doing checksums on the data, that should be safe. On the other hand, with ZFS you can get transparent, very fast compression for free.

ext3 tends to get very fragmented very fast if there are concurrent writes. XFS avoids that, but only if you set the allocsize mount parameter large enough. In theory, ZFS should avoid fragmentation fairly well for write-once data like HDFS, but I have no experience with that in practice.

On Wed, Aug 12, 2009 at 7:45 AM, tim robertson timrobertson...@gmail.com wrote:

Thanks guys. I'll chat with our sysadmin and see what he thinks. We knew Fedora would require a 6-month rebuild.

On Wed, Aug 12, 2009 at 4:36 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

On Wed, Aug 12, 2009 at 8:03 AM, Brian Bockelman bbock...@cse.unl.edu wrote:

Hey Tim,

One consideration is how long this OS version is going to be receiving updates. Do I do the operations team any favor by having them upgrade every 6 months?
Personally, I'd avoid Fedora for a production cluster because the lack of long-lived releases means that you'll be spending extra effort on upgrading the OS.

Brian

On Aug 12, 2009, at 6:05 AM, tim robertson wrote:

Hi all, Is Fedora a decent choice of OS for a new hadoop cluster? All our other stuff is Fedora, but is there a strong case to move to something else? Cheers Tim

CentOS and Scientific Linux are Red Hat Enterprise Linux clones. I advise people to go with them. Most of this is based on the fact that CentOS is very compatible with RHEL. This is important because packaged but non-open-source software is typically targeted at RHEL. You can read about someone trying to install WebSphere on, say, Fedora Core and see the headaches. As mentioned above, support life is an issue. RHEL/CentOS 5 will be supported until 2014. http://www.redhat.com/security/updates/errata/ The Fedora line typically has a support life of a few months, so your package support dries up fast and then you have to get good with rpmbuild fast :)

--
Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Re: Map performance with custom binary format
On 7/30/09 2:32 PM, william kinney william.kin...@gmail.com wrote:

Hrm, I think I found an issue. In my RecordReader, I do an Arrays.copyOfRange() to get the protobuf binary section for RecordReader.next(key, value). In the profile dumps from child map processes, this one call takes up ~90% of the CPU samples time. So, I wrapped the line with System.nanoTime() calls, and got:

Local process: total(ms): 902.138069, avg: 112.145 ns

Hadoop child processes:
1) total(ms): 6953.47106, avg: 726.906 ns
2) total(ms): 6503.962176, avg: 802.270 ns
3) total(ms): 5482.494256, avg: 677.589 ns
4) total(ms): 5291.664592, avg: 661.764 ns
5) total(ms): 5568.289465, avg: 697.353 ns
6) total(ms): 5638.778551, avg: 702.290 ns
...etc

So for some reason, that call is taking over 6 times longer in Hadoop. The buffer size for it is 65536 for both processes. Any ideas?

That is a very interesting result. Try counting the number of times that the above is called to ensure that it is the same for both -- if the average size of the copy is much smaller, it will be slower. Other ideas: is one using a native ByteBuffer underneath the covers somewhere and the other not? Is there some other difference in buffering on either side of that copy?

-- You might want to try -Xmx486m as an experiment on the local test to see if it affects the behavior. If you are doing a lot of garbage creation it may.
- Tried it, no changes.

Hmm, that was a random guess, because it would obviously affect CPU use.

Another thing to try -- make sure your writer that is writing into HDFS is wrapped with a buffer (try 32k or 64k). That's another random guess for something that might not show up well in stack traces without a profiler -- but also might already be done.
- So you're saying when writing the file into HDFS, I should make sure it ends on a 64k chunk (i.e., zero out until I reach such a point)? So all file sizes are a multiple of 64KB?
No, just that it's using something like a BufferedOutputStream when writing from your custom format out (HDFS does this itself so it shouldn't be necessary) and a BufferedInputStream for reading.

There is definitely a mystery here. I expect the task scheduling delays and some startup inefficiency, but the overall difference is odd.

What about a local test on a single, larger file versus a hadoop job on that same single, larger file (which would have just one map task)? This test may be very enlightening.
- Total job time was 20 seconds for the 506MB file. The task took 19 seconds. The local process on the same file took ~3 seconds.

Ok, so drilling down here is where we need to look (and what the results above are). Scheduling may be a few seconds of that. Hmm, that difference seems a bit troubling.

For one, you are running two tasks at once per node -- is there any way to do your local, non-MR test with two concurrent processes or threads?
- Does the above test answer this? Only one task was executed on the node that took 19 seconds.

Yeah, it looks like we have ruled that out.

ext3 filesystem. Make sure your OS readahead on the device is set to a good value (at least 2048 blocks, preferably 8192 ish):
- For RA it's showing 256, BSZ is 2048. Should RA be 8192? Should BSZ then be larger? What about SSZ?

I'm referring to /sbin/blockdev --getra device, which is just RA. SSZ is sector size -- that can't change, and I think BSZ is block size, which is also static. Use /sbin/blockdev --setra value device to set the readahead. This will increase sequential throughput somewhat at the device level, but more so if there are two or more concurrent reads. It doesn't affect random I/O performance. Basically, if the block layer detects a sequence of I/Os that are sequential, it starts reading ahead of the last I/O and keeps increasing the size of this readahead as long as the sequential access continues, up to a max size.
Since the performance is high for the local process, would that then mean my disk I/O is sufficient, as you suggested? Do I still need to change any of these settings?

You seem CPU bound, especially considering your evidence above. I/O tuning might help somewhere, but not in this use case. If even a single task on a single large file is slower in MB/sec than your test program, then I suspect read/write buffer issues or misuse somewhere.
- Do you know of an instance where I'd have buffer issues with the child process, and not locally? The only difference I can think of is of course how the buffer is filled, FileInputStream vs FSDataInputStream. But once it is filled, why would reading portions of that buffer (i.e., Arrays.copyOfRange()) take longer in one instance but not another?

I am not familiar enough with that part of Hadoop to know. In general, that buffer may be too small, or be backed by a native ByteBuffer, which will be slow for small reads into Java memory. Would it be helpful to get a histogram of the
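The nanoTime measurement described in this thread can be sketched as a standalone microbenchmark. One caveat worth noting: without JIT warm-up the averages are easily inflated, which is one mundane source of local-vs-cluster discrepancies.

```java
import java.util.Arrays;

public class CopyTiming {
    // Rough sketch of the timing above: average cost in nanoseconds of an
    // Arrays.copyOfRange() of `len` bytes out of a larger buffer. Numbers
    // from a loop like this are noisy; run a warm-up pass first so JIT
    // compilation doesn't dominate the measurement.
    static long averageNanos(byte[] buf, int len, int iters) {
        long total = 0;
        byte[] sink = null;
        for (int i = 0; i < iters; i++) {
            long t0 = System.nanoTime();
            sink = Arrays.copyOfRange(buf, 0, len);
            total += System.nanoTime() - t0;
        }
        if (sink == null) throw new AssertionError(); // keep the copy live
        return total / iters;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[65536];          // 64 KB, as in the thread
        averageNanos(buf, 1024, 10_000);       // warm-up pass
        System.out.println("avg ns: " + averageNanos(buf, 1024, 10_000));
    }
}
```

Comparing the warmed-up average inside and outside the child JVM would confirm whether the slowdown is in the copy itself or elsewhere.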
Re: Map performance with custom binary format
Well, the first thing to do in any performance bottleneck investigation is to look at the machine hardware resource usage. During your test, what is the CPU use and disk usage? What about network utilization? top, vmstat, iostat, and some network usage monitoring would be useful. It could be many things causing your lack of scalability, but without actually monitoring your machines to see if there is an obvious bottleneck, it's just random guessing and hunches.

On 7/28/09 8:18 AM, william kinney william.kin...@gmail.com wrote:

Hi, thanks in advance for the help!

I have a performance question relating to how fast I can expect Hadoop to scale. Running Cloudera's 0.18.3-10. I have a custom binary format, which is just Google Protocol Buffer (protobuf) serialized data: 669 files, ~30GB total size (ranging from 10MB to 100MB each). 128MB block size. 10 Hadoop nodes.

I tested my InputFormat and RecordReader for my input format, and it showed about 56MB/s performance (single thread, no Hadoop, passed in a test file via FileInputFormat instead of FSDataInputStream) on hardware similar to what I have in my cluster. I also then tested some simple map logic along with the above, and got around 54MB/s. I believe that difference can be accounted for by parsing the protobuf data into Java objects.

Anyway, when I put this logic into a job that has
- no reduce (.setNumReduceTasks(0);)
- no emit
- just protobuf parsing calls (like above)
I get a finish time of 10 min 25 sec, which is about 106.24 MB/s.

So my question: why is the rate only 2x what I see in a single-thread, non-Hadoop test? Would it not be: 54MB/s x 10 (number of nodes) - small Hadoop overhead? Is there any area of my configuration I should look into for tuning? Is there any way I could get more accurate performance monitoring of my job?

On a side note, I tried the same job after combining the files into about 11 files (still 30GB in size), and actually saw a decrease in performance (~90MB/s).

Any help is appreciated. Thanks!
Will

Some hadoop-site.xml values:
dfs.replication = 3
io.file.buffer.size = 65536
dfs.datanode.handler.count = 3
mapred.tasktracker.map.tasks.maximum = 6
dfs.namenode.handler.count = 5
Re: Disk configuration.
For both the DN and TT you can provide a comma-separated list of directories. So, drive 1 could be /hadoop1 and drive 2 /hadoop2. Then in each of those there could be a dfs directory and another for task temp storage. Hadoop will round-robin writes to these automatically.

dfs.data.dir might look something like:

<property>
  <name>dfs.data.dir</name>
  <value>/hadoop1/dfs/data,/hadoop2/dfs/data</value>
  <description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
</property>

And the local mapreduce dir might look like:

<property>
  <name>mapred.local.dir</name>
  <value>/hadoop1/tmp,/hadoop2/tmp</value>
  <description>The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored.</description>
</property>

On 7/13/09 11:50 AM, Dmitry Pushkarev u...@stanford.edu wrote:

Hi. We're running a small 30-node cluster and in a few days will reinstall the whole software stack, so I want to change the HDD configuration that was done a long time ago and seems to be inefficient -- each node has 2x1TB drives that are LVMed into one single volume.

How do people usually set up drives? For example, will it be better to mount them to two separate folders and feed these folders to both tasktracker and datanode? Or set up LVM with RAID 0 to maximize bandwidth?

What I want is that the 2TB of drive space per node is equally accessible to both tasktracker and datanode, and I'm not sure that mounting two drives to separate folders achieves that. (For example, if a reducer fills one drive, will it start writing the rest of the data to the second drive?)

Thanks.
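The round-robin behavior described above can be illustrated with a toy allocator. This is an illustration only, not Hadoop's actual directory-selection code:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinDirs {
    // Toy sketch of round-robin directory selection: successive writes
    // are spread across the configured directories in turn, the way a
    // comma-separated dfs.data.dir / mapred.local.dir list is used.
    private final String[] dirs;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinDirs(String commaSeparated) {
        this.dirs = commaSeparated.split(",");
    }

    String pick() {
        return dirs[Math.floorMod(next.getAndIncrement(), dirs.length)];
    }

    public static void main(String[] args) {
        RoundRobinDirs d =
            new RoundRobinDirs("/hadoop1/dfs/data,/hadoop2/dfs/data");
        System.out.println(d.pick()); // /hadoop1/dfs/data
        System.out.println(d.pick()); // /hadoop2/dfs/data
        System.out.println(d.pick()); // /hadoop1/dfs/data
    }
}
```

The takeaway for the question asked: with two mount points listed, block writes alternate between drives, so both spindles stay busy without LVM striping.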