Re: Container size
The AM gets this information from the RM via the return value of AMRMProtocol.registerApplicationMaster. On Jan 17, 2012, at 11:24 PM, raghavendhra rahul wrote: > Hi, > > What is the minimum size of the container in hadoop yarn. >capability.setmemory(xx);
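To make that concrete, here is a rough sketch against the 0.23-era YARN API (class and method names may differ in later versions); the AMRMProtocol proxy and the ApplicationAttemptId are assumed to have been obtained already through the usual AM bootstrap:

    import org.apache.hadoop.yarn.api.AMRMProtocol;
    import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterRequest;
    import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
    import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
    import org.apache.hadoop.yarn.util.Records;

    // Sketch: register the AM; the response carries the cluster's min/max container sizes (MB).
    static int[] queryContainerMemoryLimits(AMRMProtocol amrmProtocol,
                                            ApplicationAttemptId attemptId) throws Exception {
      RegisterApplicationMasterRequest req =
          Records.newRecord(RegisterApplicationMasterRequest.class);
      req.setApplicationAttemptId(attemptId);
      RegisterApplicationMasterResponse resp = amrmProtocol.registerApplicationMaster(req);
      int minMemMB = resp.getMinimumResourceCapability().getMemory();
      int maxMemMB = resp.getMaximumResourceCapability().getMemory();
      // Any capability passed to setMemory() should fall within [minMemMB, maxMemMB].
      return new int[] { minMemMB, maxMemMB };
    }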
Container size
Hi, What is the minimum size of a container in Hadoop YARN? capability.setMemory(xx);
Re: Using S3 instead of HDFS
Ah, sorry about missing that. The settings go in core-site.xml (hdfs-site.xml is no longer relevant once you switch to using S3). On 18-Jan-2012, at 12:36 PM, Mark Kerzner wrote: > That wiki page mentiones hadoop-site.xml, but this is old, now you have > core-site.xml and hdfs-site.xml, so which one do you put it in? > > Thank you (and good night Central Time:) > > mark -- Harsh J Customer Ops. Engineer, Cloudera
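A sketch of what the resulting core-site.xml could look like for the separate-properties approach discussed in this thread; the bucket name and keys are placeholders. For the s3:// block store the property names are fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey as mentioned above, and for s3n:// the fs.s3n.* equivalents apply. With this in place the HDFS daemons are simply not started:

    <property>
      <name>fs.default.name</name>
      <value>s3n://mybucket</value>
    </property>
    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>YOUR_AWS_ACCESS_KEY_ID</value>
    </property>
    <property>
      <name>fs.s3n.awsSecretAccessKey</name>
      <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
    </property>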
Re: Is it possible to set how many map slots to use on each job submission?
Edward, You need to invest in configuring a non-FIFO scheduler. FairScheduler may be what you are looking for. Take a look at http://hadoop.apache.org/common/docs/current/fair_scheduler.html for the docs. On 18-Jan-2012, at 12:27 PM, edward choi wrote: > Hi, > > I often run into situations like this: > I am running a very heavy job(let's say job 1) on a hadoop cluster(which > takes many hours). Then something comes up that needs to be done very > quickly(let's say job 2). > Job 2 only takes a couple of hours when executed on hadoop. But it will > take a couple ten hours if run on a single machine. > So I'd definitely want to use Hadoop for job 2. But since job 1 is already > running on Hadoop and hogging all the map slots, I can't run job 2 on > hadoop(it will only be queued). > > So I was wondering: > Is there a way to set a specific number of map slots(or the number of slave > nodes) to use when submitting each job? > I read that setNumMapTasks() is not an enforcing configuration. > I would like to leave a couple of map slots free for occasions like above. > > Ed -- Harsh J Customer Ops. Engineer, Cloudera
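For reference, a minimal sketch of wiring in the FairScheduler on a 0.20-era JobTracker, based on the documentation linked above; the file path, pool name, and slot counts are illustrative. In mapred-site.xml:

    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
      <name>mapred.fairscheduler.allocation.file</name>
      <value>/etc/hadoop/conf/fair-scheduler.xml</value>
    </property>

and in the allocation file, a pool that is guaranteed a minimum share of slots so an urgent job is not starved by a long-running one:

    <?xml version="1.0"?>
    <allocations>
      <pool name="urgent">
        <minMaps>4</minMaps>
        <minReduces>2</minReduces>
        <weight>2.0</weight>
      </pool>
    </allocations>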
Re: Using S3 instead of HDFS
That wiki page mentions hadoop-site.xml, but this is old; now you have core-site.xml and hdfs-site.xml, so which one do you put it in? Thank you (and good night Central Time:) mark On Wed, Jan 18, 2012 at 12:52 AM, Harsh J wrote: > When using S3 you do not need to run any component of HDFS at all. It > is meant to be an alternate FS choice. You need to run only MR. > > The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions on > how to go about specifying your auth details to S3, either directly > via the fs.default.name URI or via the additional properties > fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not work > for you?
Is it possible to set how many map slots to use on each job submission?
Hi, I often run into situations like this: I am running a very heavy job (let's say job 1) on a Hadoop cluster, which takes many hours. Then something comes up that needs to be done very quickly (let's say job 2). Job 2 only takes a couple of hours when executed on Hadoop, but it would take a few tens of hours if run on a single machine. So I'd definitely want to use Hadoop for job 2. But since job 1 is already running on Hadoop and hogging all the map slots, I can't run job 2 on Hadoop (it will only be queued). So I was wondering: is there a way to set a specific number of map slots (or the number of slave nodes) to use when submitting each job? I read that setNumMapTasks() is not an enforced setting. I would like to leave a couple of map slots free for occasions like the above. Ed
Re: Using S3 instead of HDFS
When using S3 you do not need to run any component of HDFS at all. It is meant to be an alternate FS choice. You need to run only MR. The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions on how to go about specifying your auth details to S3, either directly via the fs.default.name URI or via the additional properties fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not work for you? On Wed, Jan 18, 2012 at 12:14 PM, Mark Kerzner wrote: > And, if I don't need to start the NameNode, then where do I give the S3 > credentials? > > Thank you, > Mark -- Harsh J Customer Ops. Engineer, Cloudera
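For the URI form mentioned above, the credentials can be embedded directly in the default FS URI; a sketch (bucket and keys are placeholders, and a secret key containing '/' is generally easier to supply via the separate properties instead):

    <property>
      <name>fs.default.name</name>
      <value>s3n://YOUR_AWS_ACCESS_KEY_ID:YOUR_AWS_SECRET_ACCESS_KEY@mybucket</value>
    </property>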
Re: Using S3 instead of HDFS
Well, here is my error message Starting Hadoop namenode daemon: starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out ERROR. Could not start Hadoop namenode daemon Starting Hadoop secondarynamenode daemon: starting secondarynamenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out Exception in thread "main" java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.default.name): s3n://myname.testdata is not of scheme 'hdfs'. at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224) at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.(SecondaryNameNode.java:150) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624) ERROR. Could not start Hadoop secondarynamenode daemon And, if I don't need to start the NameNode, then where do I give the S3 credentials? Thank you, Mark On Wed, Jan 18, 2012 at 12:36 AM, Harsh J wrote: > Hey Mark, > > What is the exact trouble you run into? What do the error messages > indicate? > > This should be definitive enough I think: > http://wiki.apache.org/hadoop/AmazonS3 > > On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner > wrote: > > Hi, > > > > whatever I do, I can't make it work, that is, I cannot use > > > > s3://host > > > > or s3n://host > > > > as a replacement for HDFS while runnings EC2 cluster. I change the > settings > > in the core-file.xml, in hdfs-site.xml, and start hadoop services, and it > > fails with error messages. > > > > Is there a place where this is clearly described? > > > > Thank you so much. > > > > Mark > > > > -- > Harsh J > Customer Ops. Engineer, Cloudera >
Re: Using S3 instead of HDFS
Hey Mark, What is the exact trouble you run into? What do the error messages indicate? This should be definitive enough I think: http://wiki.apache.org/hadoop/AmazonS3 On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner wrote: > Hi, > > whatever I do, I can't make it work, that is, I cannot use > > s3://host > > or s3n://host > > as a replacement for HDFS while runnings EC2 cluster. I change the settings > in the core-file.xml, in hdfs-site.xml, and start hadoop services, and it > fails with error messages. > > Is there a place where this is clearly described? > > Thank you so much. > > Mark -- Harsh J Customer Ops. Engineer, Cloudera
Using S3 instead of HDFS
Hi, whatever I do, I can't make it work; that is, I cannot use s3://host or s3n://host as a replacement for HDFS while running an EC2 cluster. I change the settings in core-site.xml and in hdfs-site.xml, and start the Hadoop services, and it fails with error messages. Is there a place where this is clearly described? Thank you so much. Mark
RE: How to find out whether a node is Overloaded from Cpu utilization ?
Guys! So can I say that if memory usage is more than, say, 90% the node is overloaded? If so, what could that threshold percentage be, or how can we find it? Arun
Re: race condition in hadoop 0.20.2 (cdh3u1)
On Tue, Jan 17, 2012 at 6:38 PM, Brock Noland wrote: > This class is invalid. A single thread will be executing your mapper > or reducer but there will be multiple threads (background threads such > as the SpillThread) creating MyKey instances which is exactly what you > are seeing. This is by design. > Could you please refer me to where this design decision/assumption is/was documented? Imho, this assumption clashes with the overall object re-use methodology. I would have at least considered making 'readFields' and 'write' synchronized, even if it is to indicate that there are multiple threads executing serialization/de-serialization. (As only a few threads are competing in this case, the performance penalty would have been negligible.) Thanks, stan
Re: race condition in hadoop 0.20.2 (cdh3u1)
Hi, tl;dr DUMMY should not be static. On Tue, Jan 17, 2012 at 3:21 PM, Stan Rosenberg wrote: > > > class MyKey implements WritableComparable { > private String ip; // first part of the key > private final static Text DUMMY = new Text(); > ... > > public void write(DataOutput out) throws IOException { > // serialize the first part of the key > DUMMY.set(ip); > DUMMY.write(out); > ... > } > > public void readFields(DataInput in) throws IOException { > // de-serialize the first part of the key > DUMMY.readFields(in); ip = DUMMY.toString(); > > } > } This class is invalid. A single thread will be executing your mapper or reducer but there will be multiple threads (background threads such as the SpillThread) creating MyKey instances which is exactly what you are seeing. This is by design. Brock
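A minimal sketch of the fix suggested here: give each key instance its own Text (or drop the intermediate Text entirely) so that the spill thread and the main thread no longer share mutable state. Names follow the snippet quoted in this thread; the compareTo body is just a placeholder.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    class MyKey implements WritableComparable<MyKey> {
      private String ip;                      // first part of the key
      private final Text dummy = new Text();  // per-instance, not static

      public void write(DataOutput out) throws IOException {
        // serialize the first part of the key
        dummy.set(ip);
        dummy.write(out);
        // ... serialize the remaining parts
      }

      public void readFields(DataInput in) throws IOException {
        // de-serialize the first part of the key
        dummy.readFields(in);
        ip = dummy.toString();
        // ... de-serialize the remaining parts
      }

      public int compareTo(MyKey other) {
        return ip.compareTo(other.ip);
      }
    }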
Re: NameNode per-block memory usage?
Konstantin's paper http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf mentions that on average a file consumes about 600 bytes of memory in the name-node (1 file object + 2 block objects). To quote from his paper (see page 9) ".. in order to store 100 million files (referencing 200 million blocks) a name-node should have at least 60GB of RAM. This matches observations on deployed clusters". On Tue, Jan 17, 2012 at 7:08 AM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: > Hello, > > How much memory/JVM heap does NameNode use for each block? > > I've tried locating this in the FAQ and on search-hadoop.com, but > couldn't find a ton of concrete numbers, just these two: > > http://search-hadoop.com/m/RmxWMVyVvK1 - 150 bytes/block? > http://search-hadoop.com/m/O886P1VyVvK1 - 1 GB heap for every object? > > Thanks, > Otis >
Error Using Hadoop .20.2/Mahout.4 on Solr 3.4
Hi Guys, I'm running Clojure code inside Solr 3.4 that makes calls to Mahout 0.4 for a text clustering job. Due to some issues with Clojure I had to put all the jar files in the Solr war file ('WEB-INF/lib'). I also made sure to put the Hadoop core and mapreduce config xml files in the same location, with a value of ('file:/// or hdfs://localhost:9000..) for 'fs.default.name'. HOWEVER, I get the following stack trace when running the code. I noticed a bug fix in Hadoop YARN related to running unit tests with the same trace. However, I know for sure my code has worked for others using Hadoop 0.20.2. Any ideas what could be wrong? Thanks, Peyman

SEVERE: java.lang.IllegalStateException: Variable substitution depth too large: 20 ${fs.default.name}
  at org.apache.hadoop.conf.Configuration.substituteVars(Configuration.java:366)
  at org.apache.hadoop.conf.Configuration.get(Configuration.java:436)
  at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:103)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
  at lsa4solr.mahout_matrix$distributed_matrix.invoke(mahout_matrix.clj:71)
  at lsa4solr.clustering_protocol$get_frequency_matrix.invoke(clustering_protocol.clj:92)
  at lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:105)
  at lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:123)
  at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:57)
  at lsa4solr.cluster$_cluster.invoke(cluster.clj:85)
  at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
  at org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
  at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
  at org.mortbay.jetty.Server.handle(Server.java:326)
  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
  at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
  at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
  at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

Jan 16, 2012 11:42:17 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/lsa4solr params={nclusters=2&k=10&q=Summary:.*&rows=100&algorithm=kmeans} hits=0 status=500 QTime=37

Jan 16, 2012 11:42:17 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.IllegalStateException: Variable substitution depth too large: 20 ${fs.default.name}
  at org.apache.hadoop.conf.Configuration.substituteVars(Configuration.java:366)
  at org.apache.hadoop.conf.Configuration.get(Configuration.java:436)
  at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:103)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
  at lsa4solr.mahout_matrix$distributed_matrix.invoke(mahout_matrix.clj:71)
  at lsa4solr.clustering_protocol$get_frequency_matrix.invoke(clustering_protocol.clj:92)
  at lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:105)
  at lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:123)
  at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:57)
  at lsa4solr.cluster$_cluster.invoke(cluster.clj:85)
  at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
  at org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
  at org.apache.solr.handler.co
race condition in hadoop 0.20.2 (cdh3u1)
Hi,

This posting is essentially about a bug, but it is also related to a programmatic idiom endemic to Hadoop. Thus, I am posting to 'common-user' as opposed to 'common-dev'; if the latter is more appropriate, please let me know. Also, I checked JIRA and was unable to find a matching bug.

Synopsis
========
It appears that there is a race condition between the spill thread and the main thread. The race condition results in data corruption, specifically an ArrayIndexOutOfBoundsException in org.apache.hadoop.mapred.MapTask on line 1134, when running Hadoop 0.20.2 (cdh3u1). The race occurs when the spill thread is executing 'readFields' on key(s) while the main thread is concurrently executing 'write' in order to serialize the mapper's output. Although this is Cloudera's distribution, I believe this also affects Apache's distribution.

Details
=======
Now let's delve into the details. In order to avoid object allocation, we are led to rely on the singleton pattern. So, consider this code snippet, which illustrates a custom key implementation that, for the sake of performance, re-uses an instance of Text:

class MyKey implements WritableComparable {
  private String ip; // first part of the key
  private final static Text DUMMY = new Text();
  ...

  public void write(DataOutput out) throws IOException {
    // serialize the first part of the key
    DUMMY.set(ip);
    DUMMY.write(out);
    ...
  }

  public void readFields(DataInput in) throws IOException {
    // de-serialize the first part of the key
    DUMMY.readFields(in);
    ip = DUMMY.toString();
  }
}

The problem with the above is that DUMMY is a shared variable, which shouldn't matter in single-threaded environments. However, the spill thread and the main thread are running concurrently, and as exhibited by the stack traces below, the data race ensues. Note that the actual exception is java.lang.ArrayIndexOutOfBoundsException in org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1134). In order to produce the stack traces, I caught the actual exception inside 'write' of my custom key class and dumped the stack traces of all running threads.
Thread[main,5,main]
  java.lang.Thread.dumpThreads(Native Method)
  java.lang.Thread.getAllStackTraces(Unknown Source)
  com.proclivitysystems.etl.psguid.PSguidOrTid.write(PSguidOrTid.java:206)
  org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
  org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
  org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:918)
  org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574)
  org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
  com.proclivitysystems.etl.psguid.WECPreprocessor.map(WECPreprocessor.java:142)
  com.proclivitysystems.etl.psguid.WECPreprocessor.map(WECPreprocessor.java:25)
  org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
  org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
  org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
  org.apache.hadoop.mapred.Child$4.run(Child.java:270)
  java.security.AccessController.doPrivileged(Native Method)
  javax.security.auth.Subject.doAs(Unknown Source)
  org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
  org.apache.hadoop.mapred.Child.main(Child.java:264)

Thread[SpillThread,5,main]
  java.lang.Thread.dumpThreads(Native Method)
  java.lang.Thread.getAllStackTraces(Unknown Source)
  com.proclivitysystems.etl.psguid.PSguidOrTid.readFields(PSguidOrTid.java:240)
  org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:125)
  org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:968)
  org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:101)
  org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
  org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1254)
  org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:712)
  org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1199)

java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException
  at com.proclivitysystems.etl.psguid.PSguidOrTid.write(PSguidOrTid.java:214)
  at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
  at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:918)
  at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574
org.apache.hadoop.mapred.Merger merge bug
I think I've found a bug in the Merger code for Hadoop. When the Map job runs, it creates spill files based on io.sort.mb. It then sorts io.sort.factor files at a time in order to create an output file that's passed to the reduce job. The higher these two settings are configured, the more memory is used. However, as far as I can tell, the memory used is the same no matter what the io.sort parameters are set to.

For example, with an io.sort.mb of 256, an io.sort.factor of 10, and 10 spill files, we get the following scenario: the merger merges all 10 of those spill files into one output file, using roughly 2.5 GB of memory. If we change io.sort.factor to 4, then the merger will merge 4 of the 10 spill files and output the result as a temp file on the hard drive. It then adds the resulting file back into the merge queue. It repeats this action with the next 4 spill files. Now we have 2 spill files remaining plus the 2 temp files, each of which is 4 spill files combined. So on the third pass of the merger, we're back to merging everything into one output file, using roughly 2.5 GB of memory.

No matter what you set io.sort.factor to, you will eventually end up using the same amount of memory; lower factors just take longer due to the intermediate steps. As such, if you only have 2 GB of memory available for the Map job, you will get an OutOfMemoryException every time you attempt to run the job. Can anyone confirm what I'm seeing or point out any flaws in my reasoning? Thanks.
Re: effect on data after topology change
Thank you very much, Todd. I hope future versions of the Hadoop rebalancer will include this check. I have one more question. If we are in the process of setting up additional nodes incrementally in a different rack (say rack-2), and rack-2's capacity is only 25% of rack-1's, how would data be balanced (with the default implementation)? I.e., will Hadoop prefer balancing across all nodes, or will it try to obey the topology first, which could fill up rack-2 quickly? I am positive that it will try to balance across all nodes, but want to be sure. Thanks and Regards Ravi On Tue, Jan 17, 2012 at 10:41 AM, Todd Lipcon wrote: > Hi Ravi, > > You'll probably need to up the replication level of the affected files > and then drop it back down to the desired level. Current versions of > HDFS do not automatically repair rack policy violations if they're > introduced in this manner. > > -Todd
Re: effect on data after topology change
Hi Ravi, You'll probably need to up the replication level of the affected files and then drop it back down to the desired level. Current versions of HDFS do not automatically repair rack policy violations if they're introduced in this manner. -Todd On Mon, Jan 16, 2012 at 3:53 PM, rk vishu wrote: > Hello All, > > If i change the rackid for some nodes and restart namenode, will data be > rearranged accordingly? Do i need to run rebalancer? > > Any information on this would be appreciated. > > Thanks and Regards > Ravi -- Todd Lipcon Software Engineer, Cloudera
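In shell terms, Todd's suggestion amounts to something like the following sketch; the path and the replication factors are illustrative, -R recurses over a directory, and -w waits until the new replication level is actually reached before returning:

    hadoop fs -setrep -R -w 4 /path/to/affected/data
    hadoop fs -setrep -R 3 /path/to/affected/data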
RE: How to find out whether a node is Overloaded from Cpu utilization ?
Hi, The significant factor in cluster loading is memory, not CPU. Hadoop views the cluster only with respect to memory and cares not about CPU utilization or Disk saturation. If you run too many TaskTrackers, you risk memory overcommit where the Linux OOM will come out of the closet and randomly kill processes which will ultimately take out the box. Disk saturation is another issue that will contribute to datanode timeouts, which lead to a downward spiral where the namenode will think the datanode is down and start replicating its blocks adding to the overall I/O loading contributing to other datanode timeouts. If your datanode CPU's are pegged then this contributes as well, but not as much. Generally, if you carve up your cluster properly, you won't have CPU over utilization issues, quite the opposite. -Bill -Original Message- From: Amandeep Khurana [mailto:ama...@gmail.com] Sent: Mon 1/16/2012 10:21 PM To: common-user@hadoop.apache.org Cc: hadoop-u...@lucene.apache.org Subject: Re: How to find out whether a node is Overloaded from Cpu utilization ? Arun, I don't think you'll hear a fixed number. Having said that, I have seen CPU being pegged at 95% during jobs and the cluster working perfectly fine. On the slaves, if you have nothing else going on, Hadoop only has TaskTrackers and DataNodes. Those two daemons are relatively light weight in terms of CPU for the most part. So, you can afford to let your tasks take up a high %. Hope that helps. -Amandeep On Tue, Jan 17, 2012 at 2:16 PM, ArunKumar wrote: > Hi Guys ! > > When we get CPU utilization value of a node in hadoop cluster, what > percent > value can be considered as overloaded ? > Say for eg. > >CPU utilizationNode Status > 85% Overloaded > 20%Normal > > > Arun > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/How-to-find-out-whether-a-node-is-Overloaded-from-Cpu-utilization-tp3665289p3665289.html > Sent from the Hadoop lucene-users mailing list archive at Nabble.com. >
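A sketch of the usual mapred-site.xml knobs for keeping a 0.20-era slave from overcommitting memory, following the sizing logic Bill describes; the numbers below are purely illustrative and would be chosen so that (map slots + reduce slots) x task heap, plus the DataNode and TaskTracker heaps, stays below physical RAM:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m</value>
    </property>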
Re: NameNode per-block memory usage?
On Tue, Jan 17, 2012 at 10:08 AM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: > Hello, > > How much memory/JVM heap does NameNode use for each block? > > I've tried locating this in the FAQ and on search-hadoop.com, but > couldn't find a ton of concrete numbers, just these two: > > http://search-hadoop.com/m/RmxWMVyVvK1 - 150 bytes/block? > http://search-hadoop.com/m/O886P1VyVvK1 - 1 GB heap for every object? > > Thanks, > Otis > Some real world statistics. From NN web Interface. replication factor=2 Cluster Summary 22,061,605 files and directories, 22,151,870 blocks = 44,213,475 total. Heap Size is 10.85 GB / 16.58 GB (65%) compressedOOps is enabled.
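A rough back-of-the-envelope from those numbers, treating the whole used heap as namespace metadata: 10.85 GB is about 11.65e9 bytes, and 11.65e9 / 44,213,475 objects is roughly 260 bytes per object, which sits between the ~150 bytes/block figure and the 1 GB-per-million-objects rule of thumb cited elsewhere in this thread.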
Re: NameNode per-block memory usage?
> How much memory/JVM heap does NameNode use for each block? I don't remember the exact number, it also depends on which version of Hadoop you're using > http://search-hadoop.com/m/O886P1VyVvK1 - 1 GB heap for every object? It's 1 GB for every *million* objects (files, blocks, etc.). This is a good rule of thumb, at least for the 0.20.x/1.0.0 series. Is there a reason you need a more exact estimate? -Joey -- Joseph Echeverria Cloudera, Inc. 443.305.9434
NameNode per-block memory usage?
Hello, How much memory/JVM heap does NameNode use for each block? I've tried locating this in the FAQ and on search-hadoop.com, but couldn't find a ton of concrete numbers, just these two: http://search-hadoop.com/m/RmxWMVyVvK1 - 150 bytes/block? http://search-hadoop.com/m/O886P1VyVvK1 - 1 GB heap for every object? Thanks, Otis
Re: hadoop filesystem cache
My intention isn't to make it a mandatory feature just as an option. Keeping data locally on a filesystem as a method of Lx cache is far better than getting it from the network and the cost of fs buffer cache is much cheaper than a RPC call. On Mon, Jan 16, 2012 at 1:07 PM, Edward Capriolo wrote: > The challenges of this design is people accessing the same data over and > over again is the uncommon usecase for hadoop. Hadoop's bread and butter is > all about streaming through large datasets that do not fit in memory. Also > your shuffle-sort-spill is going to play havoc on and file system based > cache. The distributed cache roughly fits this role except that it does not > persist after a job. > > Replicating content to N nodes also is not a hard problem to tackle (you > can hack up a content delivery system with ssh+rsync) and get similar > results.The approach often taken has been to keep data that is accessed > repeatedly and fits in memory in some other system > (hbase/cassandra/mysql/whatever). > > Edward > > > On Mon, Jan 16, 2012 at 11:33 AM, Rita wrote: > > > Thanks. I believe this is a good feature to have for clients especially > if > > you are reading the same large file over and over. > > > > > > On Sun, Jan 15, 2012 at 7:33 PM, Todd Lipcon wrote: > > > > > There is some work being done in this area by some folks over at UC > > > Berkeley's AMP Lab in coordination with Facebook. I don't believe it > > > has been published quite yet, but the title of the project is "PACMan" > > > -- I expect it will be published soon. > > > > > > -Todd > > > > > > On Sat, Jan 14, 2012 at 5:30 PM, Rita wrote: > > > > After reading this article, > > > > http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I > > was > > > > wondering if there was a filesystem cache for hdfs. For example, if a > > > large > > > > file (10gigabytes) was keep getting accessed on the cluster instead > of > > > keep > > > > getting it from the network why not storage the content of the file > > > locally > > > > on the client itself. A use case on the client would be like this: > > > > > > > > > > > > > > > > > > > > dfs.client.cachedirectory > > > > /var/cache/hdfs > > > > > > > > > > > > > > > > > > > > dfs.client.cachesize > > > > in megabytes > > > > 10 > > > > > > > > > > > > > > > > Any thoughts of a feature like this? > > > > > > > > > > > > -- > > > > --- Get your facts first, then you can distort them as you please.-- > > > > > > > > > > > > -- > > > Todd Lipcon > > > Software Engineer, Cloudera > > > > > > > > > > > -- > > --- Get your facts first, then you can distort them as you please.-- > > > -- --- Get your facts first, then you can distort them as you please.--
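The client-side cache proposal quoted above lost its XML markup in the archive; reconstructed from the property names and values that survive, the proposed configuration (hypothetical, these properties do not exist in Hadoop) reads roughly as:

    <property>
      <name>dfs.client.cachedirectory</name>
      <value>/var/cache/hdfs</value>
    </property>
    <property>
      <name>dfs.client.cachesize</name>
      <description>in megabytes</description>
      <value>10</value>
    </property>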