Re: hadoop filesystem cache

2012-01-17 Thread Rita
My intention isn't to make it a mandatory feature, just an option. Keeping data locally on a filesystem as a form of Lx cache is far better than fetching it over the network, and the cost of the fs buffer cache is much cheaper than an RPC call. On Mon, Jan 16, 2012 at 1:07 PM, Edward Capriolo

NameNode per-block memory usage?

2012-01-17 Thread Otis Gospodnetic
Hello, How much memory/JVM heap does NameNode use for each block? I've tried locating this in the FAQ and on search-hadoop.com, but couldn't find a ton of concrete numbers, just these two: http://search-hadoop.com/m/RmxWMVyVvK1 - 150 bytes/block? http://search-hadoop.com/m/O886P1VyVvK1 - 1 GB

Re: NameNode per-block memory usage?

2012-01-17 Thread Joey Echeverria
How much memory/JVM heap does NameNode use for each block? I don't remember the exact number; it also depends on which version of Hadoop you're using. http://search-hadoop.com/m/O886P1VyVvK1 - 1 GB heap for every object? It's 1 GB for every *million* objects (files, blocks, etc.). This is a
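
A minimal back-of-the-envelope sketch of that rule of thumb (roughly 1 GB of heap per million namespace objects, per the reply above); the file and block counts are hypothetical placeholders and the figure is an approximation, not exact NameNode accounting:

    // Rough NameNode heap estimate from the "1 GB per million objects" rule of thumb.
    public class NameNodeHeapEstimate {
        public static void main(String[] args) {
            long files = 10000000L;           // 10 million files (placeholder)
            long avgBlocksPerFile = 2;        // placeholder average
            long objects = files + files * avgBlocksPerFile;  // files + blocks
            double heapGb = objects / 1000000.0;              // ~1 GB per million objects
            System.out.printf("~%.1f GB of NameNode heap for %,d objects%n", heapGb, objects);
        }
    }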

Re: NameNode per-block memory usage?

2012-01-17 Thread Edward Capriolo
On Tue, Jan 17, 2012 at 10:08 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello, How much memory/JVM heap does NameNode use for each block? I've tried locating this in the FAQ and on search-hadoop.com, but couldn't find a ton of concrete numbers, just these two:

RE: How to find out whether a node is Overloaded from Cpu utilization ?

2012-01-17 Thread Bill Brune
Hi, The significant factor in cluster loading is memory, not CPU. Hadoop views the cluster only with respect to memory and does not care about CPU utilization or disk saturation. If you run too many TaskTrackers, you risk memory overcommit, where the Linux OOM killer will come out of the closet and
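
A minimal sketch of the knobs usually tuned to avoid that overcommit on 0.20-era clusters; the property names are the standard mapred-site.xml ones, but the values are placeholders that must be sized so that slots x per-task heap plus the DataNode/TaskTracker daemons fit in each node's physical RAM:

    <!-- mapred-site.xml on each TaskTracker; illustrative values only -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>    <!-- map slots per node -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>    <!-- reduce slots per node -->
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m</value>    <!-- heap per task JVM -->
    </property>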

Re: effect on data after topology change

2012-01-17 Thread Todd Lipcon
Hi Ravi, You'll probably need to up the replication level of the affected files and then drop it back down to the desired level. Current versions of HDFS do not automatically repair rack policy violations if they're introduced in this manner. -Todd On Mon, Jan 16, 2012 at 3:53 PM, rk vishu
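
A sketch of that workaround with the 0.20-era shell; the path and replication factors are placeholders (raise above the target, wait for the extra replicas to be placed under the new topology, then drop back down):

    # Illustrative only: bump replication so new replicas land under the
    # updated topology, then return to the normal factor.
    hadoop fs -setrep -w 4 /path/to/affected/files
    hadoop fs -setrep 3 /path/to/affected/files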

Re: effect on data after topology change

2012-01-17 Thread rk vishu
Thank you very much Todd. I hope future versions of the hadoop rebalancer will include this check. I have one more question. If we are in the process of setting up additional nodes incrementally in a different rack (say rack-2) and rack-2 is only 25% of the size of rack-1, how would data be balanced (with

org.apache.hadoop.mapred.Merger merge bug

2012-01-17 Thread Bai Shen
I think I've found a bug in the Merger code for Hadoop. When a map task runs, it creates spill files based on io.sort.mb. It then merges io.sort.factor spill files at a time in order to create an output file that's passed to the reduce side. The higher these two settings are configured, the more
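
For reference, these are the two settings in question; the values below are placeholders for illustration, not recommendations (the message does not say what the poster had them set to):

    <!-- mapred-site.xml / per-job configuration; illustrative values only -->
    <property>
      <name>io.sort.mb</name>
      <value>256</value>     <!-- in-memory sort buffer per map task, in MB -->
    </property>
    <property>
      <name>io.sort.factor</name>
      <value>64</value>      <!-- number of spill segments merged in one pass -->
    </property>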

race condition in hadoop 0.20.2 (cdh3u1)

2012-01-17 Thread Stan Rosenberg
Hi, This posting is essentially about a bug, but it is also related to a programmatic idiom endemic to hadoop. Thus, I am posting to 'common-user' as opposed to 'common-dev'; if the latter is more appropriate, please let me know. Also, I checked jira and was unable to find a bug match.

Error Using Hadoop 0.20.2/Mahout 0.4 on Solr 3.4

2012-01-17 Thread Peyman Mohajerian
Hi Guys, I'm running Clojure code inside Solr 3.4 that makes calls to Mahout 0.4 for a text clustering job. Due to some issues with Clojure I had to put all the jar files in the solr war file ('WEB-INF/lib'). I also made sure to put hadoop core and mapreduce config xml files in the same

Re: NameNode per-block memory usage?

2012-01-17 Thread M. C. Srivas
Konstantin's paper http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf mentions that on average a file consumes about 600 bytes of memory in the name-node (1 file object + 2 block objects). To quote from his paper (see page 9) .. in order to store 100 million files
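
(A rough back-of-the-envelope from that figure: 100 million files at ~600 bytes each works out to on the order of 60 GB of NameNode heap for the namespace objects alone, before any other overhead.)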

Re: race condition in hadoop 0.20.2 (cdh3u1)

2012-01-17 Thread Brock Noland
Hi, tl;dr DUMMY should not be static. On Tue, Jan 17, 2012 at 3:21 PM, Stan Rosenberg srosenb...@proclivitysystems.com wrote: class MyKey<T> implements WritableComparable<T> {  private String ip; // first part of the key  private final static Text DUMMY = new Text();  ...  public void

Re: race condition in hadoop 0.20.2 (cdh3u1)

2012-01-17 Thread Stan Rosenberg
On Tue, Jan 17, 2012 at 6:38 PM, Brock Noland br...@cloudera.com wrote: This class is invalid. A single thread will be executing your mapper or reducer, but there will be multiple threads (background threads such as the SpillThread) creating MyKey instances, which is exactly what you are seeing.
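
A simplified, non-generic sketch of the fix Brock suggests, with the scratch Text held per instance instead of in a static field; the field names follow the example quoted in the thread, and the serialization logic is assumed for illustration:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    class MyKey implements WritableComparable<MyKey> {
        private String ip;                       // first part of the key
        private final Text scratch = new Text(); // per-instance, not static

        @Override
        public void write(DataOutput out) throws IOException {
            scratch.set(ip);
            scratch.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            scratch.readFields(in);
            ip = scratch.toString();
        }

        @Override
        public int compareTo(MyKey other) {
            return ip.compareTo(other.ip);
        }
    }

Because the mapper thread and background threads like the SpillThread can each be working with MyKey instances at the same time, anything shared through a static field gets mutated concurrently; per-instance state avoids that.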

RE: How to find out whether a node is Overloaded from Cpu utilization ?

2012-01-17 Thread ArunKumar
Guys! So can I say that if memory usage is more than, say, 90%, the node is overloaded? If so, what should that threshold percentage be, or how can we find it? Arun

Using S3 instead of HDFS

2012-01-17 Thread Mark Kerzner
Hi, whatever I do, I can't make it work; that is, I cannot use s3://host or s3n://host as a replacement for HDFS while running an EC2 cluster. I change the settings in core-site.xml and hdfs-site.xml and start the hadoop services, and it fails with error messages. Is there a place where this

Re: Using S3 instead of HDFS

2012-01-17 Thread Harsh J
Hey Mark, What is the exact trouble you run into? What do the error messages indicate? This should be definitive enough I think: http://wiki.apache.org/hadoop/AmazonS3 On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, whatever I do, I can't make it work, that

Re: Using S3 instead of HDFS

2012-01-17 Thread Mark Kerzner
Well, here is my error message: Starting Hadoop namenode daemon: starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out ERROR. Could not start Hadoop namenode daemon Starting Hadoop secondarynamenode daemon: starting secondarynamenode, logging to

Re: Using S3 instead of HDFS

2012-01-17 Thread Harsh J
When using S3 you do not need to run any component of HDFS at all. It is meant to be an alternate FS choice. You need to run only MR. The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions how to go about specifying your auth details to S3, either directly via the fs.default.name URI

Is it possible to set how many map slots to use on each job submission?

2012-01-17 Thread edward choi
Hi, I often run into situations like this: I am running a very heavy job (let's say job 1) on a hadoop cluster, and it takes many hours. Then something comes up that needs to be done very quickly (let's say job 2). Job 2 only takes a couple of hours when executed on hadoop. But it will take a couple

Re: Using S3 instead of HDFS

2012-01-17 Thread Mark Kerzner
That wiki page mentions hadoop-site.xml, but that is the old layout; now you have core-site.xml and hdfs-site.xml, so which one do you put it in? Thank you (and good night Central Time:) mark On Wed, Jan 18, 2012 at 12:52 AM, Harsh J ha...@cloudera.com wrote: When using S3 you do not need to run any

Re: Is it possible to set how many map slots to use on each job submission?

2012-01-17 Thread Harsh J
Edward, You need to invest in configuring a non-FIFO scheduler. FairScheduler may be what you are looking for. Take a look at http://hadoop.apache.org/common/docs/current/fair_scheduler.html for the docs. On 18-Jan-2012, at 12:27 PM, edward choi wrote: Hi, I often run into situations like
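
A minimal sketch of enabling the FairScheduler on a 0.20-era JobTracker, assuming the fairscheduler contrib jar is on the JobTracker classpath; the allocation-file path is a placeholder:

    <!-- mapred-site.xml on the JobTracker; illustrative only -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
      <name>mapred.fairscheduler.allocation.file</name>
      <value>/etc/hadoop/conf/fair-scheduler.xml</value>  <!-- placeholder path -->
    </property>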

Re: Using S3 instead of HDFS

2012-01-17 Thread Harsh J
Ah sorry about missing that. Settings would go in core-site.xml (hdfs-site.xml will no longer be relevant once you switch to using S3). On 18-Jan-2012, at 12:36 PM, Mark Kerzner wrote: That wiki page mentions hadoop-site.xml, but this is old, now you have core-site.xml and
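
A minimal core-site.xml sketch along those lines; the bucket name and credentials are placeholders, and the properties shown are the s3n variants described on the wiki page (the auth details can alternatively be embedded in the fs.default.name URI itself):

    <!-- core-site.xml; bucket and keys are placeholders -->
    <property>
      <name>fs.default.name</name>
      <value>s3n://your-bucket</value>
    </property>
    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>YOUR_ACCESS_KEY_ID</value>
    </property>
    <property>
      <name>fs.s3n.awsSecretAccessKey</name>
      <value>YOUR_SECRET_ACCESS_KEY</value>
    </property>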

Container size

2012-01-17 Thread raghavendhra rahul
Hi, What is the minimum size of a container in hadoop yarn? capability.setMemory(xx);
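
For context, a sketch of where that call sits in an ApplicationMaster's container request, with a placeholder value; the scheduler rounds requests up to its configured minimum allocation, which is what effectively sets the smallest container size:

    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.util.Records;

    public class ContainerCapabilitySketch {
        // Build the memory capability an ApplicationMaster attaches to a
        // container request; 1024 MB is a placeholder, not a recommended value.
        public static Resource memoryCapability(int memoryMb) {
            Resource capability = Records.newRecord(Resource.class);
            capability.setMemory(memoryMb);  // rounded up to the scheduler's minimum allocation
            return capability;
        }
    }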