Re: Clarification: CDH3 - installation JDK dependency

2011-07-08 Thread Harsh J
Hello Kumar, Going forward, you could send Cloudera-distribution-specific questions to cdh-u...@cloudera.org directly as well ( groups.google.com/a/cloudera.org ). Glad to know your RPM installed successfully, and yes, the Sun JDK RPM is required to be installed -- OpenJDK wouldn't work as well as
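
A quick way to confirm which JDK a node is actually running (a hypothetical sanity check, not part of the original thread; the two system properties below are standard JVM properties):

    // Print the JVM vendor and version so you can confirm the Sun JDK
    // (rather than OpenJDK) is the one Hadoop will run under.
    public class JvmCheck {
        public static void main(String[] args) {
            System.out.println(System.getProperty("java.vendor"));   // e.g. "Sun Microsystems Inc."
            System.out.println(System.getProperty("java.version"));  // e.g. "1.6.0_26"
        }
    }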

Re: Difference between DFS Used and Non-DFS Used

2011-07-08 Thread Harsh J
It is just for information's sake (because it can be computed from the data already collected). The space is accounted for just to let you know that there's something being stored on the DataNodes apart from the HDFS data, in case you are running out of space.
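
For reference, the NameNode web UI derives the figure along these lines (my reading of the metric, not quoted from the thread):

    Non-DFS Used = Configured Capacity - DFS Used - DFS Remaining

So on a 1000 GB volume holding 600 GB of HDFS blocks with 250 GB remaining, 150 GB shows up as non-DFS used.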

RE: Difference between DFS Used and Non-DFS Used

2011-07-08 Thread Sagar Shukla
Thanks Harsh. My first question still remains unanswered: why does it require non-DFS storage? If it is cache data, then it should get flushed from the system after a certain interval of time. And if it is useful data, then it should have been part of the used DFS data. I have a setup in which DFS

Re: HTTP Error

2011-07-08 Thread Joey Echeverria
It looks like both datanodes are trying to serve data out of the same directory. Is there any chance that both datanodes are using the same NFS mount for the dfs.data.dir? If not, what I would do is delete the data from ${dfs.data.dir} and then re-format the namenode. You'll lose all of your
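
For reference, on a 0.20-era tarball install that reset would look roughly like this (a general sketch, not Joey's exact steps): stop the daemons with bin/stop-all.sh, clear the contents of each node's dfs.data.dir, run bin/hadoop namenode -format, then bring everything back up with bin/start-all.sh.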

Re: Difference between DFS Used and Non-DFS Used

2011-07-08 Thread Harsh J
I did not get that question: require? It's not a count of something HDFS uses, just of what lies outside of it (logs, other apps, the OS, whatever else uses space would show up in that metric). I am not sure I understand you; isn't 250 GB already utilized, looking at your disks?

Re: Difference between DFS Used and Non-DFS Used

2011-07-08 Thread Suresh Srinivas
Non-DFS storage is not required; it is provided as information only, to show how the storage is being used. The available storage on the disks is used for both DFS and non-DFS data (MapReduce shuffle output and any other files that could be on the disks). See if you have unnecessary files or shuffle
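
A minimal sketch of pulling the same numbers programmatically, assuming the 0.20-era DistributedFileSystem.getDiskStatus() API (the cast and method names below are from that era and may differ in later releases):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class DfsUsage {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
            DistributedFileSystem.DiskStatus ds = dfs.getDiskStatus();
            long capacity  = ds.getCapacity();
            long dfsUsed   = ds.getDfsUsed();
            long remaining = ds.getRemaining();
            // Non-DFS used is derived, not tracked directly:
            System.out.println("Non-DFS used: " + (capacity - dfsUsed - remaining) + " bytes");
        }
    }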

RE: Difference between DFS Used and Non-DFS Used

2011-07-08 Thread Sagar Shukla
Hi Suresh / Harsh, Thanks for the details. Let me go over the setup again and get some understanding of what you are saying. Thanks, Sagar

Re: Cluster Tuning

2011-07-08 Thread Juan P.
Hey guys, Thanks all of you for your help. Joey, I tweaked my MapReduce to serialize/deserialize only essential values and added a combiner, and that helped a lot. Previously I had a domain object which was being passed between Mapper and Reducer when I only needed a single value. Esteban, I
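
A sketch of what that change could look like with the 0.20 old API (not Juan's actual code; the class and value type are illustrative): emit a primitive Writable instead of the whole domain object, and register the reducer as a combiner so values are pre-aggregated on the map side.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class SumReducer extends MapReduceBase
            implements Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text key, Iterator<LongWritable> values,
                           OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
            long sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();  // sum a single primitive instead of a domain object
            }
            out.collect(key, new LongWritable(sum));
        }
    }

    // In the job driver:
    //   conf.setCombinerClass(SumReducer.class);
    //   conf.setReducerClass(SumReducer.class);

This works as a combiner because summing is associative and the combiner's output types match its input types, which holds here.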

Re: Cluster Tuning

2011-07-08 Thread Juan P.
Here's another thought. I realized that the reduce operation in my map/reduce jobs finishes in a flash, but it goes really slowly until the mappers end. Is there a way to configure the cluster to make the reduce wait for the map operations to complete? Especially considering my hardware constraints

Re: Cluster Tuning

2011-07-08 Thread Joey Echeverria
Set mapred.reduce.slowstart.completed.maps to a number close to 1.0. 1.0 means the maps have to completely finish before the reduce starts copying any data. I often run jobs with this set to 0.90-0.95. -Joey
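
In driver code that would look something like this (a 0.20-era JobConf fragment; MyJob is a placeholder class name):

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class);  // MyJob: your job's driver class
    // Don't start copying map output to reducers until ~95% of maps are done.
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.95f);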

jobtracker.info could only be replicated to 0 nodes, instead of 1

2011-07-08 Thread Gustavo Pabon
Dear Hadoop Users, I am very new to Hadoop; I am just trying to run the tutorials. Currently I am trying to run the Pseudo-Distributed Operation (http://hadoop.apache.org/common/docs/stable/single_node_setup.html). I have found that there are other users who have had this same problem. But

check namenode, jobtracker, datanodes and tasktracker status

2011-07-08 Thread Marc Sturlese
Hey there, I've written some scripts to check DFS disk space, number of datanodes, number of tasktrackers, heap in use... I'm on Hadoop 0.20.2, and to do that I use the DFSClient and JobClient APIs. I do things like: JobClient jc = new JobClient(socketJT, conf); ClusterStatus clusterStatus =
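
A fleshed-out version of that pattern might look like this (a sketch against the 0.20 APIs; the JobTracker host and port are placeholders):

    import java.net.InetSocketAddress;
    import org.apache.hadoop.mapred.ClusterStatus;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ClusterCheck {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf();
            InetSocketAddress socketJT = new InetSocketAddress("jobtracker.example.com", 8021);
            JobClient jc = new JobClient(socketJT, conf);
            try {
                ClusterStatus clusterStatus = jc.getClusterStatus();
                System.out.println("TaskTrackers:  " + clusterStatus.getTaskTrackers());
                System.out.println("Map slots:     " + clusterStatus.getMaxMapTasks());
                System.out.println("Reduce slots:  " + clusterStatus.getMaxReduceTasks());
            } finally {
                jc.close();  // release the RPC connection (see Bharath's note below)
            }
        }
    }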

Re: check namenode, jobtracker, datanodes and tasktracker status

2011-07-08 Thread Bharath Mundlapudi
Shouldn't be a problem. But making sure you disconnect this monitoring client's connection might be helpful at peak loads. -Bharath

Re: Can i safely set dfs.blockreport.intervalMsec to very large value (1 year or more?)

2011-07-08 Thread Matt Foley
Hi Moon, The periodic block report is constructed entirely from info in memory, so there is no complete scan of the filesystem for this purpose. It defaults to being sent only once per hour from each datanode, and each DN calculates a random start time for the hourly cycle
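
For context, the property named in the subject controls that interval; here is a minimal fragment illustrating the property name and its hourly default, read the way a daemon would (a sketch; in practice you would set the value in hdfs-site.xml on each DataNode):

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Default is one block report per hour per DataNode, in milliseconds.
    long interval = conf.getLong("dfs.blockreport.intervalMsec", 60L * 60 * 1000);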

Re: Cluster Tuning

2011-07-08 Thread Bharath Mundlapudi
Slow start is an important parameter; it definitely impacts job runtime. My experience in the past has been that setting this parameter too low or too high can cause issues with job latencies. If you are always running the same job, then it's easy to set the right value, but if your cluster is