Hello Kumar,
Going forward, you could also send Cloudera-distribution-specific questions
directly to cdh-u...@cloudera.org (
groups.google.com/a/cloudera.org ).
Glad to know your RPM installed successfully, and yes, the Sun JDK RPM is
required to be installed -- OpenJDK wouldn't work as well as
It is just for information's sake (because it can be computed from the
data collected). The space is accounted for just to let you know that
there's something being stored on the DataNodes apart from the HDFS data,
in case you are running out of space.
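If you want to see where that number comes from, here is a rough, untested
sketch (it assumes the 0.20.x-era DistributedFileSystem.getDataNodeStats()
API; "non-DFS used" is just capacity minus DFS-used minus remaining):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class NonDfsUsed {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes fs.default.name points at HDFS, so FileSystem.get() returns a DistributedFileSystem.
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
    for (DatanodeInfo dn : dfs.getDataNodeStats()) {
      // Non-DFS used is whatever sits on the configured volumes that HDFS itself did not write.
      long nonDfs = dn.getCapacity() - dn.getDfsUsed() - dn.getRemaining();
      System.out.println(dn.getName() + " non-DFS used (bytes): " + Math.max(nonDfs, 0));
    }
  }
}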
On Fri, Jul 8, 2011 at 10:18 AM, Sagar Shukla
Thanks Harsh. My first question still remains unanswered - why does it require
non-DFS storage? If it is cache data, then it should get flushed from the
system after a certain interval of time. And if it is useful data, then it
should have been part of the used DFS data.
I have a setup in which DFS
It looks like both datanodes are trying to serve data out of the same
directory. Is there any chance that both datanodes are using the same NFS
mount for dfs.data.dir?
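One quick way to check is to print what each datanode's configuration
resolves dfs.data.dir to (rough sketch, assuming the 0.20-era property name
and that hdfs-site.xml is on the classpath):

import org.apache.hadoop.conf.Configuration;

public class PrintDataDirs {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource("hdfs-site.xml");  // pick up the datanode's HDFS settings
    // If this prints the same NFS-mounted path on both machines, that's the problem.
    System.out.println("dfs.data.dir = " + conf.get("dfs.data.dir"));
  }
}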
If not, what I would do is delete the data from ${dfs.data.dir} and then
re-format the namenode. You'll lose all of your
I did not get that question: require? It's not a count of something
HDFS uses, just of what lies outside of it (logs, other apps, the OS;
whatever else uses space would show up in that metric). I'm not sure I
understand you -- isn't 250 GB already utilized, looking at your disks?
On Fri, Jul 8, 2011 at 4:54 PM, Sagar
Non-DFS storage is not required; it is provided as information only, to show
how the storage is being used.
The available storage on the disks is used for both DFS and non-DFS data
(MapReduce shuffle output and any other files that could be on the disks).
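If you want to see how much the local MapReduce scratch directories are
holding on a node, here is an untested sketch (property name mapred.local.dir
as in 0.20.x; the fallback path below is just a placeholder):

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileUtil;

public class LocalDirUsage {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // mapred.local.dir can be a comma-separated list of directories.
    for (String dir : conf.getStrings("mapred.local.dir", "/tmp/hadoop/mapred/local")) {
      // FileUtil.getDU() walks the directory and sums the file sizes.
      System.out.println(dir + " -> " + FileUtil.getDU(new File(dir)) + " bytes");
    }
  }
}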
See if you have unnecessary files or shuffle
Hi Suresh / Harsh,
Thanks for the details. Let me go over the setup again and get some
understanding of what you are saying.
Thanks,
Sagar
-Original Message-
From: Suresh Srinivas [mailto:srini30...@gmail.com]
Sent: Friday, July 08, 2011 5:43 PM
To: common-user@hadoop.apache.org
Hey guys,
Thanks to all of you for your help.
Joey,
I tweaked my MapReduce job to serialize/deserialize only essential values and
added a combiner, and that helped a lot. Previously I had a domain object
which was being passed between the Mapper and Reducer when I only needed a
single value.
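In case it helps anyone else, here is a minimal, untested sketch of the idea
(hypothetical LeanJob class using the old mapred API; the point is emitting
just an IntWritable and reusing the reducer as the combiner instead of
shipping a full domain object):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class LeanJob {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final IntWritable one = new IntWritable(1);
    public void map(LongWritable key, Text line,
                    OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
      // Only a small count travels between map and reduce, not a whole domain object.
      out.collect(new Text(line.toString().split(",")[0]), one);
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> vals,
                       OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
      int sum = 0;
      while (vals.hasNext()) sum += vals.next().get();
      out.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(LeanJob.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);  // pre-aggregate on the map side
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, args[0]);
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}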
Esteban,
I
Here's another thought. I realized that the reduce operation in my
map/reduce jobs finishes in a flash, but it goes really slowly until the
mappers end. Is there a way to configure the cluster to make the reduce wait
for the map operations to complete? Especially considering my hardware
constraints
Set mapred.reduce.slowstart.completed.maps to a number close to 1.0.
A value of 1.0 means the maps have to completely finish before the reduces
start copying any data. I often run jobs with this set to 0.90-0.95.
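For example (untested fragment, in the style of the 0.20.x JobConf API; MyJob
is just a placeholder driver class, and the property is the same one named
above):

// Make reducers wait until ~95% of the maps have finished before they
// start copying map output.
JobConf conf = new JobConf(MyJob.class);
conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.95f);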
-Joey
On Fri, Jul 8, 2011 at 11:25 AM, Juan P. gordoslo...@gmail.com wrote:
Here's another
Dear Hadoop Users,
I am very new to Hadoop; I am just trying to run the tutorials.
Currently I am trying to run the Pseudo-Distributed Operation
(http://hadoop.apache.org/common/docs/stable/single_node_setup.html).
I have found that there are other users who have had this same
problem. But
Hey there,
I've written some scripts to check DFS disk space, the number of datanodes,
the number of tasktrackers, heap in use...
I'm on Hadoop 0.20.2, and to do that I use the DFSClient and JobClient
APIs.
I do things like:
JobClient jc = new JobClient(socketJT, conf);
ClusterStatus clusterStatus = jc.getClusterStatus();
Shouldn't be a problem. But to be safe, making sure you disconnect this
monitoring client when it is done might be helpful at peak loads.
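Something along these lines, for instance (untested sketch; socketJT and conf
are the same objects from your snippet, and it assumes JobClient's close()
method is available in your 0.20.x build):

JobClient jc = new JobClient(socketJT, conf);
try {
  ClusterStatus clusterStatus = jc.getClusterStatus();
  System.out.println("TaskTrackers: " + clusterStatus.getTaskTrackers());
} finally {
  jc.close();  // drop the RPC connection to the JobTracker when done
}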
-Bharath
From: Marc Sturlese marc.sturl...@gmail.com
To: hadoop-u...@lucene.apache.org
Sent: Friday, July 8, 2011 10:49 AM
Hi Moon,
The periodic block report is constructed entirely from info in memory, so
there is no complete scan of the filesystem for this purpose. The periodic
block report defaults to being sent only once per hour from each datanode, and
each DN calculates a random start time for the hourly cycle
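For reference, the interval is configurable; a rough sketch of reading it
(property name dfs.blockreport.intervalMsec as in 0.20.x, default one hour):

Configuration conf = new Configuration();
// Defaults to 3,600,000 ms (one hour) if not overridden in hdfs-site.xml.
long blockReportIntervalMs = conf.getLong("dfs.blockreport.intervalMsec", 60L * 60 * 1000);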
Slow start is an important parameter and definitely impacts job runtime. My
experience in the past has been that setting this parameter too low or too
high can cause issues with job latencies. If you are trying to run the same
job repeatedly, then it's easy to set the right value, but if your cluster is