I was thinking not of M/R, but of the actual daemons: when I go and start up
a daemon (like below), they all use the same hadoop-env.sh, which only lets
you set HADOOP_HEAPSIZE once, not differently for each daemon type.
bin/hadoop-daemon.sh start namenode
bin/hadoop-daemon.sh
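For reference, that setting lives in conf/hadoop-env.sh; depending on the
version, the same file also exposes per-daemon *_OPTS variables that the start
scripts append after the global heap flag, so a daemon-specific -Xmx there can
win. A rough sketch (the heap values below are placeholders, not
recommendations):

  # conf/hadoop-env.sh
  export HADOOP_HEAPSIZE=1000                 # global default, in MB
  # per-daemon overrides, if your bin/hadoop honors these variables:
  export HADOOP_NAMENODE_OPTS="-Xmx2048m $HADOOP_NAMENODE_OPTS"
  export HADOOP_DATANODE_OPTS="-Xmx1024m $HADOOP_DATANODE_OPTS"
  export HADOOP_JOBTRACKER_OPTS="-Xmx2048m $HADOOP_JOBTRACKER_OPTS"
  export HADOOP_TASKTRACKER_OPTS="-Xmx1024m $HADOOP_TASKTRACKER_OPTS"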
If you need to set the Java options for memory, you can do this via the
configuration of your MR job.
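For example, a sketch (mapred.child.java.opts is the stock property for the
task child JVMs; the class name and heap value are placeholders):

  JobConf job = new JobConf(MyJob.class);           // MyJob is a placeholder
  job.set("mapred.child.java.opts", "-Xmx512m");    // heap for map/reduce child JVMs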
-Original Message-
From: Fernando Padilla [mailto:f...@alum.mit.edu]
Sent: Wednesday, July 22, 2009 9:11 AM
To: common-user@hadoop.apache.org
Subject: best way to set memory
Another approach, arguably a bit better in my opinion, is through the
hadoop-gpl-compression project (http://code.google.com/p/hadoop-gpl-compression/
). It also incorporates Johan Oskarsson's H-4640 patch. A detailed
description of how to use it with an lzo-less Hadoop distribution can be
found [...]
So.. I want to have different memory profiles for
NameNode/DataNode/JobTracker/TaskTracker.
But it looks like I only have one environment variable to modify,
HADOOP_HEAPSIZE, even though I might be running more than one daemon on a
single box/deployment/conf directory.
Is there a proper way to set the memory for each daemon separately?
Thanks Aaron, for the quick response.
Best Regards,
Danny
-Original Message-
From: Aaron Kimball [mailto:aa...@cloudera.com]
Sent: Tuesday, July 21, 2009 9:10 PM
To: common-user@hadoop.apache.org
Subject: Re: native-lzo library not available issue with terasort
Native LZO support was removed from Hadoop due to licensing
restrictions. See
http://www.cloudera.com/blog/2009/06/24/parallel-lzo-splittable-compression-for-hadoop/
for a writeup on how to reenable it in your local build.
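If you rebuild with the hadoop-gpl-compression bits, the registration usually
ends up looking roughly like this in the site config (a sketch; the
com.hadoop.compression.lzo classes come from that project, and the native LZO
libraries still have to be on the library path):

  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
  </property>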
- Aaron
On Tue, Jul 21, 2009 at 7:02 PM, Gross, Danny wrote:
> Hello,
I'd say that this is your problem right here:
Heap Size is 972.5 MB / 972.5 MB (100%)
As suspected, your very small block size + many files has completely
filled the NameNode heap. All bets are off as to what Hadoop will do
at this point.
Potential solutions:
1) increase the HADOOP_HEAPSIZE parameter in conf/hadoop-env.sh (see the sketch below) [...]
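For instance (a sketch; the value is hypothetical and HADOOP_HEAPSIZE is in MB):

  # conf/hadoop-env.sh
  export HADOOP_HEAPSIZE=2000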
Hm. What version of Hadoop are you running? Have you modified the
log4j.properties file in other ways? The logfiles generated by Hadoop
should, by default, switch to a new file every day, appending the
previous day's date to the closed log file (e.g.,
"hadoop-hadoop-datanode-jargon.log.2009-07-13"
Hello,
I've been running terasort on multiple cluster configurations, and
attempted to duplicate some of the configuration settings that Yahoo!
used for the Minute Sort.
In particular, I set the mapred.map.output.compression.codec property to the
value "org.apache.hadoop.io.compress.LzoCodec". [...]
And regarding your desire to set things on the command line: If your
program implements Tool and is launched via ToolRunner, you can
specify "-D myparam=myvalue" on the command line and it'll
automatically put that binding in the JobConf created for the tool,
retrieved via getConf().
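A minimal sketch of that pattern (class and property names here are
placeholders, not anything from this thread):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class MyTool extends Configured implements Tool {
    public int run(String[] args) throws Exception {
      // anything passed as -D key=value is already present in getConf()
      JobConf job = new JobConf(getConf(), MyTool.class);
      String value = job.get("myparam");        // placeholder property name
      System.out.println("myparam = " + value);
      // ... configure input/output paths and submit the job here ...
      return 0;
    }

    public static void main(String[] args) throws Exception {
      System.exit(ToolRunner.run(new Configuration(), new MyTool(), args));
    }
  }

Invoked, for example, as: bin/hadoop jar mytool.jar MyTool -D myparam=myvalue
(jar and class names are placeholders).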
- Aaron
Hello all:
I'm the administrator of an existing, shared HPC cluster. The
majority of the users of this cluster use MPI jobs, and do not wish to
change to hadoop or other systems. However, one of my users wishes to
run hadoop jobs on this cluster.
In order to accommodate the variety of users, [...]
On Mon, Jul 20, 2009 at 5:02 PM, Aaron Kimball wrote:
> There's likely another gotcha regarding the fact that various logs and job
> config files are written to the _logs directory under the output directory.
That can be turned off by setting hadoop.job.history.user.location to none.
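For example (a sketch): jobConf.set("hadoop.job.history.user.location", "none");
in the code that sets up the job.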
Configurations, which include JobConfs, are string-to-string maps.
Therefore, when you submit the job, you would do:
jobConf.set("my-string", args[2]);
and in the map:
jobConf.get("my-string")
of course it is better to use a static final string rather than a literal,
but it works either way.
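A slightly fuller sketch of the read side, using the old mapred API and
placeholder names (the Mapper sees the JobConf in configure()):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class MyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, NullWritable> {

    public static final String MY_KEY = "my-string";  // shared constant instead of a literal
    private String myValue;

    @Override
    public void configure(JobConf job) {
      myValue = job.get(MY_KEY);   // the value set at submit time shows up here
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, NullWritable> out, Reporter reporter)
        throws IOException {
      // use myValue however the job needs it
      out.collect(new Text(myValue + ":" + value), NullWritable.get());
    }
  }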
My Hadoop jobs get a lot of "rpc timeout" errors when the cluster is busy.
That's exactly what we don't want to see, because the jobs were running fine;
they just need to wait a little longer to process.
Is there a way to configure rpc time out value so that it can wait longer?
Thanks
Hi,
Recently we ran out of disk space on the hadoop machine, and on
investigation we found it was the hadoop log4j logs.
In the log4j.properties file I have set hadoop.root.logger=ERROR,
yet I still see the daily hadoop-hadoopadmin-*.log files with INFO level
logging in them. These never seem to [...]
It happens in "reduce"
2009/7/21 George Pang
> Hi users,
>
> Please help with this one - I got an error running a two-node cluster
> on big files; the error is:
>
> 2365222 [main] ERROR
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher -
> Error message from task (map) [...]
On Tue, Jul 21, 2009 at 3:26 AM, Steve Loughran wrote:
> Todd Lipcon wrote:
>
> On Sat, Jul 4, 2009 at 9:08 AM, David B. Ritch wrote:
>> Thanks, Todd. Perhaps I was misinformed, or misunderstood. I'll make
>> sure I close files occasionally, but it's good to know that the only
>> real issue is with data recovery after losing a node.
Hi Andraz,
First, thanks for the contribution. Could you create a JIRA ticket and
upload the code there? Due to ASF restrictions, all contributions must be
attached to a JIRA so you can officially grant permission to include the
code. The JIRA will also allow others to review and comment on the code.
Hello guys,
I would like to create a variable in the JobConf in order to set its value
from the command line
and get it again in the map function.
After a bit of reading here and there I understood that I have to do the
following:
In the Map class, before the map function, I have done this: [...]
An OS doesn't take much disk and doesn't require many operations.
The key segregation that you might like to do is separating local
intermediate storage from hadoop data. Separating the OS doesn't make much
difference.
On Tue, Jul 21, 2009 at 8:58 AM, Ryan Smith wrote:
> I would imagine os mgmt [...]
Hi users,
Please help with this one - I got an error running a two-node cluster
on big files; the error is:
2365222 [main] ERROR
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher -
Error message from task (map)
task_200907201602_0002_m_10 java.io.IOException:
Spill [...]
Hi all,
We are facing issues with an external application when it tries to write
data into HDFS using FSDataOutputStream. We are using hadoop-0.18.2
version. The code works perfectly fine as long as the data nodes are
doing well. If the data nodes are unavailable for some reason (no space
left, for example) [...]
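For context, a bare-bones sketch of the write path being described (the path
and payload are placeholders):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      FSDataOutputStream out = fs.create(new Path("/tmp/example.dat")); // placeholder path
      try {
        out.writeBytes("some payload\n");  // the external application's data would go here
      } finally {
        // on 0.18.x, datanode trouble often only surfaces as an IOException
        // from write() or close(), so both need handling by the caller
        out.close();
      }
    }
  }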
On Jul 21, 2009, at 8:28 AM, Ted Dunning wrote:
There are already several such efforts.
Pig has PigMix
Hadoop has terasort and likely some others as well.
Hadoop has the terasort, and grid mix. There is even a new version of
the grid mix coming out. Look at:
https://issues.apache.org/ji
Thanks, that makes sense about the large cluster reboot. What about using
usb keys for the OS? I would imagine OS mgmt would be easier and you could
use all the disk space in the machines for data. Any ideas on this?
On Tue, Jul 21, 2009 at 6:22 AM, Steve Loughran wrote:
> Ryan Smith wrote:
There are already several such efforts.
Pig has PigMix
Hadoop has terasort and likely some others as well.
On Tue, Jul 21, 2009 at 3:14 AM, Steve Loughran wrote:
> 2. I'm thinking of starting a little benchmarking subproject under the
> Hadoop umbrella, just thinking of a witty enough title. [...]
On Tue, Jul 21, 2009 at 9:44 AM, Tim Nelson wrote:
> I have a question that I think I already know the answer to but I would like
> verification. I have a demo cluster comprised of two master nodes and eight
> slaves (all 1x1.2 Ghz cpu / 1 Gig Ram / 1x250 Gig Sata 7200 rpm hard
> drives).
[...]
More drives will certainly help in lots of ways but no new drive should fail
within a week. I'd assume you have either power or heating issues.
Dejan
On Mon, Jul 20, 2009 at 10:44 PM, Tim Nelson <
had...@enigmasupercomputing.com> wrote:
> I have a question that I think I already know the answer to [...]
Andrew,
Thanks for the information. Can you give me some numbers on transfer
rates from S3 into HDFS? Processing the content in place in S3 isn't
an option for us.
Larry
On Fri, Jul 17, 2009 at 5:57 PM, Hitchcock, Andrew wrote:
> Hi Larry,
>
> I'm an engineer with Elastic MapReduce. The latency [...]
Ravi Phulari wrote:
Hello Roman,
If you have a huge cluster then it's good to have the JobTracker and NameNode
running on different machines.
If your cluster is small enough (around 20-30 machines or fewer) then you can
run the JobTracker and NameNode on the same machine.
Again, it depends on the hardware configuration.
Boyu Zhang wrote:
Dear all,
Are there any other virtual machines that I can use to provide a Hadoop
cluster over a physical cluster?
1. You can bring up Hadoop under VMWare, VirtualBox, Xen. There are
problems with Centos5.x/RHEL5 under VirtualBox (some clock issue
generates 100% load [...])
Todd Lipcon wrote:
On Sat, Jul 4, 2009 at 9:08 AM, David B. Ritch wrote:
Thanks, Todd. Perhaps I was misinformed, or misunderstood. I'll make
sure I close files occasionally, but it's good to know that the only
real issue is with data recovery after losing a node.
Just to be clear, there [...]
Ryan Smith wrote:
Anyone use DRBL for hadoop nodes?
http://drbl.sourceforge.net/
Just wondering if it's common or found to not be a good idea.
Also, I found this project, which was released today; hopefully some code will
appear soon: [...]
The trouble with full diskless boot is the big-datacentre-restart [...]
Ted Dunning wrote:
> It is very unusual to have enough power to fill a rack with servers. Check
> your power and heat loading calculations.
>
> You might consider also a new box that Silicon Mechanics has. It is
> essentially four 1U servers in a 2U package. Each of the four
> servers has [...]
If it is useful to anyone:
here's a codec to support getting data from .tar.gz
Basically, the assumption is that instead of having just one text file
gzipped, you have many text files tarred and gzipped. Therefore it just
concatenates all the files inside the .tar.gz archive.
The source was based on the Gzip codec [...]
This error occurs when several reducers are unable to fetch the given map
output (attempt_200907202331_0001_m_01_0 in your example).
I am guessing that there is a configuration issue in your setup -- the
reducers are not able to contact/transfer map outputs from the TaskTracker.
The TT log on [...]
And I also ran the random read benchmark provided in
https://issues.apache.org/jira/browse/HDFS-236.
Here is the result:
09/07/21 14:55:45 INFO mapred.FileInputFormat: Date & time: Tue
Jul 21 14:55:45 CST 2009
09/07/21 14:55:45 INFO mapred.FileInputFormat: Number of files: 10
[...]
Hi users,
I got this "Too many fetch failures" in the following error message:
09/07/20 23:33:39 INFO mapred.JobClient: map 100% reduce 16%
09/07/20 23:46:22 INFO mapred.JobClient: Task Id :
attempt_200907202331_0001_m_01_0, Status : FAILED
Too many fetch-failures
09/07/20 23:46:37 INFO [...]