I am interested in calculating the number of nodes and amount of
storage per node I need in order to meet an initial usable storage
volume for HDFS. For example, if I want a setup with 50 TBytes of
usable storage, how many data nodes would I need and how much storage
is in each node. Please includ
Why not just have a higher number of mappers? Why split into multiple
jobs? Is there a particular case where you think this would be useful?
On 9/9/09, Rakhi Khatwani wrote:
> Hi,
> Suppose I have an HDFS file with 10,000 entries, and I want my job to
> process 100 records at one time (to minimiz
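One way to sketch this (an assumption on my part, not what the thread settled
on) is the old-API NLineInputFormat, which hands each map task a fixed number
of input lines rather than a whole block:

// 0.20-era old-API driver sketch; paths come from the command line and the
// identity mapper is used only to keep the example self-contained.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class HundredAtATime {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HundredAtATime.class);
        conf.setJobName("hundred-at-a-time");
        conf.setInputFormat(NLineInputFormat.class);
        // each map task receives 100 lines of the input
        conf.setInt("mapred.line.input.format.linespermap", 100);
        conf.setMapperClass(IdentityMapper.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}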
Hi,
I also have similar kind of question.
Is it possible for a job to start reading a file (i.e., start a split) from a
specific position in the file rather than from the beginning? The idea is that
I have some information in a file; the first part of that information can only
be read in sequence, not in parallel, so I get this da
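One possible sketch (purely illustrative; the class and the property name are
made up, and it assumes a single input file) is to subclass the old-API
TextInputFormat and trim away everything before a known byte offset, so the
sequential header region never reaches the mappers:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Drops or trims splits that fall before a configurable byte offset.
public class SkipHeaderInputFormat extends TextInputFormat {
    @Override
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        long headerBytes = job.getLong("example.header.bytes", 0L); // hypothetical key
        List<InputSplit> kept = new ArrayList<InputSplit>();
        for (InputSplit s : super.getSplits(job, numSplits)) {
            FileSplit fs = (FileSplit) s;
            long end = fs.getStart() + fs.getLength();
            if (end <= headerBytes) {
                continue; // split lies entirely inside the header region
            }
            // a split trimmed to start mid-line simply skips ahead to the next newline
            long start = Math.max(fs.getStart(), headerBytes);
            kept.add(new FileSplit(fs.getPath(), start, end - start, fs.getLocations()));
        }
        return kept.toArray(new InputSplit[kept.size()]);
    }
}

The header itself would still have to be read once, sequentially, in the driver
before the job is submitted, e.g. so its contents can be passed along in the
JobConf.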
Hi,
Here are 2 possible ways for static data sharing:
1. Using the distributed cache - refer to
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#DistributedCache
2. Using the JobConf object -
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/conf/Configuration.html#set%28java.
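A minimal sketch of both (0.20-era old API; the path and the property name
below are placeholders, not anything taken from the tutorial links above):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class StaticDataSharing {
    public static void configure(JobConf conf) throws Exception {
        // 1. Distributed cache: ship a read-only HDFS file to every task node;
        //    tasks pick it up via DistributedCache.getLocalCacheFiles(conf).
        DistributedCache.addCacheFile(new URI("/shared/lookup.dat"), conf);

        // 2. Configuration/JobConf: embed small values in the job config;
        //    tasks read them back with conf.get("example.threshold").
        conf.set("example.threshold", "42");
    }
}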
Greetings,
I'm trying to set up HDFS on a single-node cluster. I've tried several
tutorials, but all of them lead to the same problem.
After the setup I can check that everything is working using:
jps
...
17421 NameNode
17519 DataNode
17611 SecondaryNameNode
17685 JobTracker
17778 TaskTracker
18425
Arvind Sharma wrote:
hmmm... I had seen some exceptions (I don't remember which ones) on MacOS. There was a missing JSR-223 engine on my machine.
Not sure why you would see this error on a Linux distribution.
From: Ted Yu
To: common-user@hadoop.apache.org
Sent
Hi,
I have a problem building libhdfs.so in Hadoop 0.20.1
From what I could see, the build process has changed significantly in 0.20.0
(as mentioned in http://issues.apache.org/jira/browse/HADOOP-3344), and "ant
compile-libhdfs -Dlibhdfs=1" can't be used anymore.
I'm trying to use standard
Hi everyone!
I tried a Hadoop cluster setup on 4 PCs. I ran into a problem about
hadoop-common. When I ran the command 'bin/hadoop jar hadoop-*-examples.jar
wordcount input output', the map tasks completed quickly, but the reduce
phase took very long to complete. I think it's caused by the config,
Maybe someone can correct me if I'm wrong, but this is what I did to get
libhdfs on 0.20.0 to build:
NOTE: on Debian, you need to apply a patch:
https://issues.apache.org/jira/browse/HADOOP-5611
Compile libhdfs: ant compile-contrib -Dlibhdfs=1
Then to install libhdfs in the local hadoop lib: an
If you really want to share read/write data, you can use a memcached server or
a file-based database like Tokyo Cabinet or BDB.
--
Thanks & Regards,
Chandra Prakash Bhagtani,
On Thu, Sep 10, 2009 at 2:22 PM, indoos wrote:
>
> Hi,
> Here are 2 possible ways for static data sharing-
> 1. Using distrib
You can try running the JobTracker on some other port. This port might be in
use.
--
Thanks & Regards,
Chandra Prakash Bhagtani,
On Thu, Sep 10, 2009 at 2:58 AM, gcr44 wrote:
>
> All,
>
> I'm setting up my first full Hadoop cluster. I did the Cygwin thing and
> everything works. I'm having probl
Thanks for the response.
I have already tried moving the JobTracker to several different ports, always
with the same result.
Chandraprakash Bhagtani wrote:
>
> You can try running the JobTracker on some other port. This port might be in
> use.
>
> --
> Thanks & Regards,
> Chandra Prakash Bhagtani,
>
On 9/9/09 5:21 PM, "Chad Potocky" wrote:
> I am interested in calculating the number of nodes and amount of
> storage per node I need in order to meet an initial usable storage
> volume for HDFS. For example, if I want a setup with 50 TBytes of
> usable storage, how many data nodes would I need
gcr44 wrote:
Thanks for the response.
I have already tried moving the JobTracker to several different ports, always
with the same result.
Chandraprakash Bhagtani wrote:
You can try running the JobTracker on some other port. This port might be in
use.
--
Thanks & Regards,
Chandra Prakash Bhagtani,
On
Just an idea ... we've had trouble with Hadoop using internal instead of
external addresses on Ubuntu. The data nodes can't connect to the
namenode if it's listening on an internal address. On the namenode, can
you run 'netstat -na'? What address is the namenode daemon bound to?
Steve L
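A common Ubuntu-specific cause (an assumption here, not something confirmed in
this thread) is the distribution's default /etc/hosts entry of the form

127.0.1.1   myhostname

(with myhostname as a placeholder). If fs.default.name uses that hostname, the
namenode can end up bound to loopback, which matches the "internal address"
symptom above and shows up as 127.0.x.x in the netstat output.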
Hi there,
I have three questions:
1) I have written a MapReduce app that implements MapRunnable as we needed
to be able to control the threads and share information between them.
What number of map tasks should I specify in my conf file? Should it be the
same as the number of nodes?
2) When w
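On (1), a small sketch of the usual old-API wiring (the runner class stands in
for the hypothetical MapRunnable implementation from the question; the number 8
is arbitrary). Note that setNumMapTasks is only a hint: the actual number of
map tasks follows the number of input splits, so it does not have to equal the
number of nodes.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;

public class DriverSketch {
    public static JobConf wire(Class<? extends MapRunnable> runner) {
        JobConf conf = new JobConf(DriverSketch.class);
        conf.setMapRunnerClass(runner);
        // Only a hint; the framework derives the real count from the splits.
        conf.setNumMapTasks(8);
        return conf;
    }
}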
Here's a setup I've used:
- configuration data distributed to the mappers / reducers using the
JobConf object
- BDBs (stored in ZIP packages on the HDFS) used for read/write data
across stages. The data flow is organized so that a single mapper modifies a
single database per stage, to avoid concurrency
Hi,
I am getting different values for START_TIME and FINISH_TIME for
the exact same task when looking at the history log of a sorter job.
E.g., grepping for a particular reduce task in the history log:
masternode:/home/hadoop-git # grep -r
"task_200909031613_0002_r_02"
/home/vliaskov/hadoop-
I found Chukwa to be an interesting project.
Can someone give a little detail on how freshly generated log files are
handled?
I have downloaded the source code. So a few filenames would help me better
understand.
Thanks
On Sep 10, 2009, at 10:44 AM, Ted Yu wrote:
I found Chukwa to be an interesting project.
Can someone give a little detail on how freshly generated log files are
handled?
I have downloaded the source code. So a few filenames would help me better
understand.
Please redirect your questi
I've spent the last couple of days cleaning up our current collection of shell
scripts that run our jobs, which got me wondering how hard it would be to
get Oozie up and running, or whether there is a better (i.e. even simpler)
system out there.
Has anyone outside Yahoo used Oozie? Does it work with
Samprita-
I'm assuming at this point that you have gmond installed on all nodes in
your cluster. Correct me if I'm assuming too much.
The next step is to configure gmond. See the gmond.conf man page.
# man gmond.conf
In particular, you are going to need to change the udp_send_channel. By
def
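As a rough illustration only (defaults and exact syntax vary between Ganglia
versions, and the hostname is a placeholder), a unicast send channel pointed at
the host running gmetad looks something like:

udp_send_channel {
  host = gmetad.example.com
  port = 8649
}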
That's 99% correct. If you want/need to run different versions of HDFS on
the two different clusters, then you can't use the hdfs:// protocol to access
both of them in the same command. In this case, use *hftp*://bla/ for the
source fs and hdfs://bla2/ for the dest fs (hftp is read-only, so run the
copy from the destination cluster).
- Aaron
On Tue, Sep 8, 2009 at
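For concreteness (hostnames, paths and the port are placeholders; 50070 is only
the usual namenode HTTP default that hftp reads through), the copy would then
be run from the destination cluster along the lines of:

hadoop distcp hftp://source-nn:50070/src/path hdfs://dest-nn/dst/path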
What do you do with the data on a failing disk when you replace it?
Our support person comes in occasionally, and often replaces several
disks when he does. These are disks that have not yet failed, but
firmware indicates that failure is imminent. We need to be able to
migrate our data off these
Hi David,
Unfortunately there's really no way to do what you're hoping to do in an
automatic way. You can move the block files (including their .meta files)
from one disk to another. Do this when the datanode daemon is stopped.
Then, when you start the datanode daemon, it will scan dfs.data.dir
I think decommissioning the node and replacing the disk is a cleaner
approach. That's what I'd recommend doing as well.
On 9/10/09, Alex Loddengaard wrote:
> Hi David,
> Unfortunately there's really no way to do what you're hoping to do in an
> automatic way. You can move the block files (inclu
Thank you both. That's what we did today. It seems fairly reasonable
when a node has a few disks, say 3-5. However, at some sites, with
larger nodes, it seems more awkward. When a node has a dozen or more
disks (as used in the larger terasort benchmarks), migrating the data
off all the disks is
I would recommend taking the node down without decommissioning, replacing
the disk, then bringing the node back up. After 10-20 minutes the name node
will figure things out and start replicating the missing blocks.
Rebalancing would be a good idea to fill the new disk. You could even do
this with
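For reference, the balancer can be kicked off from the command line as well
(the threshold percentage below is just an example): bin/start-balancer.sh, or
equivalently bin/hadoop balancer -threshold 10.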
What would happen if you did this without taking the node down? For
example, if you have hot-swappable drives in the node(s)? Will the
running datanode process pick up the fact that an entire partition goes
missing and reappears empty a few minutes later?
Or would it be better to at least shut o
I think that would be a bit too rude. The datanode scans the data partition
when it comes up. Better to give it the benefit of a bounce so that it can
inform the name node of the new state of affairs.
On Thu, Sep 10, 2009 at 8:16 PM, Michael Thomas wrote:
> Will the
> running datanode process p
Vasilis Liaskovitis wrote:
Hi,
I am getting different values for START_TIME and FINISH_TIME for
the exact same task when looking at the history log of a sorter job.
E.g., grepping for a particular reduce task in the history log:
masternode:/home/hadoop-git # grep -r
"task_200909031613_0002_r_0
-- Forwarded message --
From: ql yan
Date: 2009/9/10
Subject: about hadoop:reduce could not read map's output
To: common-user-h...@hadoop.apache.org, common-user@hadoop.apache.org
Hi everyone!
I tried a Hadoop cluster setup on 4 PCs. I ran into a problem about
hadoop-common. When I
Hi everyone!
I tried a Hadoop cluster setup on 4 PCs. I ran into a
problem about hadoop-common. When I ran the command 'bin/hadoop jar
hadoop-*-examples.jar wordcount input output', the map tasks completed
quickly, but the reduce phase took very long to complete. I
think it's caused by the config,
-----Original Message-----
From: ext David B. Ritch [mailto:david.ri...@gmail.com]
Sent: Friday, September 11, 2009 11:07
To: common-user@hadoop.apache.org
Subject: Re: Decommissioning Individual Disks
Thank you both. That's what we did today. It seems fairly reasonable
when a node has a few