Re: maybe a bug in hadoop?

2009-06-10 Thread Tim Wintle
On Wed, 2009-06-10 at 13:56 +0100, stephen mulcahy wrote:
 Hi,
 
 I've been running some tests with hadoop in pseudo-distributed mode. My 
 config includes the following in conf/hdfs-site.xml
 
 <property>
   <name>dfs.data.dir</name>
   <value>/hdfs/disk1, /hdfs/disk2, /hdfs/disk3, /hdfs/disk4</value>
 </property>
 
 When I started running hadoop I was disappointed to see that only 
 /hdfs/disk1 was created. It was only later I noticed that HADOOP_HOME 
 now contained the following new directories
 
 HADOOP_HOME/ /hdfs/disk2
 HADOOP_HOME/ /hdfs/disk3
 HADOOP_HOME/ /hdfs/disk4
 
 I guess any leading spaces should be stripped out of the data dir names? 
 Or maybe there is a reason for this behaviour. I thought I should 
 mention it just in case.

I thought I'd double check...

$ touch \ file\ with\ leading\ space
$ touch file\ without\ leading\ space
$ ls
 file with leading space
file without leading space

... so stripping it out could mean you couldn't enter some valid
directory names (but who has folders starting with a space?)

I'm sure my input wasn't very useful, but just a comment.

Tim Wintle




Re: Amazon Elastic MapReduce

2009-04-03 Thread Tim Wintle
On Fri, 2009-04-03 at 11:19 +0100, Steve Loughran wrote:
 True, but this way nobody gets the opportunity to learn how to do it 
 themselves, which can be a tactical error one comes to regret further 
 down the line. By learning the pain of cluster management today, you get 
 to keep it under control as your data grows.

Personally I don't want to have to learn (and especially not support in
production) the EC2 / S3 part, so it does sound appealing.

On a side note, I'd hope that at some point they give some control over
the priority of the overall job - on the level of "you can boot up these
machines whenever you want" versus "boot up these machines now" - that
should let them manage the load on their hardware and reduce costs
(which I'd obviously expect them to pass on to the users of low-priority
jobs). I'm not sure how that would fit into the "give me 10 nodes"
method at the moment.

 
 I am curious what bug patches AWS will supply, for they have been very 
 silent on their hadoop work to date.

I'm hoping it will involve security of EC2 images, but not expectant.





Re: How many people is using Hadoop Streaming ?

2009-04-03 Thread Tim Wintle
On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote:
  1) I can pick the language that offers a different programming
 paradigm (e.g. I may choose a functional language, or logic programming
 if they suit the problem better).  In fact, I can even choose Erlang
 for the map() and Prolog for the reduce().  Mixing and matching gives me
 more room to optimize.

Agreed (as someone who has written mappers/reducers in Python, Perl,
shell script and Scheme before).
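
For anyone curious, a streaming job along those lines can be launched
roughly like this (the path to the streaming jar and the script names
here are just placeholders for whatever you have locally):

$ bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
    -input /data/input \
    -output /data/output \
    -mapper mapper.py \
    -reducer reducer.pl \
    -file mapper.py -file reducer.pl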



Re: Does reduce start only after the map is completed?

2009-03-07 Thread Tim Wintle
On Sat, 2009-03-07 at 23:03 +0300, Mithila Nagendra wrote:
 Hey all
 I'm using hadoop version 0.18.3, and was wondering if the reduce phase
 starts only after the mapping is completed? Is it required that the Map
 phase is 100% done, or can it be programmed in such a way that the reduce
 starts earlier?

As I understand it, the reducers have three phases:

 1) Copy Data from the mappers (Shuffle)
 2) Sort the data on the reducer (by key)
 3) Actually run the data through the function you've defined.

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/Reducer.html

The Reducer tasks/processes start as soon as they are able to (I
believe), and copying data and sorting happen while there may still be
mappers running.

Stage (3) cannot run until stage (2) is completed, which obviously
cannot happen until all the mappers are complete.

In my experience, I haven't found this a major issue (especially if
there are many times more mappers than machines), since the shuffle and
sort stages take significant time and effort anyway.


Tim Wintle



Re: Off topic: web framework for high traffic service

2009-03-04 Thread Tim Wintle
On Wed, 2009-03-04 at 23:14 +0100, Lukáš Vlček wrote:
 Sorry for off topic question
It is very off topic.

 Any ideas, best practices, book recommendations, papers, tech talk links ...
I found this a nice little book:
http://developer.yahoo.net/blog/archives/2008/11/allspaw_capacityplanning.html

and this is a nice techtalk on why the servers aren't the most important
thing to worry about:
http://www.youtube.com/watch?v=BTHvs3V8DBA




Re: the question about the common pc?

2009-02-23 Thread Tim Wintle
On Mon, 2009-02-23 at 11:14 +, Steve Loughran wrote:
 Dumbo provides py support under Hadoop:
   http://wiki.github.com/klbostee/dumbo
   https://issues.apache.org/jira/browse/HADOOP-4304

Ooh, nice - I hadn't seen dumbo. That's far cleaner than the python
wrapper to streaming I'd hacked together.

I'm probably going to be using hadoop more again in the near future so
I'll bookmark that, thanks Steve.

Personally I only need text-based records, so I'm fine using a wrapper
around streaming.

Tim Wintle



Re: the question about the common pc?

2009-02-20 Thread Tim Wintle
On Fri, 2009-02-20 at 13:07 +, Steve Loughran wrote:
 I've been doing MapReduce work over small in-memory datasets 
 using Erlang,  which works very well in such a context.

I've got some (mainly Python) scripts (that will probably be run with
hadoop streaming eventually) that I run over multiple CPUs/cores on a
single machine by opening the appropriate number of named pipes and
using tee and awk to split the workload.

Something like:

 mkfifo mypipe1
 mkfifo mypipe2
 awk '0 == NR % 2' < mypipe1 | ./mapper | sort > map_out_1 &
 awk '0 == (NR+1) % 2' < mypipe2 | ./mapper | sort > map_out_2 &
 ./get_lots_of_data | tee mypipe1 > mypipe2

(wait until it's done... or send a signal from the get_lots_of_data
process on completion if it's a cronjob)

 sort -m map_out* | ./reducer > reduce_out

This works around the global interpreter lock in Python quite nicely, and
doesn't need the people who write the scripts (who may not be programmers)
to understand multiple processes etc. - just stdin and stdout.
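
For completeness, a sketch of the "wait until it's done" step using the
shell built-in wait (assuming the two mapper pipelines were backgrounded
with & in the same shell, as above):

 ./get_lots_of_data | tee mypipe1 > mypipe2
 wait    # blocks until both backgrounded mapper pipelines have finished
 sort -m map_out* | ./reducer > reduce_out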

Tim Wintle



Re: Re:Re: the question about the common pc?

2009-02-18 Thread Tim Wintle
On Thu, 2009-02-19 at 13:43 +0800, 柳松 wrote:
 Hadoop is designed for high-performance computing equipment, but is claimed
 to be fit for everyday PCs.

The phrase "high-performance computing equipment" makes me think of
InfiniBand, fibre all over the place, etc.


Hadoop doesn't need that; it runs well on standard PC hardware - i.e. no
special hardware you couldn't find in a standard PC. That doesn't mean
you should run it on PCs that are being used for other things, though.

I found that hadoop ran OK on fairly old hardware - a load of old
PowerPC Macs (running Linux) churned through some jobs quickly, and
I've actually run it on people's office machines during the night (not
on Windows). I did end up having to add an extra switch in for the part
of the network that was only 100 Mbps to get the throughput, though.

Of course ideally you would be running it on a rack of 1U servers, but
that's still normally standard PC hardware.






Re: architecture diagram

2008-10-01 Thread Tim Wintle
I normally find the intermediate stage of copying data to the reducers
from the mappers to be a significant step - but that's not over the best
quality switches...

The mappers and reducers work on the same boxes, close to the data.  


On Wed, 2008-10-01 at 10:59 -0700, Alex Loddengaard wrote:
 
 It really depends on your job I think.  Often reduce steps can be the
 bottleneck if you want a single output file (one reducer).
 
 Hope this helps.
 
 Alex



Re: Accessing input files from different servers

2008-09-12 Thread Tim Wintle

 a) Do I need to install hadoop and start running HDFS (using start-dfs.sh)
 on all those machines where the log files are getting created? And then do a
 file get from the central HDFS server?

I'd install hadoop on the machine, but you don't have to start any nodes
there - you can log onto a cluster running elsewhere using the
command-line tools to put / get data from the cluster.

From what I recall, this is actually better than running a DataNode
locally, since if you put data on from a machine that is itself a
DataNode, the blocks will tend to be written to that local machine.
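
As a rough sketch (the paths here are made up), with the conf on that
machine pointing at the remote namenode you can just do things like:

 bin/hadoop dfs -put /var/log/myapp/today.log /logs/today.log
 bin/hadoop dfs -get /output/part-00000 /tmp/part-00000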


Tim




Re: HDFS Vs KFS

2008-08-21 Thread Tim Wintle
I haven't used KFS, but I believe a major difference is that you can
(apparently) mount KFS as a standard device under Linux, allowing you to
read and write directly to it without having to re-compile the
application (as far as I know that's not possible with HDFS, although
the last time I installed HDFS was 0.16)

... It is definitely much newer.


On Fri, 2008-08-22 at 01:35 +0800, rae l wrote:
 On Fri, Aug 22, 2008 at 12:34 AM, Wasim Bari [EMAIL PROTECTED] wrote:
 
  KFS is also another Distributed file system implemented in C++. Here you can
  get details:
 
  http://kosmosfs.sourceforge.net/
 
 Just from the basic information:
 
 http://sourceforge.net/projects/kosmosfs
 
 # Developers : 2
 # Development Status : 3 - Alpha
 # Intended Audience : Developers
 # Registered : 2007-08-30 21:05
 
 and from the history of subversion repository:
 
 http://kosmosfs.svn.sourceforge.net/viewvc/kosmosfs/trunk/
 
 I think it's just not as stable or as widely used as HDFS:
 
 * HDFS is stable and available at production level.
 
 This may not be totally right, and I'm waiting for someone more familiar
 with KFS to talk about this.




Re: anybody know how to run sshd in LEOPARD

2008-06-17 Thread Tim Wintle
I've set hadoop up on a load of Intel Macs before - I think that sshd is
what Apple calls "Remote Login" or something like that - it was a GUI
option to allow an account to log in remotely.
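
If you prefer the command line, I think the same switch can be flipped
with systemsetup (untested on Leopard, so treat this as a sketch):

 sudo systemsetup -setremotelogin on
 sudo systemsetup -getremotelogin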

Hope that helps

On Tue, 2008-06-17 at 14:27 +0800, j.L wrote:
 i wanna try hadoop, but i can't run sshd when i use macbook(leopard)
 



RE: Questions regarding configuration parameters...

2008-02-22 Thread Tim Wintle
I have had exactly the same problem with using the command line to cat
files - they can take ages, although I don't know why. Network
utilisation does not seem to be the bottleneck, though.

(Running 0.15.3)

Is the slow part of the reduce while you are waiting for the map data to
copy over to the reducers? I believe there was a bug prior to 0.16.0
that could leave you waiting for a long time if mappers had been too
slow to respond to previous requests (even if they were completely free
now)


On Thu, 2008-02-21 at 21:51 -0800, C G wrote:
 My performance problems fall into 2 categories:

   1.  Extremely slow reduce phases - our map phases march along at impressive 
 speed, but during reduce phases most nodes go idle...the active machines 
 mostly clunk along at 10-30% CPU.  Compare this to the map phase where I get 
 all grid nodes cranking away at  100% CPU.  This is a vague explanation I 
 realize.

   2.  Pregnant pauses during dfs -copyToLocal and -cat operations.  
 Frequently I'll be iterating over a list of HDFS files cat-ing them into one 
 file to bulk load into a database.  Many times I'll see one of the 
 copies/cats sit for anywhere from 2-5 minutes.  During that time no data is 
 transferred, all nodes are idle, and absolutely nothing is written to any of 
 the logs.  The file sizes being copied are relatively small...less than 1G 
 each in most cases.

   Both of these issues persist in 0.16.0 and definitely have me puzzled.  I'm 
 sure that I'm doing something wrong/non-optimal w/r/t slow reduce phases, but 
 the long pauses during a dfs command line operation seems like a bug to me.  
 Unfortunately I've not seen anybody else report this.

   Any thoughts/ideas most welcome...

   Thanks,
   C G
   
 
 Joydeep Sen Sarma [EMAIL PROTECTED] wrote:
   
  The default value are 2 so you might only see 2 cores used by Hadoop per
  node/host.
 
 that's 2 each for map and reduce. so theoretically - one could fully utilize 
 a 4 core box with this setting. in practice - a little bit of 
 oversubscription (3 each on a 4 core) seems to be working out well for us 
 (maybe overlapping some compute and io - but mostly we are trading off for 
 higher # concurrent jobs against per job latency).
 
 unlikely that these settings are causing slowness in processing small amounts 
 of data. send more details - what's slow (map/shuffle/reduce)? check cpu 
 consumption when map task is running .. etc.
 
 
 -Original Message-
 From: Andy Li [mailto:[EMAIL PROTECTED]
 Sent: Thu 2/21/2008 2:36 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Questions regarding configuration parameters...
 
 Try the 2 parameters to utilize all the cores per node/host.
 
 
 
 <property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>7</value>
   <description>The maximum number of map tasks that will be run
   simultaneously by a task tracker.</description>
 </property>
 
 <property>
   <name>mapred.tasktracker.reduce.tasks.maximum</name>
   <value>7</value>
   <description>The maximum number of reduce tasks that will be run
   simultaneously by a task tracker.</description>
 </property>
 
 The default values are 2, so you might only see 2 cores used by Hadoop per
 node/host.
 If each system/machine has 4 cores (dual dual core), then you can change
 them to 3.
 
 Hope this works for you.
 
 -Andy
 
 
  On Wed, Feb 20, 2008 at 9:30 AM, C G wrote:
 
  Hi All:
 
  The documentation for the configuration parameters mapred.map.tasks and
  mapred.reduce.tasks discuss these values in terms of number of available
  hosts in the grid. This description strikes me as a bit odd given that a
  host could be anything from a uniprocessor to an N-way box, where values
  for N could vary from 2..16 or more. The documentation is also vague about
   computing the actual value. For example, for mapred.map.tasks the doc
   says "...a prime number several times greater...". I'm curious about how people
  are interpreting the descriptions and what values people are using.
  Specifically, I'm wondering if I should be using core count instead of
  host count to set these values.
 
  In the specific case of my system, we have 24 hosts where each host is a
  4-way system (i.e. 96 cores total). For mapred.map.tasks I chose the
  value 173, as that is a prime number which is near 7*24. For
  mapred.reduce.tasks I chose 23 since that is a prime number close to 24.
  Is this what was intended?
 
  Beyond curiousity, I'm concerned about setting these values and other
  configuration parameters correctly because I am pursuing some performance
  issues where it is taking a very long time to process small amounts of data.
  I am hoping that some amount of tuning will resolve the problems.
 
  Any thoughts and insights most appreciated.
 
  Thanks,
  C G
 
 
 
 
 
 
 




Re: Calculations involve large datasets

2008-02-22 Thread Tim Wintle
Have you seen PIG:
http://incubator.apache.org/pig/

It generates hadoop code and is more query-like, and (as far as I
remember) includes union, join, etc.

Tim

On Fri, 2008-02-22 at 09:13 -0800, Chuck Lan wrote:
 Hi,
 
 I'm currently looking into how to better scale the performance of our
 calculations involving large sets of financial data.  It is currently using
 a series of Oracle SQL statements to perform the calculations.  It seems to
 me that the MapReduce algorithm may work in this scenario.  However, I
 believe I would need to perform some denormalization of data in order for this
 to work.  Do I have to?  Or is there a good way to implement joins within
 the Hadoop framework efficiently?
 
 Thanks,
 Chuck



Re: Hadoop summit / workshop at Yahoo!

2008-02-21 Thread Tim Wintle
I would certainly appreciate being able to watch them online too, and
they would help spread the word about hadoop - think of all the people
who watch Google's Techtalks (am I allowed to say the G word around
here?).



On Thu, 2008-02-21 at 08:34 +0100, Lukas Vlcek wrote:
 Online webcast/recorded video would be really appreciated by a lot of people.
 Please post the content online! (not only can you target a much greater
 audience, but you can significantly save on the break/lunch/beer food budget :-).
 Lukas
 
 On Wed, Feb 20, 2008 at 9:10 PM, Ajay Anand [EMAIL PROTECTED] wrote:
 
  The registration page for the Hadoop summit is now up:
  http://developer.yahoo.com/hadoop/summit/
 
  Space is limited, so please sign up early if you are interested in
  attending.
 
  About the summit:
  Yahoo! is hosting the first summit on Apache Hadoop on March 25th in
  Sunnyvale. The summit is sponsored by the Computing Community Consortium
  (CCC) and brings together leaders from the Hadoop developer and user
  communities. The speakers will cover topics in the areas of extensions
  being developed for Hadoop, case studies of applications being built and
  deployed on Hadoop, and a discussion on future directions for the
  platform.
 
  Agenda:
  8:30-8:55 Breakfast
  8:55-9:00 Welcome to Yahoo!  Logistics - Ajay Anand, Yahoo!
  9:00-9:30 Hadoop Overview - Doug Cutting / Eric Baldeschwieler, Yahoo!
  9:30-10:00 Pig - Chris Olston, Yahoo!
  10:00-10:30 JAQL - Kevin Beyer, IBM
  10:30-10:45 Break
  10:45-11:15 DryadLINQ - Michael Isard, Microsoft
  11:15-11:45 Monitoring Hadoop using X-Trace - Andy Konwinski and Matei
  Zaharia, UC Berkeley
  11:45-12:15 Zookeeper - Ben Reed, Yahoo!
  12:15-1:15 Lunch
  1:15-1:45 Hbase - Michael Stack, Powerset
  1:45-2:15 Hbase App - Bryan Duxbury, Rapleaf
  2:15-2:45 Hive - Joydeep Sen Sarma, Facebook
  2:45-3:00 Break
  3:00-3:20 Building Ground Models of Southern California - Steve
  Schossler, David O'Hallaron, Intel / CMU
  3:20-3:40 Online search for engineering design content - Mike Haley,
  Autodesk
  3:40-4:00 Yahoo - Webmap - Arnab Bhattacharjee, Yahoo!
  4:00-4:30 Natural language Processing - Jimmy Lin, U of Maryland /
  Christophe Bisciglia, Google
  4:30-4:45 Break
  4:45-5:30 Panel on future directions
  5:30-7:00 Happy hour
 
  Look forward to seeing you there!
  Ajay
 
  -Original Message-
  From: Bradford Stephens [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, February 20, 2008 9:17 AM
  To: core-user@hadoop.apache.org
  Subject: Re: Hadoop summit / workshop at Yahoo!
 
  Hrm yes, I'd like to make a visit as well :)
 
  On Feb 20, 2008 8:05 AM, C G [EMAIL PROTECTED] wrote:
 Hey All:
  
 Is this going forward?  I'd like to make plans to attend and the
  sooner I can get plane tickets the happier the bean counters will be
  :-).
  
 Thx,
 C G
  
  Ajay Anand wrote:
   
Yahoo plans to host a summit / workshop on Apache Hadoop at our
Sunnyvale campus on March 25th. Given the interest we are seeing
  from
developers in a broad range of organizations, this seems like a
  good
time to get together and brief each other on the progress that is
being
made.
   
   
   
We would like to cover topics in the areas of extensions being
developed
for Hadoop, innovative applications being built and deployed on
Hadoop,
and future extensions to the platform. Some of the speakers who
  have
already committed to present are from organizations such as IBM,
Intel,
Carnegie Mellon University, UC Berkeley, Facebook and Yahoo!, and
we are
actively recruiting other leaders in the space.
   
   
   
If you have an innovative application you would like to talk about,
please let us know. Although there are limitations on the amount of
time
we have, we would love to hear from you. You can contact me at
[EMAIL PROTECTED]
   
   
   
Thanks and looking forward to hearing about your cool apps,
   
Ajay
   
   
   
   
   
   
--
View this message in context:
  http://www.nabble.com/Hadoop-summit---workshop-at-Yahoo%21-tp14889262p15393386.html
Sent from the Hadoop lucene-users mailing list archive at
  Nabble.com.
   
  
  
  
  
  
  
 
 
 
 



Re: URLs contain non-existant domain names in machines.jsp

2008-02-10 Thread Tim Wintle
I agree, this is a really annoying problem - most of the job appears to
work, but unfortunately the reduce stage doesn't normally work.

Interestingly, when hadoop runs on OSX it seems to set the hostname as
the IP (or sets a hostname through zeroconf). It would be useful if we
could use just the IP address, though (especially for dynamic clusters
where machines are being added / removed fairly often).


On Sat, 2008-02-09 at 21:11 +0530, Ben Kucinich wrote:
 I made a small mistake describing my problem. There is no 192.168.1.8.
 There is only one machine, 192.168.101.8. I'll describe my problem
 again.
 
 1. I have set up a single-node cluster on 192.168.101.8. It is an Ubuntu 
 server.
 
 2. There is no entry for 192.168.101.8 in the DNS server. However, the
 hostname is set to be hadoop in this server. But this is only local.
 If I ping hadoop locally, it works. But if I ping hadoop or ping
 hadoop.domain.example.com from another system it doesn't work. From
 another system I have to ping 192.168.101.8. So, I hope I have made it
 clear that hadoop.domain.example.com does not exist in our DNS server.
 
 3. domain.example.com is only a dummy example. Of course the actual
 name is the domain name of our organization.
 
 4. I started hadoop on this server with the command, bin/hadoop
 namenode -format; bin/start-all.sh
 
 5. jps showed all the processes started successfully.
 
 6. Here is my hadoop-site.xml
 
 <configuration>
 
 <property>
   <name>fs.default.name</name>
   <value>192.168.101.8:9000</value>
   <description></description>
 </property>
 
 <property>
   <name>mapred.job.tracker</name>
   <value>192.168.101.8:9001</value>
   <description></description>
 </property>
 
 <property>
   <name>dfs.replication</name>
   <value>1</value>
   <description></description>
 </property>
 
 </configuration>
 
 7. I am running a few ready examples present in
 hadoop-0.15.3-examples.jar, especially, the wordcount one. I am also
 putting some files into the DFS from remote systems, such as,
 192.168.101.100, 192.168.101.101, etc. But these remote systems are
 not slaves.
 
 8. From a remote system, I try to access:-
 http://192.168.101.8:50030/machines.jsp
 
 It showed:-
 
 Name  Host# running tasks FailuresSeconds since heartbeat
 tracker_hadoop.domain.example.com:/127.0.0.1:4545
 hadoop.domain.example.com   0   0   9
 
 Now, when I click on
 tracker_hadoop..domain.example.com:/127.0.0.1:4545 link it takes me to
 http://hadoop.domain.example.com:50060/. But it gives error in the
 browser because of reason mentioned in point 2. I don't want it to use
 the hostname to form those links. I want it to use the IP address,
 192.168.101.8 to form the links. Is it possible?
 
 On Feb 9, 2008 7:49 PM, Amar Kamat [EMAIL PROTECTED] wrote:
  Ben Kucinich wrote:
   I have a Hadoop running on a master node 192.168.1.8. fs.default.name
   is 192.168.101.8:9000 and mapred.job.tracker is 192.168.101.8:9001.
  
  
  Actually the masters are the nodes where the JobTracker and the NameNode
  are running i.e 192.168.101.8 in your case.
  192.168.1.8 would be your client node, the node from where the jobs are
  submitted.
   I am accessing it's web pages on port 50030 from another machine. I
   visited http://192.168.101.8:50030/machines.jsp. It showed:-
  
   Name  Host# running tasks FailuresSeconds since heartbeat
   tracker_hadoop.domain.example.com:/127.0.0.1:4545 
   hadoop.domain.example.com   0   0   9
  
  The tracker name is tracker_<tracker-hostname>:<port>, where the hostname is
  obtained from the DNS nameserver passed by
  'mapred.tasktracker.dns.nameserver' in conf/hadoop-default.xml. So I
  guess in your case hadoop.domain.example.com
  is the name obtained from the DNS nameserver for that node. Can you
  provide more details on the xml parameters you have
  changed in conf directory. Also can you provide more details on how you
  are starting your hadoop.
  Amar
 
   Now, when I click on
   tracker_hadoop..domain.example.com:/127.0.0.1:4545 link it takes me to
   http://hadoop.domain.example.com:50060/. But there is no DNS entry for
   hadoop in our DNS server. So, I get error in browser. hadoop is just
   the locally set name in the master node. From my machine I can't
   access the master node as hadoop. I have to access it as IP address
   192.168.101.8. So, this link fails. Is there a way I can set it so
   that, it doesn't use names but only IP address in forming this link?
  
 
 



RE: Starting up a larger cluster

2008-02-07 Thread Tim Wintle
You can set which nodes are allowed to connect in hadoop-site.xml - it's
useful to be able to connect from nodes that aren't in the slaves file,
so that you can put input data in directly from another machine that's not
part of the cluster, or add extra machines on the fly (just make sure
they're routing correctly first!). You can also run jobs directly from
your workstation (without having to scp your code, ssh in, etc.)

If you look through the shell scripts you should see exactly what the
slaves file is used for. It's fairly easy to modify the scripts to start
a single rack, so you should be able to bring up machines when you need
them.
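
For example (the daemon scripts live in the standard bin/ directory -
check your version), bringing an extra machine into a running cluster
is roughly:

 # on the new machine, with the same hadoop install and config as the cluster
 bin/hadoop-daemon.sh start datanode
 bin/hadoop-daemon.sh start tasktracker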

Tim


On Thu, 2008-02-07 at 12:24 -0800, Jeff Eastman wrote:
 Hi Ben,
 
 I've been down this same path recently and I think I understand your
 issues:
 
 1) Yes, you need the hadoop folder to be in the same location on each
  node. Only the master node actually uses the slaves file, to start up
  DataNode and TaskTracker daemons on those nodes.
  2) If you did not specify any slave nodes on your master node then the
  start-all did not create these processes on any nodes other than the master.
  This node can be accessed, and the dfs written to, from other machines as
  you have done, but there is no replication since there is only one DataNode.
 
 Try running jps on your other nodes to verify this, and access the
 NameNode web page to see what slaves you actually have running. By
 adding your slave nodes to the slaves file on your master and bouncing
 hadoop you should see a big difference in the size of your cluster.
 
 Good luck, it's an adventure,
 Jeff
 
 -Original Message-
 From: Ben Kucinich [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, February 07, 2008 10:52 AM
 To: core-user@hadoop.apache.org
 Subject: Starting up a larger cluster
 
 In the Nutch wiki, I was reading this
 http://wiki.apache.org/hadoop/GettingStartedWithHadoop
 
 I have problems understanding this section:
 
 == Starting up a larger cluster ==
 
  Ensure that the Hadoop package is accessible from the same path on
 all nodes that are to be included in the cluster. If you have
 separated configuration from the install then ensure that the config
 directory is also accessible the same way.
  Populate the slaves file with the nodes to be included in the
 cluster. One node per line.
 
 1) Does the first line mean, that I have to place the hadoop folder in
 exactly the same location on every slave node? For example, if I put
 hadoop home directory in my /usr/local/ in master node it should be
 present in /user/local/ in all the slave nodes as well?
 
 2) I ran start-all.sh in one node (192.168.1.2) with fs.default.name
 as 192.168.1.2:9000 and mapred.job.tracker as 192.168.1.2:9001. So, I
 believe this will play the role of master node. I did not populate the
 slaves file with any slave nodes. But in many other systems,
 192.168.1.3, 192.168.1.4, etc. I made the same settings in
 hadoop-site.xml. So I believe these are slave nodes. Now in the slave
 nodes I ran commands like bin/hadoop -dfs put dir newdir and the
 newdir was created in the DFS. I wonder how the master node allowed
 the slave nodes to put the files even though I did not populate the
 slaves file.
 
 Please help me with these queries since I am new to Hadoop.
 



Re: Namenode fails to replicate file

2008-02-07 Thread Tim Wintle
Doesn't the -setrep command force the replication to be increased
immediately?

./hadoop dfs -setrep [replication] path

(I may have misunderstood)
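
e.g. something like this should bump a file up to three replicas, and
with -w it waits for the replication to actually happen (assuming -w is
available in your version):

 bin/hadoop dfs -setrep -w 3 /path/to/file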


On Thu, 2008-02-07 at 17:05 -0800, Ted Dunning wrote:
 
 Chris Kline reported a problem in early January where a file which had too
 few replicated blocks did not get replicated until a DFS restart.
 
 I just saw a similar issue.  I had a file that had a block with 1 replica (2
 required) that did not get replicated.  I changed the number of required
 replicates, but nothing caused any action.  Changing the number of required
 replicas on other files got them to be replicated.
 
 I eventually copied the file to temp, deleted the original and moved the
 copy back to the original place.  I was also able to read the entire file
 which shows that the problem was not due to slow reporting from a down
 datanode.
 
 This happened just after I had a node failure which was why I was messing
 with replication at all.  Since I was in the process of increasing the
 replication on nearly 10,000 large files, my log files are full of other
 stuff, but I am pretty sure that there is a bug here.
 
 This was on a relatively small cluster with 13 data nodes.
 
 It also brings up a related issue that has come up before in that there are
 times when you may want to increase the number of replicas of a file right
 NOW.  I don't know of any way to force this replication.  Is there such a
 way?