Re: Distributed Agent

2009-04-15 Thread Rasit OZDAS
Take a look at this topic:

http://dsonline.computer.org/portal/site/dsonline/menuitem.244c5fa74f801883f1a516106bbe36ec/index.jsp?&pName=dso_level1_about&path=dsonline/topics/agents&file=about.xml&xsl=generic.xsl&;

2009/4/14 Burak ISIKLI :
> Hello everyone;
> I want to write a distributed agent program. But I can't understand one
> thing: what's the difference between a client-server program and an agent
> program? Please help me...
>
>
>
>
> 
> Burak ISIKLI
> Dumlupinar University
> Electric & Electronic - Computer Engineering
>
> http://burakisikli.wordpress.com
> http://burakisikli.blogspot.com
> 
>
>
>
>



-- 
M. Raşit ÖZDAŞ


Re: Modeling WordCount in a different way

2009-04-15 Thread Pankil Doshi
On Wed, Apr 15, 2009 at 1:26 AM, Sharad Agarwal wrote:

>
>
> > I am trying complex queries on Hadoop, for which I require more than one
> > job to run to get the final result. The results of job one capture a few
> > joins of the query, and I want to pass those results as input to the 2nd
> > job and process them further so that I can get the final results. The
> > queries are such that I can't do all types of joins and filtering in
> > job 1, so I require two jobs.
> >
> > Right now I write the results of job 1 to HDFS and read them for job 2,
> > but that takes unnecessary IO time. So I was looking for some way to
> > store the results of job 1 in memory and use them as input for job 2.
> >
> > Do let me know if you need any more details.
> How big is your input and output data ?

My total data is about 7.8 GB, of which Job 1 uses around 3 GB. The output of
Job 1 is about 1 GB, and I use this output as input to Job 2.


> How many nodes you are using?

Right now, due to lack of resources, I have only 4 nodes, each with a dual-core
processor, 1 GB of RAM, and about 80 GB of hard
disk.

>
> What is your job runtime?

My first job takes a long time after reaching 90% of the reduce phase, as it
does an in-memory merge sort, so that is also a big issue. I will have to
arrange for more memory for my cluster, I suppose.

I will have a look at the JVM reuse feature. Thanks.
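
For reference, a minimal sketch of the two-job chaining described earlier,
using the old org.apache.hadoop.mapred API (identity mapper/reducer classes
stand in for the real join and filter logic, and the paths are placeholders).
The intermediate HDFS directory is exactly the extra write/read being
discussed:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class TwoStageQuery {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);  // job 1 output, job 2 input
    Path output = new Path(args[2]);

    // Job 1: would hold the join logic; identity classes keep the sketch runnable.
    JobConf job1 = new JobConf(TwoStageQuery.class);
    job1.setJobName("stage-1-joins");
    job1.setMapperClass(IdentityMapper.class);
    job1.setReducerClass(IdentityReducer.class);
    FileInputFormat.setInputPaths(job1, input);
    FileOutputFormat.setOutputPath(job1, intermediate);
    JobClient.runJob(job1);  // blocks until job 1 completes

    // Job 2: reads job 1's output back from HDFS and would apply the remaining filters.
    JobConf job2 = new JobConf(TwoStageQuery.class);
    job2.setJobName("stage-2-filters");
    job2.setMapperClass(IdentityMapper.class);
    job2.setReducerClass(IdentityReducer.class);
    FileInputFormat.setInputPaths(job2, intermediate);
    FileOutputFormat.setOutputPath(job2, output);
    JobClient.runJob(job2);
  }
}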



> Pankil


Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-15 Thread Andy Liu
I'm not sure that comparing Hadoop to databases is an apples-to-apples
comparison.  Hadoop is a complete job execution framework, which collocates
the data with the computation.  I suppose DBMS-X and Vertica do that to some
extent, by way of SQL, but you're restricted to that.  If you want to, say,
build a distributed web crawler or a complex data processing pipeline,
Hadoop will schedule those processes across a cluster for you, while Vertica
and DBMS-X deal only with the storage of the data.

The choice of experiments seemed skewed towards DBMS-X and Vertica.  I think
everybody is aware that Map-Reduce is inefficient for handling SQL-like
queries and joins.

It's also worth noting that, I believe, 4 of the 7 authors either currently
work or at one time worked with Vertica (or C-Store, the precursor to Vertica).

Andy

On Tue, Apr 14, 2009 at 10:16 AM, Guilherme Germoglio
wrote:

> (Hadoop is used in the benchmarks)
>
> http://database.cs.brown.edu/sigmod09/
>
> There is currently considerable enthusiasm around the MapReduce
> (MR) paradigm for large-scale data analysis [17]. Although the
> basic control flow of this framework has existed in parallel SQL
> database management systems (DBMS) for over 20 years, some
> have called MR a dramatically new computing model [8, 17]. In
> this paper, we describe and compare both paradigms. Furthermore,
> we evaluate both kinds of systems in terms of performance and de-
> velopment complexity. To this end, we define a benchmark con-
> sisting of a collection of tasks that we have run on an open source
> version of MR as well as on two parallel DBMSs. For each task,
> we measure each system’s performance for various degrees of par-
> allelism on a cluster of 100 nodes. Our results reveal some inter-
> esting trade-offs. Although the process to load data into and tune
> the execution of parallel DBMSs took much longer than the MR
> system, the observed performance of these DBMSs was strikingly
> better. We speculate about the causes of the dramatic performance
> difference and consider implementation concepts that future sys-
> tems should take from both kinds of architectures.
>
>
> --
> Guilherme
>
> msn: guigermog...@hotmail.com
> homepage: http://germoglio.googlepages.com
>


Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-15 Thread Jonathan Gray
I agree with you, Andy.

This seems to be a great look into what Hadoop MapReduce is not good at.

Over in the HBase world, we constantly deal with comparisons like this to
RDBMSs, trying to determine if one is better than the other.  It's a false
choice and completely depends on the use case.

Hadoop is not suited for random access, joins, or dealing with subsets of
your data; i.e., it is not a relational database!  It's designed to
distribute a full scan of a large dataset, placing tasks on the same nodes
as the data they're processing.  The emphasis is on task scheduling, fault
tolerance, and very large datasets; low latency has not been a priority.
There are no "indexes" to speak of; indexing is completely orthogonal to
what Hadoop does, so of course there is an enormous disparity in the cases
where an index makes sense.  Yes, B-Tree indexes are a wonderful breakthrough
in data technology :)

In short, I'm using Hadoop (HDFS and MapReduce) for a broad spectrum of
applications including batch log processing, web crawling, and a number of
machine learning and natural language processing jobs... These may not be
tasks that DBMS-X or Vertica would be good at, if they are even capable of
them, but they are all things that I would include under "Large-Scale Data
Analysis".

It would have been really interesting to see how things like Pig, Hive, and
Cascading would stack up against DBMS-X/Vertica for very complex,
multi-join/sort/etc. queries across a broad spectrum of use cases and
dataset/result sizes.

There is a wide variety of solutions to these problems out there.  It's
important to know the strengths and weaknesses of each, so it's a bit
unfortunate that this paper set the stage as it did.

JG

On Wed, April 15, 2009 6:44 am, Andy Liu wrote:
> Not sure if comparing Hadoop to databases is an apples to apples
> comparison.  Hadoop is a complete job execution framework, which
> collocates the data with the computation.  I suppose DBMS-X and Vertica do
> that to some certain extent, by way of SQL, but you're restricted to that.
> If you want
> to say, build a distributed web crawler, or a complex data processing
> pipeline, Hadoop will schedule those processes across a cluster for you,
> while Vertica and DBMS-X only deal with the storage of the data.
>
> The choice of experiments seemed skewed towards DBMS-X and Vertica.  I
> think everybody is aware that Map-Reduce is inefficient for handling
> SQL-like
> queries and joins.
>
> It's also worth noting that I think 4 out of the 7 authors either
> currently or at one time work with Vertica (or c-store, the precursor to
> Vertica).
>
>
> Andy
>
>
> On Tue, Apr 14, 2009 at 10:16 AM, Guilherme Germoglio
> wrote:
>
>
>> (Hadoop is used in the benchmarks)
>>
>>
>> http://database.cs.brown.edu/sigmod09/
>>
>>
>> There is currently considerable enthusiasm around the MapReduce
>> (MR) paradigm for large-scale data analysis [17]. Although the
>> basic control flow of this framework has existed in parallel SQL
>> database management systems (DBMS) for over 20 years, some have called
>> MR a dramatically new computing model [8, 17]. In
>> this paper, we describe and compare both paradigms. Furthermore, we
>> evaluate both kinds of systems in terms of performance and de- velopment
>> complexity. To this end, we define a benchmark con- sisting of a
>> collection of tasks that we have run on an open source version of MR as
>> well as on two parallel DBMSs. For each task, we measure each system’s
>> performance for various degrees of par- allelism on a cluster of 100
>> nodes. Our results reveal some inter- esting trade-offs. Although the
>> process to load data into and tune the execution of parallel DBMSs took
>> much longer than the MR system, the observed performance of these DBMSs
>> was strikingly better. We speculate about the causes of the dramatic
>> performance difference and consider implementation concepts that future
>> sys- tems should take from both kinds of architectures.
>>
>>
>> --
>> Guilherme
>>
>>
>> msn: guigermog...@hotmail.com
>> homepage: http://germoglio.googlepages.com
>>
>>
>



Re: hadoop-a small doubt

2009-04-15 Thread Pankil Doshi

Hey,
You can do that. That system should have the same username as the cluster
nodes, and of course it should be able to ssh to the name node. It should
also have Hadoop installed, and its hadoop-site.xml should be similar. Then
you can access the namenode, HDFS, etc.

If you just want to see the web interface, that can be done easily from
any system.
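
For example, a small client program on that outside machine can talk to the
cluster through the FileSystem API (just a sketch; the namenode host and port
below are placeholders, and normally fs.default.name would simply come from
the hadoop-site.xml on the client's classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsRoot {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the client at the cluster's namenode (example host and port).
    conf.set("fs.default.name", "hdfs://namenode-host:54310");

    FileSystem fs = FileSystem.get(conf);  // connects to the namenode
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath());
    }
  }
}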

deepya wrote:
> 
> Hi,
> I am SreeDeepya, doing an MTech at IIIT. I am working on a project named
> "cost effective and scalable storage server". I configured a small hadoop
> cluster with only two nodes, one namenode and one datanode. I am new to
> hadoop.
> I have a small doubt.
> 
> Can a system not in the hadoop cluster access the namenode or the
> datanode? If yes, then can you please tell me the necessary
> configuration that has to be done.
> 
> Thanks in advance.
> 
> SreeDeepya
> 

-- 
View this message in context: 
http://www.nabble.com/hadoop-a-small-doubt-tp22764615p23061794.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



How to submit a project to Hadoop/Apache

2009-04-15 Thread Tarandeep Singh
Hi,

Can anyone point me to documentation that explains how to submit a
project to Hadoop as a subproject? I would also appreciate it if someone could
point me to documentation on how to submit a project as an Apache project.

We have a project that is built on Hadoop. It is released to the open source
community under the GPL license, but we are thinking of submitting it as a Hadoop
or Apache project. Any help on how to do this is appreciated.

Thanks,
Tarandeep


Re: Map-Reduce Slow Down

2009-04-15 Thread Mithila Nagendra
The log file : hadoop-mithila-datanode-node19.log.2009-04-14 has the
following in it:

2009-04-14 10:08:11,499 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = node19/127.0.0.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.18.3
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 736250;
compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
/
2009-04-14 10:08:12,915 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 0 time(s).
2009-04-14 10:08:13,925 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 1 time(s).
2009-04-14 10:08:14,935 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 2 time(s).
2009-04-14 10:08:15,945 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 3 time(s).
2009-04-14 10:08:16,955 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 4 time(s).
2009-04-14 10:08:17,965 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 5 time(s).
2009-04-14 10:08:18,975 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 6 time(s).
2009-04-14 10:08:19,985 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 7 time(s).
2009-04-14 10:08:20,995 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 8 time(s).
2009-04-14 10:08:22,005 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 9 time(s).
2009-04-14 10:08:22,008 INFO org.apache.hadoop.ipc.RPC: Server at node18/
192.168.0.18:54310 not available yet, Z...
2009-04-14 10:08:24,025 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 0 time(s).
2009-04-14 10:08:25,035 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 1 time(s).
2009-04-14 10:08:26,045 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 2 time(s).
2009-04-14 10:08:27,055 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 3 time(s).
2009-04-14 10:08:28,065 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 4 time(s).
2009-04-14 10:08:29,075 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 5 time(s).
2009-04-14 10:08:30,085 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 6 time(s).
2009-04-14 10:08:31,095 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 7 time(s).
2009-04-14 10:08:32,105 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 8 time(s).
2009-04-14 10:08:33,115 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 9 time(s).
2009-04-14 10:08:33,116 INFO org.apache.hadoop.ipc.RPC: Server at node18/
192.168.0.18:54310 not available yet, Z...
2009-04-14 10:08:35,135 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 0 time(s).
2009-04-14 10:08:36,145 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 1 time(s).
2009-04-14 10:08:37,155 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: node18/192.168.0.18:54310. Already tried 2 time(s).


Hmm, I still can't figure it out...

Mithila


On Tue, Apr 14, 2009 at 10:22 PM, Mithila Nagendra  wrote:

> Also, Would the way the port is accessed change if all these node are
> connected through a gateway? I mean in the hadoop-site.xml file? The Ubuntu
> systems we worked with earlier didnt have a gateway.
> Mithila
>
> On Tue, Apr 14, 2009 at 9:48 PM, Mithila Nagendra wrote:
>
>> Aaron: Which log file do I look into - there are alot of them. Here s what
>> the error looks like:
>> [mith...@node19:~]$ cd hadoop
>> [mith...@node19:~/hadoop]$ bin/hadoop dfs -ls
>> 09/04/14 10:09:29 INFO ipc.Client: Retrying connect to server: node18/
>> 192.168.0.18:54310. Already tried 0 time(s).
>> 09/04/14 10:09:30 INFO ipc.Client: Retrying connect to server: node18/
>> 192.168.0.18:54310. Already tried 1 time(s).
>> 09/04/14 10:09:31 INFO ipc.Client: Retrying connect to server: node18/
>> 192.168.0.18:54310. Already tried 2 time(s).
>> 09/04/14 10:09:32 INFO ipc.Client: Retrying connect to s

Re: Map-Reduce Slow Down

2009-04-15 Thread Mithila Nagendra
The log file runs into thousands of lines, with the same message being
displayed every time.

On Wed, Apr 15, 2009 at 8:10 PM, Mithila Nagendra  wrote:

> The log file : hadoop-mithila-datanode-node19.log.2009-04-14 has the
> following in it:
>
> 2009-04-14 10:08:11,499 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
> /
> STARTUP_MSG: Starting DataNode
> STARTUP_MSG:   host = node19/127.0.0.1
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.18.3
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
> 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
> /
> 2009-04-14 10:08:12,915 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> 2009-04-14 10:08:13,925 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> 2009-04-14 10:08:14,935 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> 2009-04-14 10:08:15,945 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 3 time(s).
> 2009-04-14 10:08:16,955 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 4 time(s).
> 2009-04-14 10:08:17,965 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 5 time(s).
> 2009-04-14 10:08:18,975 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 6 time(s).
> 2009-04-14 10:08:19,985 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 7 time(s).
> 2009-04-14 10:08:20,995 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 8 time(s).
> 2009-04-14 10:08:22,005 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 9 time(s).
> 2009-04-14 10:08:22,008 INFO org.apache.hadoop.ipc.RPC: Server at node18/
> 192.168.0.18:54310 not available yet, Z...
> 2009-04-14 10:08:24,025 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> 2009-04-14 10:08:25,035 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> 2009-04-14 10:08:26,045 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> 2009-04-14 10:08:27,055 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 3 time(s).
> 2009-04-14 10:08:28,065 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 4 time(s).
> 2009-04-14 10:08:29,075 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 5 time(s).
> 2009-04-14 10:08:30,085 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 6 time(s).
> 2009-04-14 10:08:31,095 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 7 time(s).
> 2009-04-14 10:08:32,105 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 8 time(s).
> 2009-04-14 10:08:33,115 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 9 time(s).
> 2009-04-14 10:08:33,116 INFO org.apache.hadoop.ipc.RPC: Server at node18/
> 192.168.0.18:54310 not available yet, Z...
> 2009-04-14 10:08:35,135 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> 2009-04-14 10:08:36,145 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> 2009-04-14 10:08:37,155 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 2 time(s).
>
>
> Hmmm I still cant figure it out..
>
> Mithila
>
>
> On Tue, Apr 14, 2009 at 10:22 PM, Mithila Nagendra wrote:
>
>> Also, Would the way the port is accessed change if all these node are
>> connected through a gateway? I mean in the hadoop-site.xml file? The Ubuntu
>> systems we worked with earlier didnt have a gateway.
>> Mithila
>>
>> On Tue, Apr 14, 2009 at 9:48 PM, Mithila Nagendra wrote:
>>
>>> Aaron: Which log file do I look into - there are alot of them. Here s
>>> what the error looks like:
>>> [mith...@node19:~]$ cd hadoop
>>> [mith...@node19:~/hadoop]$ bin/hadoop dfs -ls
>>> 09/04/14 10:09:29 INFO ipc.Client: Retrying connect to server: node18/
>>> 192.168.0.18:54310. Already tried 0 time(s).
>>>

Re: Map-Reduce Slow Down

2009-04-15 Thread Ravi Phulari
Looks like your NameNode is down.
Verify that the Hadoop processes are running (jps should show you all the
running Java processes).
If your Hadoop processes are running, try restarting them.
I guess this problem is due to your fsimage not being correct.
You might have to format your namenode.
Hope this helps.

Thanks,
--
Ravi


On 4/15/09 10:15 AM, "Mithila Nagendra"  wrote:

The log file runs into thousands of line with the same message being
displayed every time.

On Wed, Apr 15, 2009 at 8:10 PM, Mithila Nagendra  wrote:

> The log file : hadoop-mithila-datanode-node19.log.2009-04-14 has the
> following in it:
>
> 2009-04-14 10:08:11,499 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
> /
> STARTUP_MSG: Starting DataNode
> STARTUP_MSG:   host = node19/127.0.0.1
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.18.3
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
> 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
> /
> 2009-04-14 10:08:12,915 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> 2009-04-14 10:08:13,925 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> 2009-04-14 10:08:14,935 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> 2009-04-14 10:08:15,945 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 3 time(s).
> 2009-04-14 10:08:16,955 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 4 time(s).
> 2009-04-14 10:08:17,965 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 5 time(s).
> 2009-04-14 10:08:18,975 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 6 time(s).
> 2009-04-14 10:08:19,985 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 7 time(s).
> 2009-04-14 10:08:20,995 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 8 time(s).
> 2009-04-14 10:08:22,005 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 9 time(s).
> 2009-04-14 10:08:22,008 INFO org.apache.hadoop.ipc.RPC: Server at node18/
> 192.168.0.18:54310 not available yet, Z...
> 2009-04-14 10:08:24,025 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> 2009-04-14 10:08:25,035 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> 2009-04-14 10:08:26,045 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> 2009-04-14 10:08:27,055 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 3 time(s).
> 2009-04-14 10:08:28,065 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 4 time(s).
> 2009-04-14 10:08:29,075 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 5 time(s).
> 2009-04-14 10:08:30,085 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 6 time(s).
> 2009-04-14 10:08:31,095 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 7 time(s).
> 2009-04-14 10:08:32,105 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 8 time(s).
> 2009-04-14 10:08:33,115 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 9 time(s).
> 2009-04-14 10:08:33,116 INFO org.apache.hadoop.ipc.RPC: Server at node18/
> 192.168.0.18:54310 not available yet, Z...
> 2009-04-14 10:08:35,135 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> 2009-04-14 10:08:36,145 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> 2009-04-14 10:08:37,155 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 2 time(s).
>
>
> Hmmm I still cant figure it out..
>
> Mithila
>
>
> On Tue, Apr 14, 2009 at 10:22 PM, Mithila Nagendra wrote:
>
>> Also, Would the way the port is accessed change if all these node are
>> connected through a gateway? I mean in the hadoop-site.xml file? The Ubuntu
>> systems we worked with earlier didnt have a gateway.
>> Mithi

Re: How to submit a project to Hadoop/Apache

2009-04-15 Thread Otis Gospodnetic

This is how things get into Apache Incubator: http://incubator.apache.org/
But the rules are, I believe, that you can skip the incubator and go straight 
under a project's wing (e.g. Hadoop) if the project PMC approves.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Tarandeep Singh 
> To: core-user@hadoop.apache.org
> Sent: Wednesday, April 15, 2009 1:08:38 PM
> Subject: How to submit a project to Hadoop/Apache
> 
> Hi,
> 
> Can anyone point me to a documentation which explains how to submit a
> project to Hadoop as a subproject? Also, I will appreciate if someone points
> me to the documentation on how to submit a project as Apache project.
> 
> We have a project that is built on Hadoop. It is released to the open source
> community under GPL license but we are thinking of submitting it as a Hadoop
> or Apache project. Any help on how to do this is appreciated.
> 
> Thanks,
> Tarandeep



Re: Using 3rd party Api in Map class

2009-04-15 Thread Aaron Kimball
That certainly works, though if you plan to upgrade the underlying library,
you'll find that copying files with the correct versions into
$HADOOP_HOME/lib rapidly gets tedious, and subtle mistakes (e.g., forgetting
one machine) can lead to frustration.

When you consider the fact that you're using a Hadoop cluster to process and
transfer around GBs of data on the low end, the difference between a 10 MB
and a 20 MB job jar starts to look meaningless. Putting other jars in a lib/
directory inside your job jar keeps the version consistent and doesn't
clutter up a shared directory on your cluster (assuming there are other
users).
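
As a rough sketch (assuming an Ant build; the target, directory, and jar
names here are made up), packing the third-party jars into a lib/ directory
inside the job jar looks something like this:

<!-- Hypothetical Ant target: bundles compiled classes plus dependency jars
     into one job jar, with the dependencies under lib/ so they end up on
     the task classpath when the job jar is unpacked. -->
<target name="jobjar" depends="compile">
  <jar destfile="build/myjob.jar">
    <fileset dir="build/classes"/>
    <zipfileset dir="thirdparty" includes="*.jar" prefix="lib"/>
  </jar>
</target>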

- Aaron

On Tue, Apr 14, 2009 at 11:15 AM, Farhan Husain  wrote:

> Hello,
>
> I got another solution for this. I just pasted all the required jar files
> in
> lib folder of each hadoop node. In this way the job jar is not too big and
> will require less time to distribute in the cluster.
>
> Thanks,
> Farhan
>
> On Mon, Apr 13, 2009 at 7:22 PM, Nick Cen  wrote:
>
> > create a directroy call 'lib' in your project's root dir, then put all
> the
> > 3rd party jar in it.
> >
> > 2009/4/14 Farhan Husain 
> >
> > > Hello,
> > >
> > > I am trying to use Pellet library for some OWL inferencing in my mapper
> > > class. But I can't find a way to bundle the library jar files in my job
> > jar
> > > file. I am exporting my project as a jar file from Eclipse IDE. Will it
> > > work
> > > if I create the jar manually and include all the jar files Pellet
> library
> > > has? Is there any simpler way to include 3rd party library jar files in
> a
> > > hadoop job jar? Without being able to include the library jars I am
> > getting
> > > ClassNotFoundException.
> > >
> > > Thanks,
> > > Farhan
> > >
> >
> >
> >
> > --
> > http://daily.appspot.com/food/
> >
>


Re: Map-Reduce Slow Down

2009-04-15 Thread Aaron Kimball
Hi,

I wrote a blog post a while back about connecting nodes via a gateway. See
http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/

This assumes that the client is outside the gateway and all
datanodes/namenode are inside, but the same principles apply. You'll just
need to set up ssh tunnels from every datanode to the namenode.

- Aaron

On Wed, Apr 15, 2009 at 10:19 AM, Ravi Phulari wrote:

> Looks like your NameNode is down .
> Verify if hadoop process are running (   jps should show you all java
> running process).
> If your hadoop process are running try restarting your hadoop process .
> I guess this problem is due to your fsimage not being correct .
> You might have to format your namenode.
> Hope this helps.
>
> Thanks,
> --
> Ravi
>
>
> On 4/15/09 10:15 AM, "Mithila Nagendra"  wrote:
>
> The log file runs into thousands of line with the same message being
> displayed every time.
>
> On Wed, Apr 15, 2009 at 8:10 PM, Mithila Nagendra 
> wrote:
>
> > The log file : hadoop-mithila-datanode-node19.log.2009-04-14 has the
> > following in it:
> >
> > 2009-04-14 10:08:11,499 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
> > /
> > STARTUP_MSG: Starting DataNode
> > STARTUP_MSG:   host = node19/127.0.0.1
> > STARTUP_MSG:   args = []
> > STARTUP_MSG:   version = 0.18.3
> > STARTUP_MSG:   build =
> > https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
> > 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
> > /
> > 2009-04-14 10:08:12,915 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> > 2009-04-14 10:08:13,925 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> > 2009-04-14 10:08:14,935 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> > 2009-04-14 10:08:15,945 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 3 time(s).
> > 2009-04-14 10:08:16,955 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 4 time(s).
> > 2009-04-14 10:08:17,965 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 5 time(s).
> > 2009-04-14 10:08:18,975 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 6 time(s).
> > 2009-04-14 10:08:19,985 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 7 time(s).
> > 2009-04-14 10:08:20,995 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 8 time(s).
> > 2009-04-14 10:08:22,005 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 9 time(s).
> > 2009-04-14 10:08:22,008 INFO org.apache.hadoop.ipc.RPC: Server at node18/
> > 192.168.0.18:54310 not available yet, Z...
> > 2009-04-14 10:08:24,025 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> > 2009-04-14 10:08:25,035 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> > 2009-04-14 10:08:26,045 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> > 2009-04-14 10:08:27,055 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 3 time(s).
> > 2009-04-14 10:08:28,065 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 4 time(s).
> > 2009-04-14 10:08:29,075 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 5 time(s).
> > 2009-04-14 10:08:30,085 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 6 time(s).
> > 2009-04-14 10:08:31,095 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 7 time(s).
> > 2009-04-14 10:08:32,105 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 8 time(s).
> > 2009-04-14 10:08:33,115 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 9 time(s).
> > 2009-04-14 10:08:33,116 INFO org.apache.hadoop.ipc.RPC: Server at node18/
> > 192.168.0.18:54310 not available yet, Z...
> > 2009-04-14 10:08:35,135 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: node18/192.168.0.18:54310. Already tried 0 time(

Re: How to submit a project to Hadoop/Apache

2009-04-15 Thread Aaron Kimball
Tarandeep,

You might want to start by releasing your project as a "contrib" module for
Hadoop. The overhead there is much lower -- just get it compiling in the
contrib/ directory, file a JIRA ticket on Hadoop Core, and attach your patch
:)

- Aaron

On Wed, Apr 15, 2009 at 10:29 AM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

>
> This is how things get into Apache Incubator: http://incubator.apache.org/
> But the rules are, I believe, that you can skip the incubator and go
> straight under a project's wing (e.g. Hadoop) if the project PMC approves.
>
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> - Original Message 
> > From: Tarandeep Singh 
> > To: core-user@hadoop.apache.org
> > Sent: Wednesday, April 15, 2009 1:08:38 PM
> > Subject: How to submit a project to Hadoop/Apache
> >
> > Hi,
> >
> > Can anyone point me to a documentation which explains how to submit a
> > project to Hadoop as a subproject? Also, I will appreciate if someone
> points
> > me to the documentation on how to submit a project as Apache project.
> >
> > We have a project that is built on Hadoop. It is released to the open
> source
> > community under GPL license but we are thinking of submitting it as a
> Hadoop
> > or Apache project. Any help on how to do this is appreciated.
> >
> > Thanks,
> > Tarandeep
>
>


Re: Map-Reduce Slow Down

2009-04-15 Thread Mithila Nagendra
Hi Aaron
I will look into that thanks!

I spoke to the admin who oversees the cluster. He said that the gateway
comes into the picture only when one of the nodes communicates with a node
outside of the cluster. But in my case the communication is carried out
between nodes that all belong to the same cluster.

Mithila

On Wed, Apr 15, 2009 at 8:59 PM, Aaron Kimball  wrote:

> Hi,
>
> I wrote a blog post a while back about connecting nodes via a gateway. See
> http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
>
> This assumes that the client is outside the gateway and all
> datanodes/namenode are inside, but the same principles apply. You'll just
> need to set up ssh tunnels from every datanode to the namenode.
>
> - Aaron
>
>
> On Wed, Apr 15, 2009 at 10:19 AM, Ravi Phulari wrote:
>
>> Looks like your NameNode is down .
>> Verify if hadoop process are running (   jps should show you all java
>> running process).
>> If your hadoop process are running try restarting your hadoop process .
>> I guess this problem is due to your fsimage not being correct .
>> You might have to format your namenode.
>> Hope this helps.
>>
>> Thanks,
>> --
>> Ravi
>>
>>
>> On 4/15/09 10:15 AM, "Mithila Nagendra"  wrote:
>>
>> The log file runs into thousands of line with the same message being
>> displayed every time.
>>
>> On Wed, Apr 15, 2009 at 8:10 PM, Mithila Nagendra 
>> wrote:
>>
>> > The log file : hadoop-mithila-datanode-node19.log.2009-04-14 has the
>> > following in it:
>> >
>> > 2009-04-14 10:08:11,499 INFO org.apache.hadoop.dfs.DataNode:
>> STARTUP_MSG:
>> > /
>> > STARTUP_MSG: Starting DataNode
>> > STARTUP_MSG:   host = node19/127.0.0.1
>> > STARTUP_MSG:   args = []
>> > STARTUP_MSG:   version = 0.18.3
>> > STARTUP_MSG:   build =
>> > https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
>> > 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
>> > /
>> > 2009-04-14 10:08:12,915 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 0 time(s).
>> > 2009-04-14 10:08:13,925 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 1 time(s).
>> > 2009-04-14 10:08:14,935 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 2 time(s).
>> > 2009-04-14 10:08:15,945 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 3 time(s).
>> > 2009-04-14 10:08:16,955 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 4 time(s).
>> > 2009-04-14 10:08:17,965 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 5 time(s).
>> > 2009-04-14 10:08:18,975 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 6 time(s).
>> > 2009-04-14 10:08:19,985 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 7 time(s).
>> > 2009-04-14 10:08:20,995 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 8 time(s).
>> > 2009-04-14 10:08:22,005 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 9 time(s).
>> > 2009-04-14 10:08:22,008 INFO org.apache.hadoop.ipc.RPC: Server at
>> node18/
>> > 192.168.0.18:54310 not available yet, Z...
>> > 2009-04-14 10:08:24,025 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 0 time(s).
>> > 2009-04-14 10:08:25,035 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 1 time(s).
>> > 2009-04-14 10:08:26,045 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 2 time(s).
>> > 2009-04-14 10:08:27,055 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 3 time(s).
>> > 2009-04-14 10:08:28,065 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 4 time(s).
>> > 2009-04-14 10:08:29,075 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 5 time(s).
>> > 2009-04-14 10:08:30,085 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 6 time(s).
>> > 2009-04-14 10:08:31,095 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect
>> > to server: node18/192.168.0.18:54310. Already tried 7 time(s).
>> > 2009-04-14 10:08:32,105 INFO org.apache.hadoop.ipc.Client: Retryi

Re: How to submit a project to Hadoop/Apache

2009-04-15 Thread Tarandeep Singh
Thanks Aaron... yeah it sounds like a much easier approach :)

On Wed, Apr 15, 2009 at 11:00 AM, Aaron Kimball  wrote:

> Tarandeep,
>
> You might want to start by releasing your project as a "contrib" module for
> Hadoop. The overhead there is much easier -- just get it compiliing in the
> contrib/ directory, file a JIRA ticket on Hadoop Core, and attach your
> patch
> :)
>
> - Aaron
>
> On Wed, Apr 15, 2009 at 10:29 AM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
>
> >
> > This is how things get into Apache Incubator:
> http://incubator.apache.org/
> > But the rules are, I believe, that you can skip the incubator and go
> > straight under a project's wing (e.g. Hadoop) if the project PMC
> approves.
> >
> >  Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > - Original Message 
> > > From: Tarandeep Singh 
> > > To: core-user@hadoop.apache.org
> > > Sent: Wednesday, April 15, 2009 1:08:38 PM
> > > Subject: How to submit a project to Hadoop/Apache
> > >
> > > Hi,
> > >
> > > Can anyone point me to a documentation which explains how to submit a
> > > project to Hadoop as a subproject? Also, I will appreciate if someone
> > points
> > > me to the documentation on how to submit a project as Apache project.
> > >
> > > We have a project that is built on Hadoop. It is released to the open
> > source
> > > community under GPL license but we are thinking of submitting it as a
> > Hadoop
> > > or Apache project. Any help on how to do this is appreciated.
> > >
> > > Thanks,
> > > Tarandeep
> >
> >
>


Datanode Setup

2009-04-15 Thread jpe30

I'm setting up a Hadoop cluster and I have the name node and job tracker up
and running.  However, I cannot get any of my datanodes or tasktrackers to
start.  Here is my hadoop-site.xml file...









<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/h_temp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/data</value>
</property>

<property>
  <name>fs.default.name</name>
  <value>192.168.1.10:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
  <final>true</final>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>192.168.1.10:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>dfs.replication</name>
  <value>0</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

</configuration>

and here is the error I'm getting...

2009-04-15 14:00:48,208 INFO org.apache.hadoop.dfs.DataNode:
STARTUP_MSG:
/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = java.net.UnknownHostException: myhost: myhost
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.18.3
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 736250;
compiled by 'ndaley' on Thu Jan 22 23:12:0$
/
2009-04-15 14:00:48,355 ERROR org.apache.hadoop.dfs.DataNode:
java.net.UnknownHostException: myhost: myhost
at java.net.InetAddress.getLocalHost(InetAddress.java:1353)
at org.apache.hadoop.net.DNS.getDefaultHost(DNS.java:185)
at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:249)
at org.apache.hadoop.dfs.DataNode.<init>(DataNode.java:223)
at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:3071)
at
org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:3026)
at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:3034)
at org.apache.hadoop.dfs.DataNode.main(DataNode.java:3156)

2009-04-15 14:00:48,356 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down DataNode at java.net.UnknownHostException:
myhost: myhost
/

-- 
View this message in context: 
http://www.nabble.com/Datanode-Setup-tp23064660p23064660.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Datanode Setup

2009-04-15 Thread Mithila Nagendra
Hi,
The replication factor has to be set to 1. Also, for your dfs and job tracker
configuration you should insert the name of the node rather than the IP
address.

For instance:
<value>192.168.1.10:54310</value>

can be:

<value>master:54310</value>

The nodes can be renamed by editing the hosts file in the /etc folder.
It should look like the following:

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1   localhost.localdomain   localhost   node01
192.168.0.1 node01
192.168.0.2 node02
192.168.0.3 node03
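
Putting that together, the relevant hadoop-site.xml entries might look
something like this (the hostnames and ports are just examples matching the
hosts file above):

<property>
  <name>fs.default.name</name>
  <value>hdfs://node01:54310</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>node01:54311</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>  <!-- 0 is not a usable replication factor -->
</property>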

Hope this helps
Mithila

On Wed, Apr 15, 2009 at 9:40 PM, jpe30  wrote:

>
> I'm setting up a Hadoop cluster and I have the name node and job tracker up
> and running.  However, I cannot get any of my datanodes or tasktrackers to
> start.  Here is my hadoop-site.xml file...
>
>
>
> 
> 
>
> 
>
> 
>
> 
>  hadoop.tmp.dir
>  /home/hadoop/h_temp
>  A base for other temporary directories.
> 
>
> 
>  dfs.data.dir
>  /home/hadoop/data
> 
>
> 
>  fs.default.name
>   192.168.1.10:54310
>  The name of the default file system.  A URI whose
>   scheme and authority determine the FileSystem implementation.  The
>  uri's scheme determines the config property (fs.SCHEME.impl) naming
>  the FileSystem implementation class.  The uri's authority is used to
>   determine the host, port, etc. for a filesystem.
>  true
> 
>
> 
>  mapred.job.tracker
>   192.168.1.10:54311
>  The host and port that the MapReduce job tracker runs
>   at.  If "local", then jobs are run in-process as a single map
>  and reduce task.
>   
> 
>
> 
>  dfs.replication
>  0
>   Default block replication.
>   The actual number of replications can be specified when the file is
> created.
>  The default is used if replication is not specified in create time.
>   
> 
>
> 
>
>
> and here is the error I'm getting...
>
>
>
>
> 2009-04-15 14:00:48,208 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
> /
> STARTUP_MSG: Starting DataNode
> STARTUP_MSG:   host = java.net.UnknownHostException: myhost: myhost
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.18.3
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
> 736250;
> compiled by 'ndaley' on Thu Jan 22 23:12:0$
> /
> 2009-04-15 14:00:48,355 ERROR org.apache.hadoop.dfs.DataNode:
> java.net.UnknownHostException: myhost: myhost
>at java.net.InetAddress.getLocalHost(InetAddress.java:1353)
>at org.apache.hadoop.net.DNS.getDefaultHost(DNS.java:185)
>at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:249)
> at org.apache.hadoop.dfs.DataNode.(DataNode.java:223)
> at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:3071)
>at
> org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:3026)
>at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:3034)
>at org.apache.hadoop.dfs.DataNode.main(DataNode.java:3156)
>
> 2009-04-15 14:00:48,356 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down DataNode at java.net.UnknownHostException:
> myhost: myhost
> /
>
> --
> View this message in context:
> http://www.nabble.com/Datanode-Setup-tp23064660p23064660.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: Datanode Setup

2009-04-15 Thread jpe30

That helps a lot actually.  I will try setting up my hosts file tomorrow and
make the other changes you suggested.

Thanks!



Mithila Nagendra wrote:
> 
> Hi,
> The replication factor has to be set to 1. Also for you dfs and job
> tracker
> configuration you should insert the name of the node rather than the i.p
> address.
> 
> For instance:
>  192.168.1.10:54310
> 
> can be:
> 
>  master:54310
> 
> The nodes can be renamed by renaming them in the hosts files in /etc
> folder.
> It should look like the following:
> 
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> 127.0.0.1   localhost.localdomain   localhost   node01
> 192.168.0.1 node01
> 192.168.0.2 node02
> 192.168.0.3 node03
> 
> Hope this helps
> Mithila
> 
> On Wed, Apr 15, 2009 at 9:40 PM, jpe30  wrote:
> 
>>
>> I'm setting up a Hadoop cluster and I have the name node and job tracker
>> up
>> and running.  However, I cannot get any of my datanodes or tasktrackers
>> to
>> start.  Here is my hadoop-site.xml file...
>>
>>
>>
>> 
>> 
>>
>> 
>>
>> 
>>
>> 
>>  hadoop.tmp.dir
>>  /home/hadoop/h_temp
>>  A base for other temporary directories.
>> 
>>
>> 
>>  dfs.data.dir
>>  /home/hadoop/data
>> 
>>
>> 
>>  fs.default.name
>>   192.168.1.10:54310
>>  The name of the default file system.  A URI whose
>>   scheme and authority determine the FileSystem implementation.  The
>>  uri's scheme determines the config property (fs.SCHEME.impl) naming
>>  the FileSystem implementation class.  The uri's authority is used to
>>   determine the host, port, etc. for a filesystem.
>>  true
>> 
>>
>> 
>>  mapred.job.tracker
>>   192.168.1.10:54311
>>  The host and port that the MapReduce job tracker runs
>>   at.  If "local", then jobs are run in-process as a single map
>>  and reduce task.
>>   
>> 
>>
>> 
>>  dfs.replication
>>  0
>>   Default block replication.
>>   The actual number of replications can be specified when the file is
>> created.
>>  The default is used if replication is not specified in create time.
>>   
>> 
>>
>> 
>>
>>
>> and here is the error I'm getting...
>>
>>
>>
>>
>> 2009-04-15 14:00:48,208 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
>> /
>> STARTUP_MSG: Starting DataNode
>> STARTUP_MSG:   host = java.net.UnknownHostException: myhost: myhost
>> STARTUP_MSG:   args = []
>> STARTUP_MSG:   version = 0.18.3
>> STARTUP_MSG:   build =
>> https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
>> 736250;
>> compiled by 'ndaley' on Thu Jan 22 23:12:0$
>> /
>> 2009-04-15 14:00:48,355 ERROR org.apache.hadoop.dfs.DataNode:
>> java.net.UnknownHostException: myhost: myhost
>>at java.net.InetAddress.getLocalHost(InetAddress.java:1353)
>>at org.apache.hadoop.net.DNS.getDefaultHost(DNS.java:185)
>>at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:249)
>> at org.apache.hadoop.dfs.DataNode.(DataNode.java:223)
>> at
>> org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:3071)
>>at
>> org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:3026)
>>at
>> org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:3034)
>>at org.apache.hadoop.dfs.DataNode.main(DataNode.java:3156)
>>
>> 2009-04-15 14:00:48,356 INFO org.apache.hadoop.dfs.DataNode:
>> SHUTDOWN_MSG:
>> /
>> SHUTDOWN_MSG: Shutting down DataNode at java.net.UnknownHostException:
>> myhost: myhost
>> /
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Datanode-Setup-tp23064660p23064660.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Datanode-Setup-tp23064660p23065220.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Extending ClusterMapReduceTestCase

2009-04-15 Thread czero

I got it all up and working, thanks for your help - it was an issue with me
not actually setting the log.dir system property before the cluster startup. 
Can't believe I missed that one :)
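
For anyone who hits the same thing, here is roughly what the fix looks like
(a sketch only; the log directory path is an arbitrary example, and it assumes
the getFileSystem() accessor from ClusterMapReduceTestCase):

import org.apache.hadoop.mapred.ClusterMapReduceTestCase;

public class TestDoop extends ClusterMapReduceTestCase {

  @Override
  protected void setUp() throws Exception {
    // Must run before the mini cluster starts, otherwise JobHistory.init()
    // NPEs on the missing hadoop.log.dir system property.
    System.setProperty("hadoop.log.dir", "/tmp/hadoop-test-logs");
    super.setUp();  // starts the mini DFS and MapReduce clusters
  }

  public void testClusterIsUp() throws Exception {
    assertNotNull(getFileSystem());  // HDFS handle from the mini cluster
  }
}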

As a side note (which you might already be aware of), the example class
you're using in Chapter 7 (PiEstimator) has changed in Hadoop 0.19.1 such
that the example code no longer works.  The new one is a little trickier to
test.

I'm looking forward to seeing the rest of the book.  And that delegate test
harness when it's available :)


jason hadoop wrote:
> 
> btw that stack trace looks like the hadoop.log.dir issue
> This is the code out of the init method, in JobHistory
> 
> LOG_DIR = conf.get("hadoop.job.history.location" ,
> "file:///" + new File(
> System.getProperty("hadoop.log.dir")).getAbsolutePath()
> + File.separator + "history");
> 
> looks like the hadoop.log.dir system property is not set, note: not
> environment variable, not configuration parameter, but system property.
> 
> Try a *System.setProperty("hadoop.log.dir","/tmp");* in your code before
> you
> initialize the virtual cluster.
> 
> 
> 
> On Tue, Apr 14, 2009 at 5:56 PM, jason hadoop
> wrote:
> 
>>
>> I have actually built an add on class on top of ClusterMapReduceDelegate
>> that just runs a virtual cluster that persists for running tests on, it
>> is
>> very nice, as you can interact via the web ui.
>> Especially since the virtual cluster stuff is somewhat flaky under
>> windows.
>>
>> I have a question in to the editor about the sample code.
>>
>>
>>
>> On Tue, Apr 14, 2009 at 8:16 AM, czero  wrote:
>>
>>>
>>> I actually picked up the alpha .PDF's of your book, great job.
>>>
>>> I'm following the example in chapter 7 to the letter now and am still
>>> getting the same problem.  2 quick questions (and thanks for your time
>>> in
>>> advance)...
>>>
>>> Is the ClusterMapReduceDelegate class available anywhere yet?
>>>
>>> Adding ~/hadoop/libs/*.jar in it's entirety to my pom.xml is a lot of
>>> bulk,
>>> so I've avoided it until now.  Are there any lib's in there that are
>>> absolutely necessary for this test to work?
>>>
>>> Thanks again,
>>> bc
>>>
>>>
>>>
>>> jason hadoop wrote:
>>> >
>>> > I have a nice variant of this in the ch7 examples section of my book,
>>> > including a standalone wrapper around the virtual cluster for allowing
>>> > multiple test instances to share the virtual cluster - and allow an
>>> easier
>>> > time to poke around with the input and output datasets.
>>> >
>>> > It even works decently under windows - my editor insisting on word to
>>> > recent
>>> > for crossover.
>>> >
>>> > On Mon, Apr 13, 2009 at 9:16 AM, czero  wrote:
>>> >
>>> >>
>>> >> Sry, I forgot to include the not-IntelliJ-console output :)
>>> >>
>>> >> 09/04/13 12:07:14 ERROR mapred.MiniMRCluster: Job tracker crashed
>>> >> java.lang.NullPointerException
>>> >>at java.io.File.<init>(File.java:222)
>>> >>at
>>> org.apache.hadoop.mapred.JobHistory.init(JobHistory.java:143)
>>> >>at
>>> >> org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1110)
>>> >>at
>>> >> org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:143)
>>> >>at
>>> >>
>>> >>
>>> org.apache.hadoop.mapred.MiniMRCluster$JobTrackerRunner.run(MiniMRCluster.java:96)
>>> >>at java.lang.Thread.run(Thread.java:637)
>>> >>
>>> >> I managed to pick up the chapter in the Hadoop Book that Jason
>>> mentions
>>> >> that
>>> >> deals with Unit testing (great chapter btw) and it looks like
>>> everything
>>> >> is
>>> >> in order.  He points out that this error is typically caused by a bad
>>> >> hadoop.log.dir or missing log4j.properties, but I verified that my
>>> dir
>>> is
>>> >> ok
>>> >> and my hadoop-0.19.1-core.jar has the log4j.properties in it.
>>> >>
>>> >> I also tried running the same test with hadoop-core/test 0.19.0 -
>>> same
>>> >> thing.
>>> >>
>>> >> Thanks again,
>>> >>
>>> >> bc
>>> >>
>>> >>
>>> >> czero wrote:
>>> >> >
>>> >> > Hey all,
>>> >> >
>>> >> > I'm also extending the ClusterMapReduceTestCase and having a bit of
>>> >> > trouble as well.
>>> >> >
>>> >> > Currently I'm getting :
>>> >> >
>>> >> > Starting DataNode 0 with dfs.data.dir:
>>> >> > build/test/data/dfs/data/data1,build/test/data/dfs/data/data2
>>> >> > Starting DataNode 1 with dfs.data.dir:
>>> >> > build/test/data/dfs/data/data3,build/test/data/dfs/data/data4
>>> >> > Generating rack names for tasktrackers
>>> >> > Generating host names for tasktrackers
>>> >> >
>>> >> > And then nothing... just spins on that forever.  Any ideas?
>>> >> >
>>> >> > I have all the jetty and jetty-ext libs in the classpath and I set
>>> the
>>> >> > hadoop.log.dir and the SAX parser correctly.
>>> >> >
>>> >> > This is all I have for my test class so far, I'm not even doing
>>> >> anything
>>> >> > yet:
>>> >> >
>>> >> > public class TestDoop extends ClusterMapReduceTestCase {
>>> >> >
>>> >> > @Test
>>> >> > public void testDoop() throws Exce

RE: reduce task specific jvm arg

2009-04-15 Thread Koji Noguchi
This sounds like a reasonable request.

Created 
https://issues.apache.org/jira/browse/HADOOP-5684

On our clusters, sometimes users want thin mappers and large reducers.
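
For reference, the closest existing knob is mapred.child.java.opts, which
applies to map and reduce task JVMs alike (the heap size below is just an
example), hence the request for a reduce-only variant:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
  <!-- Applies to both map and reduce child JVMs today. -->
</property>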

Koji

-Original Message-
From: Jun Rao [mailto:jun...@almaden.ibm.com] 
Sent: Thursday, April 09, 2009 10:30 AM
To: core-user@hadoop.apache.org
Subject: reduce task specific jvm arg

Hi,

Is there a way to set jvm parameters only for reduce tasks in Hadoop?
Thanks,

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

jun...@almaden.ibm.com


Re: Directory /tmp/hadoop-hadoop/dfs/name is in an inconsistent state: storage directory does not exist

2009-04-15 Thread Alex Loddengaard
Data stored to /tmp has no consistency / reliability guarantees.  Your OS
can delete that data at any time.

Configure hadoop-site.xml to store data elsewhere.  Grep for "/tmp" in
hadoop-default.xml to see all the configuration options you'll have to
change.  Here's the list I came up with:

hadoop.tmp.dir
fs.checkpoint.dir
dfs.name.dir
dfs.data.dir
mapred.local.dir
mapred.system.dir
mapred.temp.dir

Again, you need to be storing your data somewhere other than /tmp.
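
For example, a minimal hadoop-site.xml fragment along those lines might look
like this (the /srv/hadoop paths are just placeholders; point them at whatever
persistent disk you have):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/srv/hadoop/tmp</value>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>/srv/hadoop/dfs/name</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/srv/hadoop/dfs/data</value>
</property>

Several of the other properties in the list default to locations under
hadoop.tmp.dir, so moving that alone covers a lot of them, but it is safest to
check each one against hadoop-default.xml.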

Alex

On Tue, Apr 14, 2009 at 6:06 PM, Pankil Doshi  wrote:

> Hello Everyone,
>
> At times I get the following error when I restart my cluster desktops
> (before that I shut down mapred and dfs properly, though).
> The temp folder contains the directory it is looking for, yet I still get
> this error.
> The only solution I have found to get rid of this error is to format my dfs
> entirely, load the data again, and start the whole process over.
>
> But in doing that I lose my data on HDFS and have to reload it.
>
> Does anyone have any clue about it?
>
> Error from the log file:
>
> 2009-04-14 19:40:29,963 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
> /
> STARTUP_MSG: Starting NameNode
> STARTUP_MSG:   host = Semantic002/192.168.1.133
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.18.3
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
> 736250;
> compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
> /
> 2009-04-14 19:40:30,958 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
> Initializing RPC Metrics with hostName=NameNode, port=9000
> 2009-04-14 19:40:30,996 INFO org.apache.hadoop.dfs.NameNode: Namenode up
> at:
> Semantic002/192.168.1.133:9000
> 2009-04-14 19:40:31,007 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> Initializing JVM Metrics with processName=NameNode, sessionId=null
> 2009-04-14 19:40:31,014 INFO org.apache.hadoop.dfs.NameNodeMetrics:
> Initializing NameNodeMeterics using context
> object:org.apache.hadoop.metrics.spi.NullCont
> ext
> 2009-04-14 19:40:31,160 INFO org.apache.hadoop.fs.FSNamesystem:
>
> fsOwner=hadoop,hadoop,adm,dialout,fax,cdrom,floppy,tape,audio,dip,plugdev,scanner,fuse,admin
> 2009-04-14 19:40:31,161 INFO org.apache.hadoop.fs.FSNamesystem:
> supergroup=supergroup
> 2009-04-14 19:40:31,161 INFO org.apache.hadoop.fs.FSNamesystem:
> isPermissionEnabled=true
> 2009-04-14 19:40:31,183 INFO org.apache.hadoop.dfs.FSNamesystemMetrics:
> Initializing FSNamesystemMeterics using context
> object:org.apache.hadoop.metrics.spi.
> NullContext
> 2009-04-14 19:40:31,184 INFO org.apache.hadoop.fs.FSNamesystem: Registered
> FSNamesystemStatusMBean
> 2009-04-14 19:40:31,248 INFO org.apache.hadoop.dfs.Storage: Storage
> directory /tmp/hadoop-hadoop/dfs/name does not exist.
> 2009-04-14 19:40:31,251 ERROR org.apache.hadoop.fs.FSNamesystem:
> FSNamesystem initialization failed.
> org.apache.hadoop.dfs.InconsistentFSStateException: Directory
> /tmp/hadoop-hadoop/dfs/name is in an inconsistent state: storage directory
> does not exist or is
>  not accessible.
>at
> org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:211)
>at
> org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
>at
> org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
>at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:273)
>at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
>at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:193)
>at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:179)
>at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
>at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)
> 2009-04-14 19:40:31,261 INFO org.apache.hadoop.ipc.Server: Stopping server
> on 9000
> 2009-04-14 19:40:31,262 ERROR org.apache.hadoop.dfs.NameNode:
> org.apache.hadoop.dfs.InconsistentFSStateException: Directory
> /tmp/hadoop-hadoop/dfs/name is in
>  an inconsistent state: storage directory does not exist or is not
> accessible.
>at
> org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:211)
>at
> org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
>at
> org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
>at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:273)
>at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
>at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:193)
>at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:179)
>at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
>at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)
>
> 2009-04-14 19:40:31,267 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:
> /
> :
>
> Thanks
>
> Pankil
>


Re: getting DiskErrorException during map

2009-04-15 Thread Jim Twensky
Alex,

Yes, I bounced the Hadoop daemons after I changed the configuration files.

I also tried setting $HADOOP_CONF_DIR to the directory where my
hadoop-site.xml file resides, but it didn't work.
However, I'm sure that HADOOP_CONF_DIR is not the issue because other
properties that I changed in hadoop-site.xml
seem to be properly set. Also, here is a section from my hadoop-site.xml
file:


<property>
  <name>hadoop.tmp.dir</name>
  <value>/scratch/local/jim/hadoop-${user.name}</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/scratch/local/jim/hadoop-${user.name}/mapred/local</value>
</property>

I also created /scratch/local/jim/hadoop-jim/mapred/local on each task
tracker since I know
directories that do not exist are ignored.

When I manually ssh to the task trackers, I can see the directory
/scratch/local/jim/hadoop-jim/dfs
is automatically created so is it seems like  hadoop.tmp.dir is set
properly. However, hadoop still creates
/tmp/hadoop-jim/mapred/local and uses that directory for the local storage.

I'm starting to suspect that mapred.local.dir is being overridden with a default
value of /tmp/hadoop-${user.name}
somewhere inside the binaries.
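A quick sanity check, not from the original thread: a tiny program run with
the same Hadoop jar and conf directory that the task trackers use shows which
value actually wins; if it prints the /tmp path, the hadoop-site.xml being
edited is not the one on that classpath. The class name here is made up for
the example.

import org.apache.hadoop.conf.Configuration;

public class PrintLocalDir {
    public static void main(String[] args) {
        // Loads hadoop-default.xml and then hadoop-site.xml from the classpath.
        Configuration conf = new Configuration();
        System.out.println("mapred.local.dir = " + conf.get("mapred.local.dir"));
        System.out.println("hadoop.tmp.dir   = " + conf.get("hadoop.tmp.dir"));
    }
}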

-jim

On Tue, Apr 14, 2009 at 4:07 PM, Alex Loddengaard  wrote:

> First, did you bounce the Hadoop daemons after you changed the
> configuration
> files?  I think you'll have to do this.
>
> Second, I believe 0.19.1 has hadoop-default.xml baked into the jar.  Try
> setting $HADOOP_CONF_DIR to the directory where hadoop-site.xml lives.  For
> whatever reason your hadoop-site.xml (and the hadoop-default.xml you tried
> to change) are probably not being loaded.  $HADOOP_CONF_DIR should fix
> this.
>
> Good luck!
>
> Alex
>
> On Mon, Apr 13, 2009 at 11:25 AM, Jim Twensky 
> wrote:
>
> > Thank you Alex, you are right. There are quotas on the systems that I'm
> > working. However, I tried to change mapred.local.dir as follows:
> >
> > --inside hadoop-site.xml:
> >
> ><property>
> >  <name>mapred.child.tmp</name>
> >  <value>/scratch/local/jim</value>
> ></property>
> ><property>
> >  <name>hadoop.tmp.dir</name>
> >  <value>/scratch/local/jim</value>
> ></property>
> ><property>
> >  <name>mapred.local.dir</name>
> >  <value>/scratch/local/jim</value>
> ></property>
> >
> >  and observed that the intermediate map outputs are still being written
> > under /tmp/hadoop-jim/mapred/local
> >
> > I'm confused at this point since I also tried setting these values
> directly
> > inside the hadoop-default.xml and that didn't work either. Is there any
> > other property that I'm supposed to change? I tried searching for "/tmp"
> in
> > the hadoop-default.xml file but couldn't find anything else.
> >
> > Thanks,
> > Jim
> >
> >
> > On Tue, Apr 7, 2009 at 9:35 PM, Alex Loddengaard 
> > wrote:
> >
> > > The getLocalPathForWrite function that throws this Exception assumes
> that
> > > you have space on the disks that mapred.local.dir is configured on.
>  Can
> > > you
> > > verify with `df` that those disks have space available?  You might also
> > try
> > > moving mapred.local.dir off of /tmp if it's configured to use /tmp
> right
> > > now; I believe some systems have quotas on /tmp.
> > >
> > > Hope this helps.
> > >
> > > Alex
> > >
> > > On Tue, Apr 7, 2009 at 7:22 PM, Jim Twensky 
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm using Hadoop 0.19.1 and I have a very small test cluster with 9
> > > nodes,
> > > > 8
> > > > of them being task trackers. I'm getting the following error and my
> > jobs
> > > > keep failing when map processes start hitting 30%:
> > > >
> > > > org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> > any
> > > > valid local directory for
> > > >
> > > >
> > >
> >
> taskTracker/jobcache/job_200904072051_0001/attempt_200904072051_0001_m_00_1/output/file.out
> > > >at
> > > >
> > > >
> > >
> >
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
> > > >at
> > > >
> > > >
> > >
> >
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> > > >at
> > > >
> > > >
> > >
> >
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
> > > >at
> > > >
> > > >
> > >
> >
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1209)
> > > >at
> > > >
> > org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:867)
> > > >at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> > > >at org.apache.hadoop.mapred.Child.main(Child.java:158)
> > > >
> > > >
> > > > I googled many blogs and web pages but I could neither understand why
> > > this
> > > > happens nor found a solution to this. What does that error message
> mean
> > > and
> > > > how can avoid it, any suggestions?
> > > >
> > > > Thanks in advance,
> > > > -jim
> > > >
> > >
> >
>


Re: Directory /tmp/hadoop-hadoop/dfs/name is in an inconsistent state: storage directory does not exist

2009-04-15 Thread Pankil Doshi
Thanks

Pankil

On Wed, Apr 15, 2009 at 5:09 PM, Alex Loddengaard  wrote:

> Data stored to /tmp has no consistency / reliability guarantees.  Your OS
> can delete that data at any time.
>
> Configure hadoop-site.xml to store data elsewhere.  Grep for "/tmp" in
> hadoop-default.xml to see all the configuration options you'll have to
> change.  Here's the list I came up with:
>
> hadoop.tmp.dir
> fs.checkpoint.dir
> dfs.name.dir
> dfs.data.dir
> mapred.local.dir
> mapred.system.dir
> mapred.temp.dir
>
> Again, you need to be storing your data somewhere other than /tmp.
>
> Alex
>
> On Tue, Apr 14, 2009 at 6:06 PM, Pankil Doshi  wrote:
>
> > Hello Everyone,
> >
> > At times I get the following error when I restart my cluster desktops (before
> > that I shut down mapred and dfs properly, though).
> > The temp folder contains the directory it is looking for, yet I still get this
> > error.
> > The only solution I have found to get rid of this error is to format my
> dfs
> > entirely, load the data again, and start the whole process over.
> >
> > But in that I loose my data on HDFS and I have to reload it.
> >
> > Does anyone has any clue abt it?
> >
> > Error from the log file:
> >
> > 2009-04-14 19:40:29,963 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
> > /
> > STARTUP_MSG: Starting NameNode
> > STARTUP_MSG:   host = Semantic002/192.168.1.133
> > STARTUP_MSG:   args = []
> > STARTUP_MSG:   version = 0.18.3
> > STARTUP_MSG:   build =
> > https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
> > 736250;
> > compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
> > /
> > 2009-04-14 19:40:30,958 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
> > Initializing RPC Metrics with hostName=NameNode, port=9000
> > 2009-04-14 19:40:30,996 INFO org.apache.hadoop.dfs.NameNode: Namenode up
> > at:
> > Semantic002/192.168.1.133:9000
> > 2009-04-14 19:40:31,007 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> > Initializing JVM Metrics with processName=NameNode, sessionId=null
> > 2009-04-14 19:40:31,014 INFO org.apache.hadoop.dfs.NameNodeMetrics:
> > Initializing NameNodeMeterics using context
> > object:org.apache.hadoop.metrics.spi.NullCont
> > ext
> > 2009-04-14 19:40:31,160 INFO org.apache.hadoop.fs.FSNamesystem:
> >
> >
> fsOwner=hadoop,hadoop,adm,dialout,fax,cdrom,floppy,tape,audio,dip,plugdev,scanner,fuse,admin
> > 2009-04-14 19:40:31,161 INFO org.apache.hadoop.fs.FSNamesystem:
> > supergroup=supergroup
> > 2009-04-14 19:40:31,161 INFO org.apache.hadoop.fs.FSNamesystem:
> > isPermissionEnabled=true
> > 2009-04-14 19:40:31,183 INFO org.apache.hadoop.dfs.FSNamesystemMetrics:
> > Initializing FSNamesystemMeterics using context
> > object:org.apache.hadoop.metrics.spi.
> > NullContext
> > 2009-04-14 19:40:31,184 INFO org.apache.hadoop.fs.FSNamesystem:
> Registered
> > FSNamesystemStatusMBean
> > 2009-04-14 19:40:31,248 INFO org.apache.hadoop.dfs.Storage: Storage
> > directory /tmp/hadoop-hadoop/dfs/name does not exist.
> > 2009-04-14 19:40:31,251 ERROR org.apache.hadoop.fs.FSNamesystem:
> > FSNamesystem initialization failed.
> > org.apache.hadoop.dfs.InconsistentFSStateException: Directory
> > /tmp/hadoop-hadoop/dfs/name is in an inconsistent state: storage
> directory
> > does not exist or is
> >  not accessible.
> >at
> > org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:211)
> >at
> > org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
> >at
> > org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
> >at
> org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:273)
> >at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
> >at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:193)
> >at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:179)
> >at
> org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
> >at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)
> > 2009-04-14 19:40:31,261 INFO org.apache.hadoop.ipc.Server: Stopping
> server
> > on 9000
> > 2009-04-14 19:40:31,262 ERROR org.apache.hadoop.dfs.NameNode:
> > org.apache.hadoop.dfs.InconsistentFSStateException: Directory
> > /tmp/hadoop-hadoop/dfs/name is in
> >  an inconsistent state: storage directory does not exist or is not
> > accessible.
> >at
> > org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:211)
> >at
> > org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
> >at
> > org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
> >at
> org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:273)
> >at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
> >at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:193)
> >at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:179)
> >at
> org.apache.hadoop.dfs.

Error reading task output

2009-04-15 Thread Cam Macdonell


Hi,

I'm getting the following warning when running the simple wordcount and 
grep examples.


09/04/15 16:54:16 INFO mapred.JobClient: Task Id : 
attempt_200904151649_0001_m_19_0, Status : FAILED

Too many fetch-failures
09/04/15 16:54:16 WARN mapred.JobClient: Error reading task 
outputhttp://localhost.localdomain:50060/tasklog?plaintext=true&taskid=attempt_200904151649_0001_m_19_0&filter=stdout
09/04/15 16:54:16 WARN mapred.JobClient: Error reading task 
outputhttp://localhost.localdomain:50060/tasklog?plaintext=true&taskid=attempt_200904151649_0001_m_19_0&filter=stderr


The only advice I could find from other posts with similar errors is to
set up /etc/hosts with all slave hostnames and their IPs.  I did this, but I
still get the warning above.  The output seems to come out all right,
however (I guess that's why it is a warning).


I tried running wget on the http:// address in the warning message and I
got the following back:


2009-04-15 16:53:46 ERROR 400: Argument taskid is required.

So perhaps the wrong task ID is being passed to the http request.  Any 
ideas on what can get rid of these warnings?
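One thing worth ruling out, not from the original post: if that URL is passed
to wget unquoted, the shell treats each "&" as a background operator and drops
everything after it, including the taskid parameter, and that alone produces
"ERROR 400: Argument taskid is required". Quoting the whole URL, e.g.
wget 'http://localhost.localdomain:50060/tasklog?plaintext=true&taskid=...&filter=stdout'
(with the real attempt id in place of the dots), tests the actual request.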


Thanks,
Cam


Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-15 Thread Kevin Peterson
On Tue, Apr 14, 2009 at 2:35 AM, tim robertson wrote:

>
> I am considering (for better throughput as maps generate huge request
> volumes) pregenerating all my tiles (PNG) and storing them in S3 with
> cloudfront.  There will be billions of PNGs produced each at 1-3KB
> each.
>

Storing billions of PNGs at 1-3 KB each in S3 will be perfectly fine;
there is no need to generate them all and then push them at once, as long as
you store each one in its own S3 object (which they must be, if you intend
to fetch them using CloudFront). Each S3 object is independent and can be
written fully in parallel. If you are writing to the same S3 object twice,
... well, you're doing it wrong.

However, do the math on the costs for S3. We were doing something similar,
and found that we were spending a fortune on our PUT requests at $0.01 per
1,000, and next to nothing on storage. I've since moved to a more complicated
model where I pack many small items into each object and store an index in
SimpleDB. You'll need to partition your SimpleDB domains if you do this.
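Rough arithmetic on the prices mentioned above: at $0.01 per 1,000 PUTs, one
billion tiles cost about $10,000 in PUT requests alone, while those same
billion tiles at 1-3 KB each amount to only about 1-3 TB of storage.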


Re: Map-Reduce Slow Down

2009-04-15 Thread jason hadoop
Double check that there is no firewall in place.
At one point a bunch of new machines were kickstarted and placed in a
cluster, and they all failed with something similar.
It turned out the kickstart script had enabled the firewall with a rule
that blocked ports in the 50k range.
It took us a while to even think to check that, since it was not part of our
normal machine configuration.
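A quick way to confirm from the failing datanode is something like
"telnet node18 54310" (the NameNode address in the log below); if that is
refused or hangs while the NameNode is running, check "iptables -L -n" on
node18 for rules covering the Hadoop ports.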

On Wed, Apr 15, 2009 at 11:04 AM, Mithila Nagendra  wrote:

> Hi Aaron
> I will look into that thanks!
>
> I spoke to the admin who oversees the cluster. He said that the gateway
> comes into the picture only when one of the nodes communicates with a node
> outside of the cluster. But in my case the communication is carried out
> between nodes which all belong to the same cluster.
>
> Mithila
>
> On Wed, Apr 15, 2009 at 8:59 PM, Aaron Kimball  wrote:
>
> > Hi,
> >
> > I wrote a blog post a while back about connecting nodes via a gateway.
> See
> >
> http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
> >
> > This assumes that the client is outside the gateway and all
> > datanodes/namenode are inside, but the same principles apply. You'll just
> > need to set up ssh tunnels from every datanode to the namenode.
> >
> > - Aaron
> >
> >
> > On Wed, Apr 15, 2009 at 10:19 AM, Ravi Phulari  >wrote:
> >
> >> Looks like your NameNode is down.
> >> Verify that the hadoop processes are running (jps should show you all
> >> running java processes).
> >> If your hadoop processes are running, try restarting them.
> >> I guess this problem is due to your fsimage not being correct.
> >> You might have to format your namenode.
> >> Hope this helps.
> >>
> >> Thanks,
> >> --
> >> Ravi
> >>
> >>
> >> On 4/15/09 10:15 AM, "Mithila Nagendra"  wrote:
> >>
> >> The log file runs into thousands of lines, with the same message being
> >> displayed every time.
> >>
> >> On Wed, Apr 15, 2009 at 8:10 PM, Mithila Nagendra 
> >> wrote:
> >>
> >> > The log file : hadoop-mithila-datanode-node19.log.2009-04-14 has the
> >> > following in it:
> >> >
> >> > 2009-04-14 10:08:11,499 INFO org.apache.hadoop.dfs.DataNode:
> >> STARTUP_MSG:
> >> > /
> >> > STARTUP_MSG: Starting DataNode
> >> > STARTUP_MSG:   host = node19/127.0.0.1
> >> > STARTUP_MSG:   args = []
> >> > STARTUP_MSG:   version = 0.18.3
> >> > STARTUP_MSG:   build =
> >> > https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
> >> > 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
> >> > /
> >> > 2009-04-14 10:08:12,915 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> >> > 2009-04-14 10:08:13,925 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> >> > 2009-04-14 10:08:14,935 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> >> > 2009-04-14 10:08:15,945 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 3 time(s).
> >> > 2009-04-14 10:08:16,955 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 4 time(s).
> >> > 2009-04-14 10:08:17,965 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 5 time(s).
> >> > 2009-04-14 10:08:18,975 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 6 time(s).
> >> > 2009-04-14 10:08:19,985 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 7 time(s).
> >> > 2009-04-14 10:08:20,995 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 8 time(s).
> >> > 2009-04-14 10:08:22,005 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 9 time(s).
> >> > 2009-04-14 10:08:22,008 INFO org.apache.hadoop.ipc.RPC: Server at
> >> node18/
> >> > 192.168.0.18:54310 not available yet, Z...
> >> > 2009-04-14 10:08:24,025 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> >> > 2009-04-14 10:08:25,035 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> >> > 2009-04-14 10:08:26,045 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> >> > 2009-04-14 10:08:27,055 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 3 time(s).
> >> > 2009-04-14 10:0

RE: More Replication on dfs

2009-04-15 Thread Puri, Aseem
Hi
My problem is not that my data is under replicated. I have 3
data nodes. In my hadoop-site.xml I also set the configuration as:

  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>

But even after this, data is replicated to 3 nodes instead of two.

Now, please tell what can be the problem?
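One general HDFS detail that may explain this (not specific to this thread):
dfs.replication is a client-side default applied when a file is created, so
files written while the default was 3 keep 3 replicas even after the setting
is lowered. Existing files can be changed explicitly, for example:

  bin/hadoop dfs -setrep -R 2 /

which reduces everything under / to 2 replicas.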

Thanks & Regards
Aseem Puri



-Original Message-
From: Raghu Angadi [mailto:rang...@yahoo-inc.com] 
Sent: Wednesday, April 15, 2009 2:58 AM
To: core-user@hadoop.apache.org
Subject: Re: More Replication on dfs

Aseem,

Regd over-replication, it is mostly app related issue as Alex mentioned.

But if you are concerned about under-replicated blocks in fsck output :

These blocks should not stay under-replicated if you have enough nodes 
and enough space on them (check NameNode webui).

Try grep-ing for one of the blocks in the NameNode log (and datanode logs as
well, since you have just 3 nodes).

Raghu.

Puri, Aseem wrote:
> Alex,
> 
> Output of the $ bin/hadoop fsck / command after running an HBase data insert
> command on a table is:
> 
> .
> .
> .
> .
> .
> /hbase/test/903188508/tags/info/4897652949308499876:  Under replicated
> blk_-5193
> 695109439554521_3133. Target Replicas is 3 but found 1 replica(s).
> .
> /hbase/test/903188508/tags/mapfiles/4897652949308499876/data:  Under
> replicated
> blk_-1213602857020415242_3132. Target Replicas is 3 but found 1
> replica(s).
> .
> /hbase/test/903188508/tags/mapfiles/4897652949308499876/index:  Under
> replicated
>  blk_3934493034551838567_3132. Target Replicas is 3 but found 1
> replica(s).
> .
> /user/HadoopAdmin/hbase table.doc:  Under replicated
> blk_4339521803948458144_103
> 1. Target Replicas is 3 but found 2 replica(s).
> .
> /user/HadoopAdmin/input/bin.doc:  Under replicated
> blk_-3661765932004150973_1030
> . Target Replicas is 3 but found 2 replica(s).
> .
> /user/HadoopAdmin/input/file01.txt:  Under replicated
> blk_2744169131466786624_10
> 01. Target Replicas is 3 but found 2 replica(s).
> .
> /user/HadoopAdmin/input/file02.txt:  Under replicated
> blk_2021956984317789924_10
> 02. Target Replicas is 3 but found 2 replica(s).
> .
> /user/HadoopAdmin/input/test.txt:  Under replicated
> blk_-3062256167060082648_100
> 4. Target Replicas is 3 but found 2 replica(s).
> ...
> /user/HadoopAdmin/output/part-0:  Under replicated
> blk_8908973033976428484_1
> 010. Target Replicas is 3 but found 2 replica(s).
> Status: HEALTHY
>  Total size:48510226 B
>  Total dirs:492
>  Total files:   439 (Files currently being written: 2)
>  Total blocks (validated):  401 (avg. block size 120973 B) (Total
> open file
> blocks (not validated): 2)
>  Minimally replicated blocks:   401 (100.0 %)
>  Over-replicated blocks:0 (0.0 %)
>  Under-replicated blocks:   399 (99.50124 %)
>  Mis-replicated blocks: 0 (0.0 %)
>  Default replication factor:2
>  Average block replication: 1.3117207
>  Corrupt blocks:0
>  Missing replicas:  675 (128.327 %)
>  Number of data-nodes:  2
>  Number of racks:   1
> 
> 
> The filesystem under path '/' is HEALTHY
> Please tell what is wrong.
> 
> Aseem
> 
> -Original Message-
> From: Alex Loddengaard [mailto:a...@cloudera.com] 
> Sent: Friday, April 10, 2009 11:04 PM
> To: core-user@hadoop.apache.org
> Subject: Re: More Replication on dfs
> 
> Aseem,
> 
> How are you verifying that blocks are not being replicated?  Have you
> ran
> fsck?  *bin/hadoop fsck /*
> 
> I'd be surprised if replication really wasn't happening.  Can you run
> fsck
> and pay attention to "Under-replicated blocks" and "Mis-replicated
> blocks?"
> In fact, can you just copy-paste the output of fsck?
> 
> Alex
> 
> On Thu, Apr 9, 2009 at 11:23 PM, Puri, Aseem
> wrote:
> 
>> Hi
>>I also tried the command $ bin/hadoop balancer. But still the
>> same problem.
>>
>> Aseem
>>
>> -Original Message-
>> From: Puri, Aseem [mailto:aseem.p...@honeywell.com]
>> Sent: Friday, April 10, 2009 11:18 AM
>> To: core-user@hadoop.apache.org
>> Subject: RE: More Replication on dfs
>>
>> Hi Alex,
>>
>>Thanks for sharing your knowledge. So far I have three
>> machines, and I need to check the behavior of Hadoop, so I want the
>> replication factor to be 2. I started my Hadoop server with
>> replication factor 3. After that I uploaded 3 files to run the word
>> count program. But since all my files are stored on one machine and
>> replicated to the other datanodes as well, my map reduce program takes
> input
>> from one Datanode only. I want my files to be on different data nodes
> so
>> that I can check the functionality of map reduce properly.
>>
>>Also, before starting my Hadoop server again with replication
>> factor 2, I formatted all Datanodes and deleted all old data manually.
>>
>> Please suggest what I should do now.
>>
>> Regards,
>> Aseem Puri
>>
>>
>> -Original Message-
>> From: Mithila Nagendra [mailto:mnage...@asu.edu]
>> Sent: Friday, April 10, 2009 10:56 AM
>> To: core-user@ha
