Re: Hadoop & EC2

2008-09-01 Thread Andrew Hitchcock
Hi Ryan,

Just a heads up, if you require more than the 20 node limit, Amazon
provides a form to request a higher limit:

http://www.amazon.com/gp/html-forms-controller/ec2-request

Andrew

On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> I'm curious to see how many people are using EC2 to execute their
> Hadoop cluster and map/reduce programs, and how many are using
> home-grown datacenters. It seems like the 20 node limit with EC2 is a
> bit crippling when one wants to process many gigabytes of data. Has
> anyone found this to be the case? How much data are people processing
> with their 20 node limit on EC2? Curious what the thoughts are...
>
> Thanks,
> Ryan
>


Hadoop & EC2

2008-09-01 Thread Ryan LeCompte
Hello all,

I'm curious to see how many people are using EC2 to execute their
Hadoop cluster and map/reduce programs, and how many are using
home-grown datacenters. It seems like the 20 node limit with EC2 is a
bit crippling when one wants to process many gigabytes of data. Has
anyone found this to be the case? How much data are people processing
with their 20 node limit on EC2? Curious what the thoughts are...

Thanks,
Ryan


Re: Hadoop on Ubuntu Setup Guide

2008-09-01 Thread Alex Loddengaard
Thanks, Camilo.  I've added your guide to the Ubuntu wiki page:

<
http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
>

Feel free to update that wiki page as you see fit :).

Alex

On Tue, Sep 2, 2008 at 12:21 PM, Camilo Gonzalez <[EMAIL PROTECTED]> wrote:

> Hi!
>
> I don't know if it may be useful, but I've just published a guide on setting
> up Hadoop on Ubuntu 8.04. Please check it out. This is the first time I've
> published any guides or tutorials on the Web, so all feedback is
> appreciated.
>
> As a last step of the guide, there is a small Eclipse Java project that
> uses the Hadoop API to query the files in a folder.
>
> http://b.camilogt.com/2008/09/02/hadoop-on-ubuntu-805.aspx
>
> Have a nice day,
>
> Camilo Gonzalez
>


Hadoop on Ubuntu Setup Guide

2008-09-01 Thread Camilo Gonzalez
Hi!

I don't know if it may be useful, but I've just published a guide on setting
up Hadoop on Ubuntu 8.04. Please check it out. This is the first time I've
published any guides or tutorials on the Web, so all feedback is
appreciated.

As a last step of the guide, there is a small Eclipse Java project that
uses the Hadoop API to query the files in a folder.

http://b.camilogt.com/2008/09/02/hadoop-on-ubuntu-805.aspx
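
If it helps to see what that kind of code looks like, here is a minimal
sketch (not taken from Camilo's guide; the path is a placeholder) that lists
the files in an HDFS folder with the FileSystem API of that era:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ListFolder {
    public static void main(String[] args) throws Exception {
      // Picks up fs.default.name from hadoop-site.xml on the classpath.
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // List every file/directory under the given (placeholder) folder.
      for (FileStatus status : fs.listStatus(new Path("/user/hadoop/input"))) {
        System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
      }
      fs.close();
    }
  }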

Have a nice day,

Camilo Gonzalez


Re: Integrate HADOOP and Map/Reduce paradigm into HPC environment

2008-09-01 Thread Hemanth Yamijala

Allen Wittenauer wrote:


> On 8/18/08 11:33 AM, "Filippo Spiga" <[EMAIL PROTECTED]> wrote:
>> Well, but I haven't understood how I should configure HOD to work in this
>> manner.
>>
>> For HDFS I follow this sequence of steps:
>> - conf/master contains only the master node of my cluster
>> - conf/slaves contains all nodes
>> - I start HDFS using bin/start-dfs.sh
>
> Right, fine...
>
>> Potentially I would allow all nodes to be used for MapReduce.
>> For HOD which parameter should I set in contrib/hod/conf/hodrc? Should I
>> change only the gridservice-hdfs section?
>
> I was hoping the HOD folks would answer this question for you, but they
> are apparently sleeping. :)

Woops! Sorry, I missed this.

> Anyway, yes, if you point gridservice-hdfs to a static HDFS, it should
> use that as the -default- HDFS. That doesn't prevent a user from using HOD
> to create a custom HDFS as part of their job submission.

Allen's answer is perfect. Please refer to
http://hadoop.apache.org/core/docs/current/hod_user_guide.html#Using+an+external+HDFS
for more information about how to set up the gridservice-hdfs section to
use a static or external HDFS.
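
For reference, pointing HOD at an external/static HDFS in the hodrc looks
roughly like this (option names as described in the HOD user guide above;
the host and ports are placeholders and should be checked against your
version):

  [gridservice-hdfs]
  external  = true
  host      = namenode.example.com
  fs_port   = 54310
  info_port = 50070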




Re: basic questions about Hadoop!

2008-09-01 Thread Mafish Liu
On Sat, Aug 30, 2008 at 10:12 AM, Gerardo Velez <[EMAIL PROTECTED]>wrote:

> Hi Victor!
>
> I got problem with remote writing as well, so I tried to go further on this
> and I would like to share what I did, maybe you have more luck than me
>
> 1) as I'm working with user gvelez in remote host I had to give write
> access
> to all, like this:
>
>bin/hadoop dfs -chmod -R a+w input
>
> 2) After that, there is no more connection refused error, but instead I got
> following exception
>
>
>
> $ bin/hadoop dfs -copyFromLocal README.txt /user/hadoop/input/README.txt
> cygpath: cannot create short name of d:\hadoop\hadoop-0.17.2\logs
> 08/08/29 19:06:51 INFO dfs.DFSClient:
> org.apache.hadoop.ipc.RemoteException:
> jav
> a.io.IOException: File /user/hadoop/input/README.txt could only be
> replicated to
>  0 nodes, instead of 1
>at
> org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.ja
> va:1145)
>at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
>at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
> sorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:585)
>at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
>at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
>
How many datanodes do you have? Only one, I guess.
Modify your $HADOOP_HOME/conf/hadoop-site.xml and look up

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

and set the value to 1, to match your single datanode.


>
> On Fri, Aug 29, 2008 at 9:53 AM, Victor Samoylov <
> [EMAIL PROTECTED]
> > wrote:
>
> > Jeff,
> >
> > Thanks for detailed instructions, but on machine that is not hadoop
> server
> > I
> > got error:
> > ~/hadoop-0.17.2$ ./bin/hadoop dfs -copyFromLocal NOTICE.txt test
> > 08/08/29 19:33:07 INFO dfs.DFSClient: Exception in
> createBlockOutputStream
> > java.net.ConnectException: Connection refused
> > 08/08/29 19:33:07 INFO dfs.DFSClient: Abandoning block
> > blk_-7622891475776838399
> > The thing is that file was created, but with zero size.
> >
> > Do you have ideas why this happened?
> >
> > Thanks,
> > Victor
> >
> > On Fri, Aug 29, 2008 at 4:10 AM, Jeff Payne <[EMAIL PROTECTED]> wrote:
> >
> > > You can use the hadoop command line on machines that aren't hadoop
> > servers.
> > > If you copy the hadoop configuration from one of your master servers or
> > > data
> > > node to the client machine and run the command line dfs tools, it will
> > copy
> > > the files directly to the data node.
> > >
> > > Or, you could use one of the client libraries.  The java client, for
> > > example, allows you to open up an output stream and start dumping bytes
> > on
> > > it.
> > >
> > > On Thu, Aug 28, 2008 at 5:05 PM, Gerardo Velez <
> [EMAIL PROTECTED]
> > > >wrote:
> > >
> > > > Hi Jeff, thank you for answering!
> > > >
> > > > What about remote writing on HDFS, lets suppose I got an application
> > > server
> > > > on a
> > > > linux server A and I got a Hadoop cluster on servers B (master), C
> > > (slave),
> > > > D (slave)
> > > >
> > > > What I would like is sent some files from Server A to be processed by
> > > > hadoop. So in order to do so, what I need to do do I need send
> > those
> > > > files to master server first and then copy those to HDFS?
> > > >
> > > > or can I pass those files to any slave server?
> > > >
> > > > basically I'm looking for remote writing due to files to be process
> are
> > > not
> > > > being generated on any haddop server.
> > > >
> > > > Thanks again!
> > > >
> > > > -- Gerardo
> > > >
> > > >
> > > >
> > > > Regarding
> > > >
> > > > On Thu, Aug 28, 2008 at 4:04 PM, Jeff Payne <[EMAIL PROTECTED]>
> > wrote:
> > > >
> > > > > Gerardo:
> > > > >
> > > > > I can't really speak to all of your questions, but the master/slave
> > > issue
> > > > > is
> > > > > a common concern with hadoop.  A cluster has a single namenode and
> > > > > therefore
> > > > > a single point of failure.  There is also a secondary name node
> > process
> > > > > which runs on the same machine as the name node in most default
> > > > > configurations.  You can make it a different machine by adjusting
> the
> > > > > master
> > > > > file.  One of the more experienced lurkers should feel free to
> > correct
> > > > me,
> > > > > but my understanding is that the secondary name node keeps track of
> > all
> > > > the
> > > > > same index information used by the primary name node.  So, if the
> > > > namenode
> > > > > fails, there is no automatic recovery, but you can always tweak
> your
> > > > > cluster
> > > > > configuration to make the secondary namenode the primary and safely
> > > > restart
> > > > > the cluster.
> > > > >
> > > > > As for the storage of files, the name node is really just the
> traffic
> > > cop
> > > > > for HDFS.  No HDFS files are actually stored on that machine.  It's
> > > > > basically used as a directory and lock manager, etc.  The files are
> > > > stored
> > > > > 

Re: Error while uploading large file to S3 via Hadoop 0.18

2008-09-01 Thread Ryan LeCompte
Thanks, trying it now!

Ryan


On Mon, Sep 1, 2008 at 6:04 PM, Albert Chern <[EMAIL PROTECTED]> wrote:
> Increase the retry buffer size in jets3t.properties and maybe up the number
> of retries while you're at it.  If there is no template file included in
> Hadoop's conf dir you can find it at the jets3t web site.  Make sure that
> it's from the same version that your copy of Hadoop is using.
>
> On Mon, Sep 1, 2008 at 1:32 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
>
>> Hello,
>>
>> I'm trying to upload a fairly large file (18GB or so) to my AWS S3
>> account via bin/hadoop fs -put ... s3://...
>>
>> It copies for a good 15 or 20 minutes, and then eventually errors out
>> with a failed retry attempt (saying that it can't retry since it has
>> already written a certain number of bytes, etc. sorry don't have the
>> original error message at the moment). Has anyone experienced anything
>> similar? Can anyone suggest a workaround or a way to specify retries?
>> Should I use another tool for uploading large files to s3?
>>
>> Thanks,
>> Ryan
>>
>


Re: Error while uploading large file to S3 via Hadoop 0.18

2008-09-01 Thread Albert Chern
Increase the retry buffer size in jets3t.properties and maybe up the number
of retries while you're at it.  If there is no template file included in
Hadoop's conf dir you can find it at the jets3t web site.  Make sure that
it's from the same version that your copy of Hadoop is using.
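
For reference, the retry-related entries in jets3t.properties look something
like the following; the exact key names and defaults vary by JetS3t version,
so verify them against the template on the JetS3t site:

  # Bytes buffered so a failed upload can be retransmitted (assumed key name).
  s3service.stream-retry-buffer-size=1048576
  # Maximum number of times a failed request is retried (assumed key name).
  httpclient.retry-max=10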

On Mon, Sep 1, 2008 at 1:32 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I'm trying to upload a fairly large file (18GB or so) to my AWS S3
> account via bin/hadoop fs -put ... s3://...
>
> It copies for a good 15 or 20 minutes, and then eventually errors out
> with a failed retry attempt (saying that it can't retry since it has
> already written a certain number of bytes, etc. sorry don't have the
> original error message at the moment). Has anyone experienced anything
> similar? Can anyone suggest a workaround or a way to specify retries?
> Should I use another tool for uploading large files to s3?
>
> Thanks,
> Ryan
>


Error while uploading large file to S3 via Hadoop 0.18

2008-09-01 Thread Ryan LeCompte
Hello,

I'm trying to upload a fairly large file (18GB or so) to my AWS S3
account via bin/hadoop fs -put ... s3://...

It copies for a good 15 or 20 minutes, and then eventually errors out
with a failed retry attempt (saying that it can't retry since it has
already written a certain number of bytes, etc. sorry don't have the
original error message at the moment). Has anyone experienced anything
similar? Can anyone suggest a workaround or a way to specify retries?
Should I use another tool for uploading large files to s3?

Thanks,
Ryan


Re: Hadoop 101

2008-09-01 Thread Owen O'Malley
On Mon, Sep 1, 2008 at 2:27 AM, HHB <[EMAIL PROTECTED]> wrote:

>
> Hey,
> I'm reading about Hadoop lately but I'm unable to understand it.
> Would you please explain it to me in easy words?


Let me try. Hadoop is a framework that lets you write programs that work
with very large datasets in reasonable amounts of time using "normal"
computers instead of fancy servers.

It does this by using large numbers of computers (from 4 up to 3,000 or so)
to both store and process the data. When you write programs that run on
large numbers of computers, one of the primary requirements is that they
handle failures automatically, because you'll be losing a handful of machines
a day. Hadoop handles those failures for you, both for storage on the local
disks and for computation.

Also take a look at the presentations about it. Google also has a lot of
video lectures about it.

>
> How to know if I can employ Hadoop in my current company?


The primary way that companies use Hadoop is to process large sets of log
data. Yahoo collects terabytes of user-behavior data a day and uses Hadoop
to understand it. Hadoop also has lots of other uses. See the Powered By
Hadoop page for
examples of what users are doing with it.
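
If a concrete example helps, the canonical word-count job (the standard
illustration, not part of this message) looks roughly like this against the
0.18-era mapred API:

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.StringTokenizer;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class WordCount {

    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        // Emit (word, 1) for every token in the input line.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, ONE);
        }
      }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        // Sum the per-word counts emitted by the mappers.
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }

Packaged into a jar, it runs with bin/hadoop jar wordcount.jar WordCount
<input dir> <output dir>.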

-- Owen


Re: Integrate HADOOP and Map/Reduce paradigm into HPC environment

2008-09-01 Thread Allen Wittenauer



On 8/18/08 11:33 AM, "Filippo Spiga" <[EMAIL PROTECTED]> wrote:
> Well, but I haven't understood how I should configure HOD to work in this
> manner.
> 
> For HDFS I follow this sequence of steps:
> - conf/master contains only the master node of my cluster
> - conf/slaves contains all nodes
> - I start HDFS using bin/start-dfs.sh

Right, fine...

> Potentially I would allow all nodes to be used for MapReduce.
> For HOD which parameter should I set in contrib/hod/conf/hodrc? Should I
> change only the gridservice-hdfs section?

I was hoping the HOD folks would answer this question for you, but they
are apparently sleeping. :)

Anyway, yes, if you point gridservice-hdfs to a static HDFS,  it should
use that as the -default- HDFS. That doesn't prevent a user from using HOD
to create a custom HDFS as part of their job submission.



Re: Load balancing in HDFS

2008-09-01 Thread Allen Wittenauer



On 8/27/08 7:51 AM, "Mork0075" <[EMAIL PROTECTED]> wrote:

> This sounds really interesting. And when increasing the replicas for
> certain files, does the available throughput for those files increase too?

Yes, as there are more places to pull the file from.  This needs to get
weighed against the amount of work the name node will use to re-replicate
the file in case of failure and the total amount of disk space used... So
the extra bandwidth isn't "free".
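
For reference, raising the replication of a specific hot file can be done
with bin/hadoop dfs -setrep, or programmatically along these lines (the path
and replication factor below are placeholders):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class RaiseReplication {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // Ask the namenode to keep 10 replicas of this (placeholder) hot file;
      // the extra copies are created in the background.
      fs.setReplication(new Path("/data/popular-file"), (short) 10);
      fs.close();
    }
  }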

> 
> Allen Wittenauer schrieb:
>> 
>> 
>> On 8/27/08 12:54 AM, "Mork0075" <[EMAIL PROTECTED]> wrote:
>>> I'm planning to use HDFS as a DFS in a web application environment.
>>> There are two requirements: fault tolerence, which is ensured by the
>>> replicas and load balancing.
>> 
>> There is a SPOF in the form of the name node.  So depending upon your
>> needs, that may or may not be acceptable risk.
>> 
>> On 8/27/08 1:23 AM, "Mork0075" <[EMAIL PROTECTED]> wrote:
>>> Some documents stored in the HDFS could be very popular and
>>> therefore accessed more often than others. Then HDFS needs to balance the
>>> load - distribute the requests to different nodes. Is it possible?
>> 
>> Not automatically.  However, it is possible to manually/programmatically
>> increase the replication on files.
>> 
>> This is one of the possible uses for the new audit logging in 0.18... By
>> watching the log, it should be possible to determine which files need a
>> higher replication factor.
>> 
>> 
> 



Re: Timeouts at reduce stage

2008-09-01 Thread Jason Venner
We have trouble with that also, particularly when we have JMX enabled in
our jobs.
We have modified the /main/ that launches the children of the task
tracker to explicitly exit in its finally block. That helps substantially.
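
Roughly, the shape of that change is the following; this is only a sketch of
the idea, not the actual Hadoop source:

  public class ChildMain {
    public static void main(String[] args) {
      int exitCode = 0;
      try {
        runTask(args);           // placeholder for the real child-task body
      } catch (Throwable t) {
        t.printStackTrace();
        exitCode = 1;
      } finally {
        // Force the JVM down even if non-daemon threads (JMX connectors,
        // badly behaved JNI libraries) are still alive.
        System.exit(exitCode);
      }
    }

    private static void runTask(String[] args) { /* task work would go here */ }
  }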


We also have some jobs that do not seem to be killable by the
Process.destroy method; we suspect badly behaved external libraries
being used via JNI.

This is in 0.16.

Иван wrote:
Thank you, this suggestion seems to be very close to the real situation. The
cluster had already been left looping such (relatively) frequently failing
mapreduce jobs over a long period of time to produce a clearer picture of the
problem, and I tried to investigate this suggestion more closely when I read it.

After taking a look at the Ganglia monitoring system running on that same
cluster, it became clear that the cluster's computing resources are apparently
exhausted. The next step was quite simple and straightforward - just log in to
one random node and find out what is consuming the server's resources. The
answer became clear almost instantly, because the top and jps commands produced
a huge list of orphaned TaskTracker$Child processes consuming tons of CPU time
and RAM (in fact, almost all of it). Some other nodes had even run out of their
16G of RAM plus a few GB of swap and stopped responding at all.


This situation apparently isn't normal. I am going to try to repeat such a
test with some simpler jobs (probably something from the Hadoop distribution,
to make sure that everything is fine with the code) to find out more
definitely whether this orphaning of forked processes depends on the exact MR
job being run or not (theoretically it could still be something wrong with the
Hadoop/HBase configuration, or even with the operating system, some additional
installed software or, as was suggested earlier, the hardware).

I would be glad if someone could help me in this process with some advice
(googling this topic has already proved to be hard, because $ is treated as a
separator and the lookup usually turns up material about real child
processes). Maybe this situation is quite common and there is a definite
reason or solution?


Thanks!

Ivan Blinkov

-Original Message-
From: Karl Anderson <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Date: Fri, 29 Aug 2008 13:17:18 -0700
Subject: Re: Timeouts at reduce stage

  

On 29-Aug-08, at 3:53 AM, Иван wrote:


Thanks for a fast reply, but in fact it sometimes fails even on
default MR jobs like, for example, the rowcounter job from the HBase 0.2.0
distribution. Hardware problems are theoretically possible, but they
don't seem to be the case because everything else is operating
fine on the same set of servers. It seems that all major components
of each server are fine; even the disk arrays are regularly checked by
datacenter staff.
  
It could be due to a resource problem, I've found these hard to debug  
at times.  Tasks or parts of the framework can fail due to other tasks  
using up resources, and sometimes the errors you see don't make the  
cause easy to find.  I've had memory consumption in a mapper cause  
errors in other mappers, reducers, and fetching HDFS blocks, as well  
as job infrastructure failures that I don't really understand (for  
example, one task unable to find a file that was put in a job jar and  
found by other tasks).  I think all of my timeouts have been  
straightforward, but I could imagine resource consumption causing that  
in an otherwise unrelated task - IO blocking, swap, etc.





  


Re: basic questions about Hadoop!

2008-09-01 Thread Victor Samoylov
Gerardo,

Thanks for your information.
I've had success with remote writing to HDFS using the following steps:
1. Install the latest stable version (hadoop 0.17.2.1) on the data nodes
and the client machine.
2. Open ports 50010, 50070, 54310, 54311 on the data node machines so they
can be reached from the client machine and the other data nodes.
3. Use the ./bin/hadoop dfs -put command to send files to the remote HDFS
(a programmatic equivalent is sketched below).
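
The programmatic equivalent of step 3, using the Java FileSystem API, is
roughly the following sketch (the namenode host/port and the paths are
placeholders):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class RemotePut {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Point the client at the remote namenode (placeholder host/port).
      conf.set("fs.default.name", "hdfs://namenode.example.com:54310");
      FileSystem fs = FileSystem.get(conf);
      // Same effect as ./bin/hadoop dfs -put NOTICE.txt test
      fs.copyFromLocalFile(new Path("NOTICE.txt"), new Path("/user/victor/test"));
      fs.close();
    }
  }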

Hope this helps you.

Thanks,
Victor

On Sat, Aug 30, 2008 at 6:12 AM, Gerardo Velez <[EMAIL PROTECTED]>wrote:

> Hi Victor!
>
> I got problem with remote writing as well, so I tried to go further on this
> and I would like to share what I did, maybe you have more luck than me
>
> 1) as I'm working with user gvelez in remote host I had to give write
> access
> to all, like this:
>
>bin/hadoop dfs -chmod -R a+w input
>
> 2) After that, there is no more connection refused error, but instead I got
> following exception
>
>
>
> $ bin/hadoop dfs -copyFromLocal README.txt /user/hadoop/input/README.txt
> cygpath: cannot create short name of d:\hadoop\hadoop-0.17.2\logs
> 08/08/29 19:06:51 INFO dfs.DFSClient:
> org.apache.hadoop.ipc.RemoteException:
> jav
> a.io.IOException: File /user/hadoop/input/README.txt could only be
> replicated to
>  0 nodes, instead of 1
>at
> org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.ja
> va:1145)
>at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
>at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
> sorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:585)
>at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
>at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
>
>
>
> On Fri, Aug 29, 2008 at 9:53 AM, Victor Samoylov <
> [EMAIL PROTECTED]
> > wrote:
>
> > Jeff,
> >
> > Thanks for detailed instructions, but on machine that is not hadoop
> server
> > I
> > got error:
> > ~/hadoop-0.17.2$ ./bin/hadoop dfs -copyFromLocal NOTICE.txt test
> > 08/08/29 19:33:07 INFO dfs.DFSClient: Exception in
> createBlockOutputStream
> > java.net.ConnectException: Connection refused
> > 08/08/29 19:33:07 INFO dfs.DFSClient: Abandoning block
> > blk_-7622891475776838399
> > The thing is that file was created, but with zero size.
> >
> > Do you have ideas why this happened?
> >
> > Thanks,
> > Victor
> >
> > On Fri, Aug 29, 2008 at 4:10 AM, Jeff Payne <[EMAIL PROTECTED]> wrote:
> >
> > > You can use the hadoop command line on machines that aren't hadoop
> > servers.
> > > If you copy the hadoop configuration from one of your master servers or
> > > data
> > > node to the client machine and run the command line dfs tools, it will
> > copy
> > > the files directly to the data node.
> > >
> > > Or, you could use one of the client libraries.  The java client, for
> > > example, allows you to open up an output stream and start dumping bytes
> > on
> > > it.
> > >
> > > On Thu, Aug 28, 2008 at 5:05 PM, Gerardo Velez <
> [EMAIL PROTECTED]
> > > >wrote:
> > >
> > > > Hi Jeff, thank you for answering!
> > > >
> > > > What about remote writing on HDFS, lets suppose I got an application
> > > server
> > > > on a
> > > > linux server A and I got a Hadoop cluster on servers B (master), C
> > > (slave),
> > > > D (slave)
> > > >
> > > > What I would like is sent some files from Server A to be processed by
> > > > hadoop. So in order to do so, what I need to do do I need send
> > those
> > > > files to master server first and then copy those to HDFS?
> > > >
> > > > or can I pass those files to any slave server?
> > > >
> > > > basically I'm looking for remote writing due to files to be process
> are
> > > not
> > > > being generated on any haddop server.
> > > >
> > > > Thanks again!
> > > >
> > > > -- Gerardo
> > > >
> > > >
> > > >
> > > > Regarding
> > > >
> > > > On Thu, Aug 28, 2008 at 4:04 PM, Jeff Payne <[EMAIL PROTECTED]>
> > wrote:
> > > >
> > > > > Gerardo:
> > > > >
> > > > > I can't really speak to all of your questions, but the master/slave
> > > issue
> > > > > is
> > > > > a common concern with hadoop.  A cluster has a single namenode and
> > > > > therefore
> > > > > a single point of failure.  There is also a secondary name node
> > process
> > > > > which runs on the same machine as the name node in most default
> > > > > configurations.  You can make it a different machine by adjusting
> the
> > > > > master
> > > > > file.  One of the more experienced lurkers should feel free to
> > correct
> > > > me,
> > > > > but my understanding is that the secondary name node keeps track of
> > all
> > > > the
> > > > > same index information used by the primary name node.  So, if the
> > > > namenode
> > > > > fails, there is no automatic recovery, but you can always tweak
> your
> > > > > cluster
> > > > > configuration to make the secondary namenode the primary and safely
> > > > restart
> > > > > the cluster.
> > > > >
> > > > 

restarting datanode corrupts the hdfs

2008-09-01 Thread Barry Haddow
Hi 

Since upgrading to 0.18.0 I've noticed that restarting the datanode corrupts 
the hdfs so that the only option is to delete it and start again. I'm running 
hadoop in distributed mode, on a single host. It runs as the user hadoop and 
the hdfs is contained in a directory /home/hadoop/dfs.

When I restart hadoop using start-all.sh the datanode fails with the following 
message:

STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.18.0
STARTUP_MSG:   build = 
http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 686010; 
compiled by 'hadoopqa' on Thu Aug 14 19:48:33 UTC 2008
/
2008-09-01 12:06:55,871 ERROR org.apache.hadoop.dfs.DataNode: 
java.io.IOException: Found /home/hadoop/dfs/tmp/hadoop 
in /home/hadoop/dfs/tmp but it is not a file.
at 
org.apache.hadoop.dfs.FSDataset$FSVolume.recoverDetachedBlocks(FSDataset.java:437)
at org.apache.hadoop.dfs.FSDataset$FSVolume.<init>(FSDataset.java:310)
at org.apache.hadoop.dfs.FSDataset.<init>(FSDataset.java:671)
at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:277)
at org.apache.hadoop.dfs.DataNode.<init>(DataNode.java:190)
at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:2987)
at 
org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:2942)
at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:2950)
at org.apache.hadoop.dfs.DataNode.main(DataNode.java:3072)

2008-09-01 12:06:55,872 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG:

Running an fsck on the hdfs shows that it is corrupt, and the only way to fix 
it seems to be to delete it and reformat.

Any suggestions?
regards
Barry

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: datanodes in virtual networks.

2008-09-01 Thread Steve Loughran

Dmitry Pushkarev wrote:

> Dear hadoop users,
>
> Our lab in slowly switching from SGE to hadoop, however not everything seems
> to be easy and obvious. We are in no way computer scientists, we're just
> physicists, biologist and couple of statisticians trying to solve our
> computational problems, please take this into consideration if questions
> will look to you obvious..
>
> Our setup:
>
> 1.   Data cluster - 4 Raided and Hadooped servers, with 2TB of storage
> each, they all have real IP addresses, one of them reserved for NameNode.
>
> 2.   Computational cluster:  100 dualcore servers running Sun Grid
> Engine, they live on virtual network (10.0.0.X) and can connect to outside
> world, but not accessible from out of the cluster. On these we don't have
> root access, and these are shared via SGE with other people, who get
> reasonably nervous when see idle reserved servers.
>
> Basic Idea is to create on-demand computational cluster,  which when needed
> will reserve servers from second cluster run jobs and let them go.

OK. There are some hints that you can run hadoop atop SGE, though I've
not tried it.

> Currently it is done via script that reserves server for namenode 25 servers
> for datanode copies data from first cluster, runs job, send result back and
> releases servers. I still want to make them work together using one
> namenode.
>
> After a week playing with hadoop I couldn't answer some of my question vie
> thorough RTFM, so I'd really appreciate is you can answer at least some of
> them in our context:

I'll answer the questions I can; leave the other q's to others

> 1.   Is it possible to connect servers from second cluster to first
> namenode? What worries me is implementation of data-transfer protocol,
> because some of the nodes cannot be reached but they can easily reach any
> other node.  Will hadoop try to establish connection both ways to transfer
> data between nodes?


There's an assumption that every datanode belongs to a single namenode. 
You can bring up task trackers on separate machines/networks from the 
job tracker, as long as they are set up to point to it. The task-tracker 
to job tracker communications should be ok; its the transfer of between 
the task tracker and the filesystem that you have to worry about.
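
"Set up to point to it" concretely means something like the following in each
task tracker's hadoop-site.xml (the hostnames and ports are placeholders):

  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:54311</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:54310</value>
  </property>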


 


> 2.   It is possible to specify "reliability" of the node, that is to
> make replica on the node with raid installed counts as two replicas as
> probability of failure is much lower.

Not that I'm aware of.

> 3.   I also bumped into problems with decommissioning, after I add hosts
> to free to dfs.hosts.exclude file and refreshNodes, they are marked as
> "Decommission in progress" for days, even though data is removed from them
> within first several minutes. What I currently do is shoot them down with
> some delay, but I really hope to see "Decommissioned" one day. What am I
> probably doing wrong?
>
> 4.   The same question about dead hosts. I do a simple exercise: I
> create 20 datanodes on empty cluster, then I kill 15 of them and try to
> store a file on HDFS, hadoop fails because some nodes that it thinks "in
> service" aren't accessible. Is it possible to tell hadoop to remove these
> nodes from the list and do not try to store data on them? My current
> solution is hadoop-stop/start via cron every hour.

It sounds like the namenode should be doing more checking that the
datanodes are live.

> 5.   We also have some external secure storage that can be accesses via
> NFS from fists DATA cluster,  and it'd be great if I could somehow mount
> this storage to HDFS folder and tell hadoop that all data written to that
> folder shouldn't be replicated rather they should go directly to NFS.


You can certainly copy data in and out to NFS filestores without using 
HDFS; you can run tasks against NFS data without even running an HDFS 
filesystem. That is probably your best tactic. Trying to run HDFS on top 
of NFS is something that worries me; too many points of failure are 
being stacked up.
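
As a concrete illustration of running against non-HDFS data, a job can be
pointed straight at a file:// path that is NFS-mounted on every node; a rough
sketch (the paths are placeholders):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;

  public class NfsInputExample {
    public static void main(String[] args) {
      JobConf conf = new JobConf(NfsInputExample.class);
      // Read input straight from an NFS mount visible on every node, and
      // write output back to it -- no HDFS involved.
      FileInputFormat.setInputPaths(conf, new Path("file:///mnt/shared/input"));
      FileOutputFormat.setOutputPath(conf, new Path("file:///mnt/shared/output"));
      // ... set mapper/reducer (or the streaming equivalents) and submit as usual ...
    }
  }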





 


> 6.   Ironically none of us who uses cluster knows java, and most tasks
> are launched via streaming with C++ programs/perl scripts.  The problem is
> how to write/read files from HDFS in this context, we currently use things
> like   -moveFromLocal  but it doesn't seems to be right answer, because it
> slows things down a lot.
>
> 7.   On one of the DataCluster machines with run pretty large MySQL
> database, and just thinking whether it is possible to spread database across
> the cluster, has anyone tried that?

HBase

> 8.   Fuse-hdfs works great, but we really hope to be able to write to
> HDFS someday, how to enable it?

There is a patch in SVN_HEAD for a thrift API to HDFS; this is
accessible from C++ and perl





Re: datanodes in virtual networks.

2008-09-01 Thread Andrey Pankov
Hi, Dmitry!

Please take a look at the WebDAV server for HDFS. It already supports
read/write; more details at http://www.hadoop.iponweb.net/

On Mon, Sep 1, 2008 at 7:28 AM, Dmitry Pushkarev <[EMAIL PROTECTED]> wrote:
> Dear hadoop users,
>
>
>
> Our lab in slowly switching from SGE to hadoop, however not everything seems
> to be easy and obvious. We are in no way computer scientists, we're just
> physicists, biologist and couple of statisticians trying to solve our
> computational problems, please take this into consideration if questions
> will look to you obvious..
>
> Our setup:
>
> 1.   Data cluster - 4 Raided and Hadooped servers, with 2TB of storage
> each, they all have real IP addresses, one of them reserved for NameNode.
>
> 2.   Computational cluster:  100 dualcore servers running Sun Grid
> Engine, they live on virtual network (10.0.0.X) and can connect to outside
> world, but not accessible from out of the cluster. On these we don't have
> root access, and these are shared via SGE with other people, who get
> reasonably nervous when see idle reserved servers.
>
>
>
> Basic Idea is to create on-demand computational cluster,  which when needed
> will reserve servers from second cluster run jobs and let them go.
>
>
>
> Currently it is done via script that reserves server for namenode 25 servers
> for datanode copies data from first cluster, runs job, send result back and
> releases servers. I still want to make them work together using one
> namenode.
>
>
>
> After a week playing with hadoop I couldn't answer some of my question vie
> thorough RTFM, so I'd really appreciate is you can answer at least some of
> them in our context:
>
>
>
> 1.   Is it possible to connect servers from second cluster to first
> namenode? What worries me is implementation of data-transfer protocol,
> because some of the nodes cannot be reached but they can easily reach any
> other node.  Will hadoop try to establish connection both ways to transfer
> data between nodes?
>
>
>
> 2.   It is possible to specify "reliability" of the node, that is to
> make replica on the node with raid installed counts as two replicas as
> probability of failure is much lower.
>
>
>
> 3.   I also bumped into problems with decommissioning, after I add hosts
> to free to dfs.hosts.exclude file and refreshNodes, they are marked as
> "Decommission in progress" for days, even though data is removed from them
> within first several minutes. What I currently do is shoot them down with
> some delay, but I really hope to see "Decommissioned" one day. What am I
> probably doing wrong?
>
>
>
> 4.   The same question about dead hosts. I do a simple exercise: I
> create 20 datanodes on empty cluster, then I kill 15 of them and try to
> store a file on HDFS, hadoop fails because some nodes that it thinks "in
> service" aren't accessible. Is it possible to tell hadoop to remove these
> nodes from the list and do not try to store data on them? My current
> solution is hadoop-stop/start via cron every hour.
>
>
>
> 5.   We also have some external secure storage that can be accesses via
> NFS from fists DATA cluster,  and it'd be great if I could somehow mount
> this storage to HDFS folder and tell hadoop that all data written to that
> folder shouldn't be replicated rather they should go directly to NFS.
>
>
>
> 6.   Ironically none of us who uses cluster knows java, and most tasks
> are launched via streaming with C++ programs/perl scripts.  The problem is
> how to write/read files from HDFS in this context, we currently use things
> like   -moveFromLocal  but it doesn't seems to be right answer, because it
> slows things down a lot.
>
>
>
> 7.   On one of the DataCluster machines with run pretty large MySQL
> database, and just thinking whether it is possible to spread database across
> the cluster, has anyone tried that?
>
>
>
> 8.   Fuse-hdfs works great, but we really hope to be able to write to
> HDFS someday, how to enable it?
>
>
>
> 9.   And may be someone can point out where to look for ways to specify
> how to partition data for the map jobs, in some our tasks processing of one
> line of input file takes several minutes, currently we split these files to
> many one-line files and process them independently, but a simple
> streaming-compatible way to tell hadoop that for example we want each job to
> take 10 lines or to split the 10kb input file into 1 map tasks would
> help as a lot!
>
>
>
>
>
>
>
> Thanks in advance.
>
>
>
>
>
>



-- 
Andrey Pankov


Re: Hadoop 101

2008-09-01 Thread tim robertson
I suggest reading up on MapReduce first:
http://labs.google.com/papers/mapreduce-osdi04.pdf

Cheers



On Mon, Sep 1, 2008 at 11:27 AM, HHB <[EMAIL PROTECTED]> wrote:
>
> Hey,
> I'm reading about Hadoop lately but I'm unable to understand it.
> Would you please explain it to me in easy words?
> How to know if I can employ Hadoop in my current company?
> Thanks.
> --
> View this message in context: 
> http://www.nabble.com/Hadoop-101-tp19251524p19251524.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Hadoop 101

2008-09-01 Thread HHB

Hey,
I'm reading about Hadoop lately but I'm unable to understand it.
Would you please explain it to me in easy words?
How to know if I can employ Hadoop in my current company?
Thanks.
-- 
View this message in context: 
http://www.nabble.com/Hadoop-101-tp19251524p19251524.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



datanodes in virtual networks.

2008-09-01 Thread Dmitry Pushkarev
Dear hadoop users,

 

Our lab is slowly switching from SGE to hadoop; however, not everything seems
to be easy and obvious. We are in no way computer scientists, we're just
physicists, biologists and a couple of statisticians trying to solve our
computational problems, so please take this into consideration if the
questions look obvious to you.

Our setup:

1.   Data cluster - 4 RAIDed and Hadooped servers with 2TB of storage
each; they all have real IP addresses, and one of them is reserved for the
NameNode.

2.   Computational cluster: 100 dual-core servers running Sun Grid
Engine. They live on a virtual network (10.0.0.X) and can connect to the
outside world, but are not accessible from outside the cluster. On these we
don't have root access, and they are shared via SGE with other people, who
get reasonably nervous when they see idle reserved servers.

The basic idea is to create an on-demand computational cluster, which when
needed will reserve servers from the second cluster, run jobs and let them go.

Currently this is done via a script that reserves a server for the namenode
and 25 servers for datanodes, copies data from the first cluster, runs the
job, sends the results back and releases the servers. I still want to make
them work together using one namenode.

After a week playing with hadoop I couldn't answer some of my questions via
thorough RTFM, so I'd really appreciate it if you could answer at least some
of them in our context:

 

1.   Is it possible to connect servers from the second cluster to the first
namenode? What worries me is the implementation of the data-transfer
protocol, because some of the nodes cannot be reached but they can easily
reach any other node. Will hadoop try to establish connections both ways to
transfer data between nodes?

2.   Is it possible to specify the "reliability" of a node, that is, to
make a replica on a node with RAID installed count as two replicas, since
the probability of failure is much lower?

3.   I also bumped into problems with decommissioning: after I add the hosts
to free to the dfs.hosts.exclude file and run refreshNodes, they are marked
as "Decommission in progress" for days, even though the data is removed from
them within the first several minutes. What I currently do is shut them down
with some delay, but I really hope to see "Decommissioned" one day. What am I
probably doing wrong?

4.   The same question about dead hosts. I do a simple exercise: I
create 20 datanodes on an empty cluster, then I kill 15 of them and try to
store a file on HDFS; hadoop fails because some nodes that it thinks are "in
service" aren't accessible. Is it possible to tell hadoop to remove these
nodes from the list and not try to store data on them? My current
solution is hadoop-stop/start via cron every hour.

5.   We also have some external secure storage that can be accessed via
NFS from the first DATA cluster, and it'd be great if I could somehow mount
this storage to an HDFS folder and tell hadoop that all data written to that
folder shouldn't be replicated, but rather should go directly to NFS.

6.   Ironically, none of us who use the cluster knows Java, and most tasks
are launched via streaming with C++ programs/perl scripts. The problem is
how to write/read files from HDFS in this context; we currently use things
like -moveFromLocal, but it doesn't seem to be the right answer, because it
slows things down a lot.

7.   On one of the DataCluster machines we run a pretty large MySQL
database, and we're wondering whether it is possible to spread the database
across the cluster - has anyone tried that?

8.   Fuse-hdfs works great, but we really hope to be able to write to
HDFS someday - how can we enable that?

9.   And maybe someone can point out where to look for ways to specify
how to partition data for the map jobs. In some of our tasks, processing one
line of the input file takes several minutes; currently we split these files
into many one-line files and process them independently, but a simple
streaming-compatible way to tell hadoop that, for example, we want each job
to take 10 lines, or to split the 10kb input file into 1 map tasks, would
help us a lot!

 

 

 

Thanks in advance.