RE: Using hadoop as storage cluster?

2008-10-24 Thread David C. Kerber
Are there any tuning settings that can be adjusted to optimize for files of a 
given size range?

There would be quite a few files in the 100kB to 2MB range, which are
received and processed daily; a smaller number of files ranging up to
~600MB or so, which are summarizations of many of the daily data files;
and maybe a handful in the 1GB - 6GB range (disk images and database
backups, mostly).  There would also be a few (comparatively few, that
is) configuration files of a few kB each.
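For reference, the main size-related knobs HDFS exposes are the block
size and the replication factor, which can be set cluster-wide or per
file. A minimal sketch, assuming the 0.18-era property name
dfs.block.size and the FileSystem API; the paths and values below are
illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cluster-wide default block size in bytes (stock default: 64MB).
            conf.setLong("dfs.block.size", 64L * 1024 * 1024);
            FileSystem fs = FileSystem.get(conf);

            // Block size and replication can also be chosen per file at
            // create time, e.g. larger blocks for a big backup file:
            FSDataOutputStream out = fs.create(
                    new Path("/backups/disk-image.img"), // illustrative path
                    true,                 // overwrite
                    64 * 1024,            // io buffer size
                    (short) 3,            // replication
                    128L * 1024 * 1024);  // 128MB blocks for this one file
            out.close();
        }
    }

Note that every file, however small, still occupies at least one
block's metadata in the namenode's memory, which is why the many
100kB-2MB dailies are the part HDFS handles least efficiently.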

Thanks for the response; do you know of any other systems with similar 
functionality?

Dave



> -----Original Message-----
> From: Alex Loddengaard [mailto:[EMAIL PROTECTED]
> Sent: Friday, October 24, 2008 5:42 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Using hadoop as storage cluster?
>
> What files do you expect to be storing?  Generally speaking,
> HDFS (Hadoop's distributed file system) does not handle small
> files very efficiently.
> Instead it's optimized for large files, upwards of 64MB each.
>
> Alex
>
> On Fri, Oct 24, 2008 at 9:41 AM, David C. Kerber <
> [EMAIL PROTECTED]> wrote:
>
> > Hi -
> >
> > I'm a complete newbie to hadoop, and am wondering if it's appropriate
> > for configuring a bunch of older machines that have no other use, for
> > use as a storage cluster on an otherwise windows network, so that my
> > windows clients see their combined disk space as a single large share?
> >
> > If so, will I need additional software to let the windows clients see
> > them (like Samba does for a single machine)?  We don't have a lot of
> > linux experience in our office, but probably enough to get this going
> > if it's not too complex; mostly with Ubuntu and Fedora.
> >
> > If hadoop isn't well-suited to this use, or there is something better,
> > I'm open to suggestions
> >
> > Thanks
> > --
> > David Kerber
> > Warren Rogers Associates
> > (800)-972-7472 x-111
> > [EMAIL PROTECTED]
> > www.WarrenRogersAssociates.com
> > --
> >
>


Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Chris K Wensel


fyi, the src/contrib/ec2 scripts do just what Paco suggests, minus the
static IP stuff (you can use the scripts to log in via cluster name,
and spawn a tunnel for browsing nodes).

That is, you can spawn any number of uniquely named, configured, and
sized clusters, and you can increase their size independently as well.
(Shrinking is another matter altogether.)

ckw

On Oct 24, 2008, at 1:58 PM, Paco NATHAN wrote:


Hi Karl,

Rather than using separate key pairs, you can use EC2 security groups
to keep track of different clusters.

Effectively, that requires a new security group for every cluster --
so just allocate a bunch of different ones in a config file, then have
the launch scripts draw from those. We also use EC2 static IP
addresses and then have a DNS entry named similarly to each security
group, associated with a static IP once that cluster is launched.
It's relatively simple to query the running instances and collect them
according to security groups.

One way to handle detecting failures is just to attempt SSH in a loop.
Our rough estimate is that approximately 2% of the attempted EC2 nodes
fail at launch. So we allocate more than enough, given that rate.

In a nutshell, that's one approach for managing a Hadoop cluster
remotely on EC2.

Best,
Paco


On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson <[EMAIL PROTECTED]> wrote:


On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:


This workflow could be initiated from a crontab -- totally automated.

However, we still see occasional failures of the cluster, and must
restart manually, but not often.  Stability for that has improved much
since the 0.18 release.  For us, it's getting closer to total
automation.

FWIW, that's running on EC2 m1.xl instances.


Same here.  I've always had the namenode and web interface be
accessible, but sometimes I don't get the slave nodes - usually zero
slaves when this happens, sometimes I only miss one or two.  My rough
estimate is that this happens 1% of the time.

I currently have to notice this and restart manually.  Do you have a
good way to detect it?  I have several Hadoop clusters running at once
with the same AWS image and SSH keypair, so I can't count running
instances.  I could have a separate keypair per cluster and count
instances with that keypair, but I'd like to be able to start clusters
opportunistically, with more than one cluster doing the same kind of
job on different data.


Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra






--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



no combiner in Hadoop-streaming?

2008-10-24 Thread mark meissonnier
Hi,
When I look at the docs, there's

   -combiner JavaClassName

So does that mean there is no stdin/stdout way to use a combiner? Very
often the reducer and the combiner are the same, so it would be no
extra work on the programmer's part. Is there a reason why we can't
have -combiner mycombiner.py?

Thanks
Mark
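For contrast, the native Java API treats the combiner as just another
class, very often the reducer class itself -- which is what streaming's
-combiner JavaClassName slot reflects. A minimal sketch using stock
library classes (word-count style; the wiring shown is illustrative):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.LongSumReducer;
    import org.apache.hadoop.mapred.lib.TokenCountMapper;

    public class CombinerSketch {
        public static JobConf wire(JobConf conf) {
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(LongWritable.class);
            conf.setMapperClass(TokenCountMapper.class);
            // The combiner slot takes a Java class -- hence streaming's
            // "-combiner JavaClassName" -- and here the reducer class is
            // simply reused as the combiner:
            conf.setCombinerClass(LongSumReducer.class);
            conf.setReducerClass(LongSumReducer.class);
            return conf;
        }
    }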

Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Steve Loughran

Paco NATHAN wrote:

Hi Karl,

Rather than using separate key pairs, you can use EC2 security groups
to keep track of different clusters.

Effectively, that requires a new security group for every cluster --
so just allocate a bunch of different ones in a config file, then have
the launch scripts draw from those. We also use EC2 static IP
addresses and then have a DNS entry named similarly to each security
group, associated with a static IP once that cluster is launched.
It's relatively simple to query the running instances and collect them
according to security groups.

One way to handle detecting failures is just to attempt SSH in a loop.
Our rough estimate is that approximately 2% of the attempted EC2 nodes
fail at launch. So we allocate more than enough, given that rate.



We have a patch to add a ping() operation to all the services --
namenode, datanode, task tracker, job tracker -- with a proposal to
make it remotely visible: https://issues.apache.org/jira/browse/HADOOP-3969 .
You could hit every URL with an appropriately authenticated GET and see
if it is live.
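For instance, a minimal sketch of such a probe against the stock web UI
ports (50070 for the namenode, 50030 for the jobtracker); authentication
is omitted and the hostnames are illustrative:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HttpLivenessProbe {
        // True if the daemon's embedded web server answers with HTTP 200.
        static boolean isLive(String url) {
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(url).openConnection();
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);
                conn.setRequestMethod("GET");
                return conn.getResponseCode() == 200;
            } catch (Exception e) {
                return false;
            }
        }

        public static void main(String[] args) {
            System.out.println("namenode:   " + isLive("http://namenode:50070/"));
            System.out.println("jobtracker: " + isLive("http://jobtracker:50030/"));
        }
    }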



Another trick I like for long-haul health checks is to use Google Talk
and give every machine a login underneath (you can have them all in
your own domain via Google Apps). Then you can use XMPP to monitor
system health. A more advanced variant is to use it as the command
interface, which is very similar to how botnets work with IRC, though
in that case the botnet herders are trying to manage 500,000 0wned
boxes and don't want a reply.


Re: Using hadoop as storage cluster?

2008-10-24 Thread Alex Loddengaard
What files do you expect to be storing?  Generally speaking, HDFS (Hadoop's
distributed file system) does not handle small files very efficiently.
Instead it's optimized for large files, upwards of 64MB each.

Alex

On Fri, Oct 24, 2008 at 9:41 AM, David C. Kerber <
[EMAIL PROTECTED]> wrote:

> Hi -
>
> I'm a complete newbie to hadoop, and am wondering if it's appropriate for
> configuring a bunch of older machines that have no other use, for use as a
> storage cluster on an otherwise windows network, so that my windows clients
> see their combined disk space as a single large share?
>
> If so, will I need additional software to let the windows clients see them
> (like Samba does for a single machine)?  We don't have a lot of linux
> experience in our office, but probably enough to get this going if it's not
> too complex; mostly with Ubuntu and Fedora.
>
> If hadoop isn't well-suited to this use, or there is something better, I'm
> open to suggestions
>
> Thanks
> --
> David Kerber
> Warren Rogers Associates
> (800)-972-7472 x-111
> [EMAIL PROTECTED]
> www.WarrenRogersAssociates.com
> --
>


Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Paco NATHAN
Hi Karl,

Rather than using separate key pairs, you can use EC2 security groups
to keep track of different clusters.

Effectively, that requires a new security group for every cluster --
so just allocate a bunch of different ones in a config file, then have
the launch scripts draw from those. We also use EC2 static IP
addresses and then have a DNS entry named similarly to each security
group, associated with a static IP once that cluster is launched.
It's relatively simple to query the running instances and collect them
according to security groups.

One way to handle detecting failures is just to attempt SSH in a loop.
Our rough estimate is that approximately 2% of the attempted EC2 nodes
fail at launch. So we allocate more than enough, given that rate.
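For illustration, a minimal sketch of that SSH loop -- the hostnames,
timeouts, and retry budget are assumptions, and in practice the host
list would come from querying the cluster's security group:

    import java.util.Arrays;
    import java.util.List;

    public class SshLaunchProbe {
        // True if a trivial remote command succeeds within the retry budget.
        static boolean reachable(String host, int attempts) throws Exception {
            for (int i = 0; i < attempts; i++) {
                Process p = new ProcessBuilder(
                        "ssh", "-o", "ConnectTimeout=5",
                        "-o", "StrictHostKeyChecking=no",
                        host, "true").start();
                if (p.waitFor() == 0) return true;
                Thread.sleep(10000);  // wait before the next attempt
            }
            return false;
        }

        public static void main(String[] args) throws Exception {
            List<String> nodes = Arrays.asList("node1", "node2", "node3");
            for (String host : nodes) {
                if (!reachable(host, 6)) {
                    System.err.println(host + " never came up; replace it");
                }
            }
        }
    }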

In a nutshell, that's one approach for managing a Hadoop cluster
remotely on EC2.

Best,
Paco


On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson <[EMAIL PROTECTED]> wrote:
>
> On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:
>>
>> This workflow could be initiated from a crontab -- totally automated.
>> However, we still see occasional failures of the cluster, and must
>> restart manually, but not often.  Stability for that has improved much
>> since the 0.18 release.  For us, it's getting closer to total
>> automation.
>>
>> FWIW, that's running on EC2 m1.xl instances.
>
> Same here.  I've always had the namenode and web interface be accessible,
> but sometimes I don't get the slave nodes - usually zero slaves when this
> happens, sometimes I only miss one or two.  My rough estimate is that this
> happens 1% of the time.
>
> I currently have to notice this and restart manually.  Do you have a good
> way to detect it?  I have several Hadoop clusters running at once with the
> same AWS image and SSH keypair, so I can't count running instances.  I could
> have a separate keypair per cluster and count instances with that keypair,
> but I'd like to be able to start clusters opportunistically, with more than
> one cluster doing the same kind of job on different data.
>
>
> Karl Anderson
> [EMAIL PROTECTED]
> http://monkey.org/~kra
>
>
>
>


Re: LHadoop Server simple Hadoop input and output

2008-10-24 Thread Pete Wyckoff

Chukwa could also be used here.


On 10/24/08 11:47 AM, "Jeff Hammerbacher" <[EMAIL PROTECTED]> wrote:

Hey Edward,

The application we used at Facebook to transmit new data is open
source now and available at
http://sourceforge.net/projects/scribeserver/.

Later,
Jeff

On Fri, Oct 24, 2008 at 10:14 AM, Edward Capriolo <[EMAIL PROTECTED]> wrote:
> I came up with my line of thinking after reading this article:
>
> http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
>
> As a guy who was intrigued by the Java coffee cup in '95, and who now
> lives as a data center/NOC jock/Unix guy, let's say I look at a log
> management process from a data center perspective. I know:
>
> Syslog is a familiar model (human-readable UDP text)
> INETD/XINETD is a familiar model (programs that do amazing things with
> STDIN/STDOUT)
> Variety of hardware and software
>
> I may be supporting an older Solaris 8, Windows, or FreeBSD 5, for example.
>
> I want to be able to pipe an Apache custom log at HDFS, or forward
> syslog. That is where LHadoop (or something like it) would come into
> play.
>
> I am even thinking of accepting raw streams and having the server side
> use source-host/regex to determine which file the data should go to.
>
> I want to stay light on the client side. An application that tails log
> files and transmits new data is another component to develop and
> manage. Has anyone had experience with moving this type of data?
>




Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Karl Anderson


On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:


This workflow could be initiated from a crontab -- totally automated.
However, we still see occasional failures of the cluster, and must
restart manually, but not often.  Stability for that has improved much
since the 0.18 release.  For us, it's getting closer to total
automation.

FWIW, that's running on EC2 m1.xl instances.


Same here.  I've always had the namenode and web interface be
accessible, but sometimes I don't get the slave nodes - usually zero
slaves when this happens, sometimes I only miss one or two.  My rough
estimate is that this happens 1% of the time.

I currently have to notice this and restart manually.  Do you have a
good way to detect it?  I have several Hadoop clusters running at once
with the same AWS image and SSH keypair, so I can't count running
instances.  I could have a separate keypair per cluster and count
instances with that keypair, but I'd like to be able to start clusters
opportunistically, with more than one cluster doing the same kind of
job on different data.



Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra





Re: LHadoop Server simple Hadoop input and output

2008-10-24 Thread Jeff Hammerbacher
Hey Edward,

The application we used at Facebook to transmit new data is open
source now and available at
http://sourceforge.net/projects/scribeserver/.

Later,
Jeff

On Fri, Oct 24, 2008 at 10:14 AM, Edward Capriolo <[EMAIL PROTECTED]> wrote:
> I came up with my line of thinking after reading this article:
>
> http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
>
> As a guy who was intrigued by the Java coffee cup in '95, and who now
> lives as a data center/NOC jock/Unix guy, let's say I look at a log
> management process from a data center perspective. I know:
>
> Syslog is a familiar model (human-readable UDP text)
> INETD/XINETD is a familiar model (programs that do amazing things with
> STDIN/STDOUT)
> Variety of hardware and software
>
> I may be supporting an older Solaris 8, Windows, or FreeBSD 5, for example.
>
> I want to be able to pipe an Apache custom log at HDFS, or forward
> syslog. That is where LHadoop (or something like it) would come into
> play.
>
> I am even thinking of accepting raw streams and having the server side
> use source-host/regex to determine which file the data should go to.
>
> I want to stay light on the client side. An application that tails log
> files and transmits new data is another component to develop and
> manage. Has anyone had experience with moving this type of data?
>


Re: LHadoop Server simple Hadoop input and output

2008-10-24 Thread Edward Capriolo
I came up with my line of thinking after reading this article:

http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data

As a guy who was intrigued by the Java coffee cup in '95, and who now
lives as a data center/NOC jock/Unix guy, let's say I look at a log
management process from a data center perspective. I know:

Syslog is a familiar model (human-readable UDP text)
INETD/XINETD is a familiar model (programs that do amazing things with
STDIN/STDOUT)
Variety of hardware and software

I may be supporting an older Solaris 8, Windows, or FreeBSD 5, for example.

I want to be able to pipe an Apache custom log at HDFS, or forward
syslog. That is where LHadoop (or something like it) would come into
play.

I am even thinking of accepting raw streams and having the server side
use source-host/regex to determine which file the data should go to.

I want to stay light on the client side. An application that tails log
files and transmits new data is another component to develop and
manage. Has anyone had experience with moving this type of data?
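As a rough illustration of the server side described above, a minimal
single-threaded sketch: a TCP listener that streams each connection
into a per-source-host HDFS file. The port, path layout, and naming
scheme are illustrative assumptions:

    import java.io.InputStream;
    import java.net.ServerSocket;
    import java.net.Socket;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LogReceiver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads hadoop-site.xml
            FileSystem fs = FileSystem.get(conf);
            ServerSocket server = new ServerSocket(5140); // illustrative port

            // Handles one connection at a time; a real server would spawn
            // a thread per client.
            while (true) {
                Socket client = server.accept();
                String host = client.getInetAddress().getHostName();
                Path dest = new Path("/logs/" + host + "/"
                        + System.currentTimeMillis() + ".log");
                FSDataOutputStream out = fs.create(dest);
                InputStream in = client.getInputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                out.close();
                client.close();
            }
        }
    }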


Using hadoop as storage cluster?

2008-10-24 Thread David C. Kerber
Hi -

I'm a complete newbie to hadoop, and am wondering if it's appropriate for 
configuring a bunch of older machines that have no other use, for use as a 
storage cluster on an otherwise windows network, so that my windows clients see 
their combined disk space as a single large share?

If so, will I need additional software to let the windows clients see them 
(like Samba does for a single machine)?  We don't have a lot of linux 
experience in our office, but probably enough to get this going if it's not too 
complex; mostly with Ubuntu and Fedora.

If hadoop isn't well-suited to this use, or there is something better, I'm open 
to suggestions

Thanks
--
David Kerber
Warren Rogers Associates
(800)-972-7472 x-111
[EMAIL PROTECTED]
www.WarrenRogersAssociates.com
--


Re: Task Random Fail

2008-10-24 Thread Mice
What maximum numbers of mappers and reducers did you configure?
It seems your TaskRunner fails to get a response.
Maybe you need to try increasing "mapred.job.tracker.handler.count".
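That property sizes the JobTracker's RPC handler pool -- the threads
that answer exactly the statusUpdate calls timing out in the log below.
A minimal sketch of checking the effective value; the default of 10 is
an assumption based on the stock hadoop-default.xml:

    import org.apache.hadoop.conf.Configuration;

    public class HandlerCountCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration(); // reads hadoop-site.xml
            // The value is read at daemon startup, so a change in
            // hadoop-site.xml needs a JobTracker restart to take effect.
            System.out.println("mapred.job.tracker.handler.count = "
                    + conf.getInt("mapred.job.tracker.handler.count", 10));
        }
    }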

2008/10/22, Zhou, Yunqing <[EMAIL PROTECTED]>:
> Recently, tasks on our cluster have been failing at random (both map
> tasks and reduce tasks). When rerun, they are all OK.
> The whole job is an IO-bound job (250G input, 500G map output, and 10G
> final output).
> from the jobtracker, I can see the failed job says:
> task_200810220830_0004_m_000653_0 (tip_200810220830_0004_m_000653)
> on tracker vidi-005: FAILED
> java.io.IOException: Task process exit with nonzero status of 65. at
> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479) at
> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)
> and the log says (follow the link in the right-most column):
>
>  Task Logs: 'task_200810220830_0004_m_000653_0'
>
> *stdout logs*
>
> --
>
>
> *stderr logs*
>
> --
>
>
> *syslog logs*
>
> 2008-10-22 19:59:51,640 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> Initializing JVM Metrics with processName=MAP, sessionId=
> 2008-10-22 19:59:59,507 INFO org.apache.hadoop.mapred.MapTask:
> numReduceTasks: 26
> 2008-10-22 20:12:25,968 INFO org.apache.hadoop.mapred.TaskRunner:
> Communication exception: java.net.SocketTimeoutException: timed out
> waiting for rpc response
>   at org.apache.hadoop.ipc.Client.call(Client.java:559)
>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
>   at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
>   at org.apache.hadoop.mapred.Task$1.run(Task.java:316)
>   at java.lang.Thread.run(Thread.java:619)
>
> 2008-10-22 20:13:29,015 INFO org.apache.hadoop.mapred.TaskRunner:
> Communication exception: java.net.SocketTimeoutException: timed out
> waiting for rpc response
>   at org.apache.hadoop.ipc.Client.call(Client.java:559)
>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
>   at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
>   at org.apache.hadoop.mapred.Task$1.run(Task.java:316)
>   at java.lang.Thread.run(Thread.java:619)
>
> 2008-10-22 20:14:32,030 INFO org.apache.hadoop.mapred.TaskRunner:
> Communication exception: java.net.SocketTimeoutException: timed out
> waiting for rpc response
>   at org.apache.hadoop.ipc.Client.call(Client.java:559)
>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
>   at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
>   at org.apache.hadoop.mapred.Task$1.run(Task.java:316)
>   at java.lang.Thread.run(Thread.java:619)
>
> 2008-10-22 20:14:32,781 INFO org.apache.hadoop.mapred.TaskRunner:
> Process Thread Dump: Communication exception
> 9 active threads
> Thread 13 (Comm thread for task_200810220830_0004_m_000653_0):
>   State: RUNNABLE
>   Blocked count: 2
>   Waited count: 430
>   Stack:
> sun.management.ThreadImpl.getThreadInfo0(Native Method)
> sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
> sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>
> org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:114)
>
> org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:168)
> org.apache.hadoop.mapred.Task$1.run(Task.java:338)
> java.lang.Thread.run(Thread.java:619)
> Thread 12 ([EMAIL PROTECTED]):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 872
>   Stack:
> java.lang.Thread.sleep(Native Method)
> org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:763)
> java.lang.Thread.run(Thread.java:619)
> Thread 11 (IPC Client connection to hadoop5/192.168.4.105:9000):
>   State: WAITING
>   Blocked count: 0
>   Waited count: 2
>   Waiting on [EMAIL PROTECTED]
>   Stack:
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:485)
> org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:247)
> org.apache.hadoop.ipc.Client$Connection.run(Client.java:286)
> Thread 9 (IPC Client connection to /127.0.0.1:49078):
>   State: RUNNABLE
>   Blocked count: 5
>   Waited count: 214
>   Stack:
> sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
> sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
> sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
> sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>
> org.apache.hadoop.net.SocketIOWithTime

Re: LHadoop Server simple Hadoop input and output

2008-10-24 Thread Pete Wyckoff

Another way to do this is to make Thrift's Java compiler generate REST
bindings like its PHP compiler does; there are also libhdfs and
http://wiki.apache.org/hadoop/MountableHDFS


On 10/23/08 2:54 PM, "Edward Capriolo" <[EMAIL PROTECTED]> wrote:

I downloaded Thrift and ran the example applications after the Hive
meetup. It is very cool stuff. The thriftfs interface is more
elegant than what I was trying to do, and that implementation is more
complete.

Still, someone might be interested in what I did if they want a
super-light API :)

I will link to http://wiki.apache.org/hadoop/HDFS-APIs from my page so
people know the options.




Re: Seeking Someone to Review Hadoop Article

2008-10-24 Thread Tom Wheeler
Thanks to such a helpful community, I have had several offers to
review my article and won't need any more volunteers.

I'll post a link to the article when it's published next week.

On Thu, Oct 23, 2008 at 5:31 PM, Tom Wheeler <[EMAIL PROTECTED]> wrote:
> Each month the developers at my company write a short article about a
> Java technology we find exciting. I've just finished one about Hadoop
> for November and am seeking a volunteer knowledgeable about Hadoop to
> look it over to help ensure it's both clear and technically accurate.
>...



-- 
Tom Wheeler
http://www.tomwheeler.com/


Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Steve Loughran

Paco NATHAN wrote:

What seems to be emerging here is a pattern for another special node
associated with a Hadoop cluster.

The need is to have a machine which can:
   * handle setup and shutdown of a Hadoop cluster on remote server resources
   * manage loading and retrieving data via a storage grid
   * interact and synchronize events via a message broker
   * capture telemetry (logging, exceptions, job/task stats) from the
remote cluster

On our team, one of the engineers named it a CloudController, as
distinct from JobTracker and NameNode.

In the discussion here, the CloudController pattern derives from
services provided by AWS. However, it could just as easily be mapped
to other elastic services for servers / storage buckets / message
queues -- based on other vendors, company data centers, etc.


It really depends on how the other infrastructures allocate their
machines. FWIW, the EC2 APIs, while simple, are fairly limited: you
don't get to spec out your topology, or hint at the data you want to
work with.




This pattern has come up in several other discussions I've had with
other companies making large use of Hadoop. We're generally trying to
address these issues:
   * long-running batch jobs and how to manage complex workflows for them
   * managing trade-offs between using a cloud provider (AWS,
Flexiscale, AppNexus, etc.) and using company data centers
   * managing trade-offs between cluster size vs. batch window time
vs. total cost

Our team chose to implement this functionality using Python scripts --
replacing the shell scripts. That makes it easier to handle the many
potential exceptions of leasing remote elastic resources.  FWIW, our
team is also moving these CloudController scripts to run under
RightScale, to manage AWS resources more effectively -- especially the
logging after node failures.

What do you think of having an optional CloudController added to the
definition of a Hadoop cluster?


It's very much a management problem.

If you look at   https://issues.apache.org/jira/browse/HADOOP-3628
you can see the hadoop services slowly getting the ability to be 
started/stopped more easily; then we need to work on the configuration 
to remove the need to push out XML files to every node to change behaviour.


As a result, we can bring up clusters with different configurations on 
existing, VMWare allocated or remote (EC2) farms. I say that, with the 
caveat that I haven't been playing with EC2 recently on account of 
having more local machines to hand, including a new laptop with enough 
cores to act like its own mini-cluster.


slideware:
http://people.apache.org/~stevel/slides/deploying_on_ec2.pdf
http://people.apache.org/~stevel/slides/deploying_hadoop_with_smartfrog.pdf

I'm putting in for an apachecon eu talk on the topic, "Dynamic Hadoop 
Clusters".


--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


HELP: Namenode Startup Failed with an OutofMemoryError

2008-10-24 Thread Yang Zhou
Hi everyone,

I have a problem about Hadoop startup.

I failed to startup the namenode and I got the following exception in the
namenode log file:
2008-10-23 21:54:51,223 INFO org.mortbay.http.SocketListener: Started
SocketListener on 0.0.0.0:50070
2008-10-23 21:54:51,224 INFO org.mortbay.util.Container: Started
[EMAIL PROTECTED]
2008-10-23 21:54:51,224 INFO org.apache.hadoop.fs.FSNamesystem: Web-server
up at: 0.0.0.0:50070
2008-10-23 21:54:51,224 INFO org.apache.hadoop.ipc.Server: IPC Server
Responder: starting
2008-10-23 21:54:51,226 INFO org.apache.hadoop.ipc.Server: IPC Server
listener on 58310: starting
2008-10-23 21:54:51,227 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 58310: starting
2008-10-23 21:54:51,227 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 58310: starting
2008-10-23 21:54:51,227 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 58310: starting
2008-10-23 21:54:51,227 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 3 on 58310: starting
2008-10-23 21:54:51,228 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 58310: starting
2008-10-23 21:54:51,232 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 5 on 58310: starting
2008-10-23 21:54:51,232 ERROR org.apache.hadoop.dfs.NameNode:
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:597)
at org.apache.hadoop.ipc.Server.start(Server.java:991)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:149)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:193)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:179)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)

How can I fix it? Does it mean that my machine doesn't have enough memory
for Hadoop startup?
Any help will be appreciated!
Thanks in advance.

-Woody


Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Paco NATHAN
What seems to be emerging here is a pattern for another special node
associated with a Hadoop cluster.

The need is to have a machine which can:
   * handle setup and shutdown of a Hadoop cluster on remote server resources
   * manage loading and retrieving data via a storage grid
   * interact and synchronize events via a message broker
   * capture telemetry (logging, exceptions, job/task stats) from the
remote cluster

On our team, one of the engineers named it a CloudController, as
distinct from JobTracker and NameNode.

In the discussion here, the CloudController pattern derives from
services provided by AWS. However, it could just as easily be mapped
to other elastic services for servers / storage buckets / message
queues -- based on other vendors, company data centers, etc.

This pattern has come up in several other discussions I've had with
other companies making large use of Hadoop. We're generally trying to
address these issues:
   * long-running batch jobs and how to manage complex workflows for them
   * managing trade-offs between using a cloud provider (AWS,
Flexiscale, AppNexus, etc.) and using company data centers
   * managing trade-offs between cluster size vs. batch window time
vs. total cost

Our team chose to implement this functionality using Python scripts --
replacing the shell scripts. That makes it easier to handle the many
potential exceptions of leasing remote elastic resources.  FWIW, our
team is also moving these CloudController scripts to run under
RightScale, to manage AWS resources more effectively -- especially the
logging after node failures.

What do you think of having an optional CloudController added to the
definition of a Hadoop cluster?

Paco

On Thu, Oct 23, 2008 at 10:52 AM, Chris K Wensel <[EMAIL PROTECTED]> wrote:
> Hey Stuart
>
> I did that for a client using Cascading events and SQS.
>
> When jobs completed, they dropped a message on SQS where a listener picked
> up new jobs and ran with them, or decided to kill off the cluster. The
> currently shipping EC2 scripts are suitable for having multiple simultaneous
> clusters for this purpose.
>
> Cascading has always supported, and now Hadoop supports (thanks Tom), raw
> file access on S3, so this is quite natural. This is the best approach, as
> data is pulled directly into the Mapper instead of onto HDFS first and then
> read into the Mapper from HDFS.
>
> YMMV
>
> chris
>
> On Oct 23, 2008, at 7:47 AM, Stuart Sierra wrote:
>
>> Hi folks,
>> Anybody tried scripting Hadoop on EC2 to...
>> 1. Launch a cluster
>> 2. Pull data from S3
>> 3. Run a job
>> 4. Copy results to S3
>> 5. Terminate the cluster
>> ... without any user interaction?
>>
>> -Stuart
>
> --
> Chris K Wensel
> [EMAIL PROTECTED]
> http://chris.wensel.net/
> http://www.cascading.org/
>
>
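Putting Stuart's five steps together, a rough lifecycle sketch that
shells out to the contrib/ec2 scripts. The launch-cluster /
terminate-cluster command names follow the scripts' documented usage;
the working directory, cluster name, size, and the stubbed job step are
illustrative assumptions:

    import java.io.File;
    import java.io.InputStream;

    public class ClusterLifecycle {
        static int run(String... cmd) throws Exception {
            ProcessBuilder pb = new ProcessBuilder(cmd);
            pb.directory(new File("/opt/hadoop/src/contrib/ec2")); // illustrative
            pb.redirectErrorStream(true);
            Process p = pb.start();
            // Drain output so the child can't block on a full pipe.
            InputStream in = p.getInputStream();
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) { /* discard */ }
            return p.waitFor();
        }

        public static void main(String[] args) throws Exception {
            String cluster = "nightly-batch"; // illustrative name
            try {
                run("bin/hadoop-ec2", "launch-cluster", cluster, "20");
                // Steps 2-4 -- pull from S3, run the job, push results
                // back -- would go here, e.g. ssh'ing to the master and
                // running a job that reads s3:// paths directly so no
                // HDFS staging step is needed.
            } finally {
                // Step 5: always tear the cluster down, even on failure.
                // NB: the shipping script may prompt for confirmation.
                run("bin/hadoop-ec2", "terminate-cluster", cluster);
            }
        }
    }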


Re: HELP: Namenode Startup Failed with an OutofMemoryError

2008-10-24 Thread Steve Loughran

woody zhou wrote:

Hi everyone,

I have a problem about Hadoop startup.

I failed to startup the namenode and I got the following exception in the
namenode log file:



2008-10-23 21:54:51,232 ERROR org.apache.hadoop.dfs.NameNode:
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:597)
at org.apache.hadoop.ipc.Server.start(Server.java:991)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:149)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:193)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:179)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)

How can I fix it? Does it mean that my machine doesn't have enough memory
for Hadoop startup?
Any help will be appreciated!
Thanks in advance.

- Woody


Not memory; it's an OS-level limit on the number of threads a process
can have, which is a mixture of per-thread physical and virtual memory
allocation and any limits coded into the kernel.


Search the web for the string "unable to create new native thread" and
you will find more details and workarounds; include the OS you run on
to find some OS-specific workarounds. For Hadoop, I'd consider
throttling back the number of helper threads:


1. Have a look at the value of dfs.namenode.handler.count and set it to 
something lower


2. You can use kill -QUIT to get a dump of all threads in your process;
this lets you see how many you have, and where they are.
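From inside a JVM, the same information is also available
programmatically; a minimal sketch using Hadoop's own helper (the same
ReflectionUtils.printThreadInfo visible in the task thread dumps in the
Task Random Fail thread above):

    import java.io.PrintWriter;
    import org.apache.hadoop.util.ReflectionUtils;

    public class ThreadDump {
        public static void main(String[] args) {
            PrintWriter out = new PrintWriter(System.out);
            // Prints the state, blocked/waited counts, and stack of every
            // live thread in this JVM -- an in-process equivalent of the
            // kill -QUIT dump.
            ReflectionUtils.printThreadInfo(out, "manual dump");
            out.flush();
        }
    }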