Re: java.io.IOException: Function not implemented

2010-03-31 Thread Steve Loughran

Edson Ramiro wrote:

Maybe it's a bug.

I'm not the admin. :(

So I'll talk to him, and maybe he'll install 2.6.32.9 on another node to
test. :)

Thanks

Edson Ramiro


On 30 March 2010 20:00, Todd Lipcon  wrote:


Hi Edson,

I noticed that only the h01 nodes are running 2.6.32.9, the other broken
DNs
are 2.6.32.10.

Is there some reason you are running a kernel that is literally 2 weeks
old?
I wouldn't be at all surprised if there were a bug here, or some issue with
your Debian "unstable" distribution...



If you are running the SCM trunk of the OS, you are part of the dev 
team. They will be grateful for the bugs you find and fix, but you get 
to find and fix them. In Ant, one bug report was that <touch> stopped setting 
dates in the past; it turned out that on the Debian nightly builds, you 
couldn't touch any file into the past...


-steve





Re: Single datanode setup

2010-03-30 Thread Steve Loughran

Ed Mazur wrote:

Hi,

I have a 12 node cluster where instead of running a DN on each compute
node, I'm running just one DN backed by a large RAID (with a
dfs.replication of 1). The compute node storage is limited, so the
idea behind this was to free up more space for intermediate job data.
So the cluster has that one node with the DN, a master node with the
JT/NN, and 10 compute nodes each with a TT. I am running 0.20.1+169.68
from Cloudera.

The problem is that MR job performance is now worse than when using a
traditional HDFS setup. A job that took 76 minutes before now takes
169 minutes. I've used this single DN setup before on a
similarly-sized cluster without any problems, so what can I do to find
the bottleneck?


I wouldn't use HDFS in this situation. Your network will be the 
bottleneck. If you have a SAN, a high-end filesystem and/or a fast network, 
just use file:// URLs and let the underlying OS/network handle it. I 
know people who use alternate filesystems this way. Side benefit: the NN 
is no longer an SPOF. Just your storage array. But they never fail, right?


Having a single DN and NN is a waste of effort here. There's no 
locality, no replication, so no need for the replication and locality 
features of HDFS. Try mounting the filestore everywhere with NFS (or 
other protocol of choice), and skip HDFS entirely.
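As a rough illustration, a minimal sketch of the file:// approach, assuming the 
old org.apache.hadoop.mapred API; the /mnt/shared mount point and the MyJob 
class are hypothetical, and in practice you would normally set fs.default.name 
in core-site.xml rather than in code:

  JobConf job = new JobConf(MyJob.class);   // MyJob is illustrative only
  // default filesystem is the locally mounted (NFS/SAN) store, not HDFS
  job.set("fs.default.name", "file:///");
  FileInputFormat.setInputPaths(job, new Path("/mnt/shared/input"));
  FileOutputFormat.setOutputPath(job, new Path("/mnt/shared/output"));
  JobClient.runJob(job);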


-Steve



Re: why does 'jps' lose track of hadoop processes ?

2010-03-30 Thread Steve Loughran

Marcos Medrado Rubinelli wrote:
jps gets its information from the files stored under /tmp/hsperfdata_*, 
so when a cron job clears your /tmp directory, it also erases these 
files. You can submit jobs as long as your jobtracker and namenode are 
responding to requests over TCP, though.


I never knew that.

ps -ef | grep java works quite well; jps has fairly steep startup costs, 
and if a JVM is playing up, jps can hang too.




Re: java.io.IOException: Function not implemented

2010-03-30 Thread Steve Loughran

Edson Ramiro wrote:

I'm not involved with Debian community :(


I think you are now...


Re: hadoop conf for dynamically changing ips

2010-03-26 Thread Steve Loughran

On 26/03/2010 16:39, Gokulakannan M wrote:


It's what I do. You just have to make sure that if the IPAddrs change,

everything gets restarted.



Thanks Steve for the reply. "everything must be restarted" means the
hadoop cluster or all the systems.


Just the JVMs. All JVMs cache mapped IP addresses forever, which is very annoying.

http://issues.apache.org/jira/browse/HDFS-34

Dyndns and the like don't get a look-in. The caching is there for applet 
security, but it interferes with long-lived apps. Another reason to automate 
restarting your apps regularly.
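For what it's worth, the JVM-level knob behind this is the 
networkaddress.cache.ttl security property; a minimal sketch of relaxing it 
(the 60-second value is only an example, it must run before the first lookup, 
and the same property can instead be set in the JRE's java.security file):

  import java.security.Security;

  public class DnsCacheTtl {
    public static void main(String[] args) {
      // cache successful lookups for 60s rather than for the life of the JVM
      Security.setProperty("networkaddress.cache.ttl", "60");
      // cache failed lookups for 10s
      Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }
  }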




Re: hadoop conf for dynamically changing ips

2010-03-26 Thread Steve Loughran

On 26/03/2010 14:31, Gokulakannan M wrote:



  Hi,



I have a LAN in which the IPs of the machines will be changed
dynamically by the DHCP server.



 So for namenode, jobtracker, master and slave configurations we
could not give the IP.

Can the machine names be given for those configurations? Will
that work?



It's what I do. You just have to make sure that if the IPAddrs change, 
everything gets restarted.




Re: JobTracker startup failure when starting hadoop-0.20.0 cluster on Amazon EC2 with contrib/ec2 scripts

2010-03-25 Thread Steve Loughran

毛宏 wrote:

I downloaded Hadoop 0.20.0 and used the src/contrib/ec2/bin scripts to
launch a Hadoop cluster on Amazon EC2, after building a new Hadoop
0.20.0 AMI. 


I launched an instance with my new Hadoop 0.20.0 AMI, then logged in and
ran the following to launch a new cluster:
root(/vol/hadoop-0.20.0)> bin/launch-hadoop-cluster hadoop-test 2

After the usual EC2 wait, one master and two slave instances were
launched on EC2, as expected. When I ssh'ed into the instances, here is
what I found:

Slaves: DataNode and NameNode are running
Master: Only NameNode is running

I could use HDFS commands (using $HADOOP_HOME/bin/hadoop scripts)
without any problems, from both master and slaves. However, since
JobTracker is not running, I cannot run map-reduce jobs.



2009-09-03 18:55:48,628 INFO org.apache.hadoop.hdfs.DFSClient:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/mnt/hadoop/mapred/system/jobtracker.info could only be replicated to 0
nodes, 
instead of 1

at




2009-09-03 18:55:48,628 WARN org.apache.hadoop.hdfs.DFSClient:
NotReplicatedYetException sleeping
/mnt/hadoop/mapred/system/jobtracker.info retries left 4
2009-09-03 18:55:49,030 INFO org.apache.hadoop.hdfs.DFSClient:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/mnt/hadoop/mapred/system/jobtracker.info could only be replicated to 0
nodes, 
instead of 1


The JT isn't up because the datanodes aren't taking data; the JT spins 
waiting for files to be writable so it can save its state.


I cheat in my clusters by running a (small) datanode in the root VM, so 
the JT will come up without needing any more datanodes.


Check the DN/HDFS status in more detail; that looks like the first problem.


Re: a question about tasktracker in hadoop

2010-03-22 Thread Steve Loughran

毛宏 wrote:

I read from 《Towards Optimizing Hadoop Provisioning in the Cloud 》
saying that "mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum respectively set the maximum
number of parallel mappers and reducers that can run on a Hadoop
slave".   


It means that a tasktracker in Hadoop is running more than one map-task
or reduce-task, and may be a tasktracker is running M map-tasks and R
reduce-tasks. Is that right?   



yes.

There's been some discussion on having more dynamic reporting of 
availability

https://issues.apache.org/jira/browse/MAPREDUCE-1603

You are welcome to implement this feature, with tests and documentation.



Re: Distributed hadoop setup 0 live datanode problem in cluster

2010-03-22 Thread Steve Loughran

William Kang wrote:

Hi Jeff,
I think I partly found out the reasons of this problem. The /etc/hosts
127.0.0.1 has the master's host name in it. And the namenode took 127.0.0.1
as the ip address of the namenode. I fixed it and I already found two nodes.
There is one still missing. I will let you guys know what happened.
Thanks.



Ongoing issue with Ubuntu. It's a feature designed to help laptops and the 
like, but useless when you want to bring up servers that are visible from the 
outside and addressed by hostname, which is what Hadoop and things like 
Java RMI expect.


see http://linux.derkeiler.com/Mailing-Lists/Ubuntu/2007-08/msg00681.html
http://ubuntuforums.org/showthread.php?t=432875
https://lists.ubuntu.com/archives/ubuntu-users/2008-December/168883.html


I've debated fixing this at a pre-Hadoop level in my code by checking 
the hostname and bailing out early if it maps to ::1 or 127.*.*.1, 
because neither is useful off the specific host:

http://jira.smartfrog.org/jira/browse/SFOS-1184
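A minimal sketch of that kind of pre-flight check (not the actual SmartFrog 
code, just the idea):

  import java.net.InetAddress;

  public class HostnameSanityCheck {
    public static void main(String[] args) throws Exception {
      InetAddress addr = InetAddress.getLocalHost();
      if (addr.isLoopbackAddress()) {
        System.err.println("Hostname " + addr.getHostName() + " resolves to "
            + addr.getHostAddress() + "; fix /etc/hosts before starting the daemons");
        System.exit(1);  // bail out early rather than bind the services to loopback
      }
    }
  }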


Something could go into Hadoop directly, but either way you have the 
problem that hostname/DNS issues can be fairly tricky to identify in a 
timely manner, let alone handle.


-steve


Re: How namenode/datanode and jobtracker/tasktracker negotiate connection?

2010-03-09 Thread Steve Loughran

jiang licht wrote:

What are the exact packets and steps used to establish a namenode/datanode 
connection and  jobtracker/tasktracker connection?

I am asking this due to a weird problem related to starting datanodes and tasktrackers. 


In my case, the namenode box has 2 ethernet interfaces combined as bond0 
interface with IP address of IP_A and there is an IP alias IP_B for local 
loopback interface as lo:1. All slave boxes sit on the same network segment as 
IP_B.

The network is configured such that no slave box can reach namenode box at IP_A but namenode box can reach slave 
boxes (clearly can only routed from bond0). So, slave boxes always use "hdfs://IP_B:50001" as 
"fs.default.name" in "core-site.xml" and use IP_B:50002" for job tracker in 
mapred-site.xml to reach namenode box.

There are the following 2 cases how namenode (or jobtracker) is configured on 
namenode box.

Case #1: If I set "fs.default.name" to "hdfs://IP_B:50001", no slave boxes can join the cluster as data nodes 
because the request to IP_B:50001 failed. "telnet IP_B 50001" on slave boxes resulted in connection refused. So, on 
namenode box, I fired "tcpdump -i bond0 tcp port 50001" and then from a slave box did a "telnet IP_B 5001" 
and watched for incoming and outgoing packets on namenode box.

Case #2: If I set "fs.default.name" to "hdfs://IP_A:50001", slave boxes can 
join the cluster as data nodes. And I did the same thing to use tcpdump and telnet to watch the 
traffic. I compared these two cases and found some difference in the traffic. So, I want to know if 
there is a hand-shaking stage for namenode and datanode to establish a connection and what are the 
packets for this purpose so that I can figure out if packets exchanged in case #1 are correct or 
not, which may reveal why the connection request from data node to name node fails.

Also in Case #2, although all slave boxes can join the cluster as datanodes, no slave box 
can start as a tasktracker because at the beginning of starting a tasktracker, the 
tasktracker box uses IP_A:50001 to request connection to namenode and as mentioned above 
(slaves are not allowed to reach namenode at IP_A but reverse direction is ok), this 
cannot be done. But my confusion here is that on all slave boxes 
"fs.default.name" is set to use IP_B:50001, how come it ended up with 
contacting the namenode with IP_A:50001?

A bit complicated. But any thoughts?



The NN listens on the interface given by the IP address of its hostname; it 
does not like people connecting to it using a different hostname than 
the one it is bound to (irritating; something to fix).


It sounds like you have DNS problems. You should have a consistent 
hostname <--> IP address mapping across the entire cluster, but the 
issues you describe indicate this may not be the case.
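A minimal sketch of the kind of consistency check worth running on every box 
(illustrative only, nothing Hadoop-specific):

  import java.net.InetAddress;

  public class DnsRoundTrip {
    public static void main(String[] args) throws Exception {
      InetAddress local = InetAddress.getLocalHost();
      String name = local.getCanonicalHostName();
      String addr = local.getHostAddress();
      // reverse-resolve the address and compare it with the forward lookup
      String reverse = InetAddress.getByName(addr).getCanonicalHostName();
      System.out.println(name + " -> " + addr + " -> " + reverse);
      if (!name.equalsIgnoreCase(reverse)) {
        System.out.println("WARNING: forward and reverse DNS disagree on this host");
      }
    }
  }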




Re: I got java crash error when writing SequenceFile

2010-03-08 Thread Steve Loughran

forbbs forbbs wrote:

*I just ran a simple program, and got the error below:*

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGFPE (0x8) at pc=0x0030bda07927, pid=22891, tid=1076017504
#
# JRE version: 6.0_18-b07
# Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode 
linux-amd64 )

# Problematic frame:
# C  [ld-linux-x86-64.so.2+0x7927]
#
# An error report file with more information is saved as:
# /home1/wwp/amy/SDB/hs_err_pid22891.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#


1. File the bug as Sun asks.
2. Roll back to the 6u15 JVM; it appears to be better than the u18 release.


Re: Failed to set setXIncludeAware(true) for parser

2010-03-08 Thread Steve Loughran

Ted Yu wrote:

Hi,
We use nutch 1.0
In nutch, we define the following according to
http://issues.apache.org/jira/browse/HADOOP-5254:

NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.dir=$NUTCH_LOG_DIR
-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl"

But we still see:
ERROR org.apache.hadoop.conf.Configuration: Failed to set
setXIncludeAware(true) for parser
org.apache.xerces.jaxp.documentbuilderfactoryi...@17fe1feb:java.lang.UnsupportedOperationException:
This parser does not support specification "null" version "null"
java.lang.UnsupportedOperationException: This parser does not support
specification "null" version "null"

Can someone provide hint why the above error still appears ?

Here is the command line (I symlinked jdk's rt.jar as 1rt.jar which appears
before xerces-2_6_2.jar in the classpath):


Not sure that helps, as it complicates the factory setup.

Try

ant -diagnostics, and post the results here; it does some XML parser 
diagnostics work.




Re: Tools to automatically setup new slaves (PXE boot?)

2010-03-03 Thread Steve Loughran

sagar_shukla wrote:

Hi Paul,
Alternative to PXE boot is - do the installation on a single box, create a 
ghost image of that setup and then replicate ghost image on new servers.

PXE boot is mainly used when you do not have physical access to the servers and 
installation needs to be done remotely. If you have physical access to the 
servers then images can be created through ghost.



No, that way leads to madness.

With PXE and config management tools you can go from clean metal to 
managed machine in a bounded period of time, sure that the state of your 
machine is a function of (packages, config-changes). You can also be sure that 
the state of all the machines' root filesystems is defined by your 
tooling. The big problem becomes the network load when you restart 2000 
nodes after a UPS failure, for which workarounds exist.


Initial costs are high, but the cost of adding new machines should be zero, 
and diagnosing problems is fairly simple as every node looks the same.


Whenever you ghost a machine, you fork that machine image. The state of 
every machine will vary, and is a function of (original state, 
ghosting-changes, time). How do you keep them up to date? By hand? Do 
you tell them to auto-update? But on what schedule? What if a reboot is 
required?


Now, what happens when there's a problem? Do you know what state every 
machine in the datacentre is in? How do you diagnose faults if one machine 
is playing up? How do you reset that one to the "gold" image and then 
bring it up to date?


Next problem: updating your binaries. How? By hand? By creating a new 
machine and ghosting it to every machine again?


Like I said, madness.

It's easier on virtualised systems if you only have one machine image 
that is transient: you can bring it up by hand, update it, and let the 
infrastructure apply per-machine diffs (hostname, weird things in 
domained Windows boxes). Even then, the phrase "by hand" is a warning 
sign. Me, I have Hudson creating RPMs and scp-ing them to machines that 
then copy them out to virtual disk images which are mounted 
read-only on the various VMs; these disk images do the late-binding 
installation of my RPM-packaged JARs (including Hadoop), so I don't have 
to worry about the fact that I may want to push out three updates in a 
single day. Because once you move to a VM-on-demand infrastructure 
with an API, building machine images by hand is like typing LINK on the 
command line as you build your C++ program.


-steve


Re: Tools to automatically setup new slaves (PXE boot?)

2010-03-03 Thread Steve Loughran

Edward Capriolo wrote:



In a redhat environment PXE+ KICKSTART is a great way to go. You can
get nice fast consistent builds automatically. The post section of a
kickstart allows you to run scripts or install RPM's.

There are some tools like Cobbler that give you some nice PXE install
management, and tools like CFEngine can manage your system after that.


There's some stuff to make pxe/kickstart easier, like : 
http://linuxcoe.sourceforge.net/


I don't have any config in the installation, so I can use the same machine 
image for any node type; I push out the config later so that the nodes become 
what they are told to be, bonded to the other bits of the cluster.


Re: Hadoop as master's thesis

2010-03-02 Thread Steve Loughran

Song Liu wrote:

Hi Tonci, actually I am doing a Master's thesis developing algorithms
on Hadoop.

My project is to extend algorithms into the MapReduce fashion and to discover
whether there is an optimal choice. Most of them belong to the Machine
Learning area. Personally, I think this is a fresh area, and if you search
the main academic databases, you will find little literature about this.

I recently made a proposal about my study on Hadoop, and I would like to
discuss this with you in depth if you wish.

Another interesting topic is to discover the limits of Hadoop. We have a very
large cluster at a very high rank in the TOP500, so I'm wondering whether
Hadoop can perform as we expect.



A lot of the big clusters have premium network infrastructure and SAN 
mounted storage whose access times are independent of location. 
MapReduce is designed to work on lower-cost storage/network 
infrastructure, saving money there that you can spend on more servers 
and storage. But it does require algorithms to work on local data only, 
or the LAN becomes a bottleneck, fast.





Re: Hadoop as master's thesis

2010-03-02 Thread Steve Loughran

Matteo Nasi wrote:

Hi all,
I just completed my first level of university degree at Politecnico di
Milano Italy (Computer Science Engineering) with a thesis on Hadoop: "log
analysis in the cloud", using and comparing custom log analysis script on
local private cluster (8 nodes of old computers) and AWS EMR hadoop
implementation. I wrote scripts in Pig and Hive and collected results into a
custom web interface.


Stick it up online and we will link to it from the Hadoop wiki




Re: Sun JVM 1.6.0u18

2010-03-02 Thread Steve Loughran

Allen Wittenauer wrote:



On 3/1/10 7:24 AM, "Edward Capriolo"  wrote:

u14 added support for the 64bit compressed memory pointers which
seemed important due to the fact that hadoop can be memory hungry. u15
has been stable in our deployments. Not saying you should not go
newer, but I would not go older then u14.


How are the compressed memory pointers working for you?  I've been debating
turning them on here, so real world experience would be useful from those
that have taken plunge.



I used JRockit for a long time, which has had compressed pointers on for ages. 
Some Hadoop problems, filed as bug reports. One of the funniest is that JRockit 
stacks can get way bigger than the Sun JVM's stacks, so some functional tests 
of mine that recursed and expected to OOM instead timed out.


On the Sun JVMs, I haven't used them at "datacentre scale" but found memory 
savings in everything from IDEs up.


It'd be nice if there were an easy way to turn it on as the default for 
everything, without faffing around with app-specific options.


Re: Hadoop as master's thesis

2010-03-01 Thread Steve Loughran

Tonci Buljan wrote:

Hello everyone,

 I'm thinking of using Hadoop as a subject in my master's thesis in Computer
Science. I'm supposed to solve some kind of a problem with Hadoop, but can't
think of any :)).


well, you need some interesting data, then mine it. So ask around. 
Physicists often have stuff.


Re: problem building trunk

2010-03-01 Thread Steve Loughran

Massoud Mazar wrote:

Thanks. I followed steps at http://wiki.apache.org/hadoop/GitAndHadoop and was 
able to do a successful build.


Those are my ongoing notes; feel free to correct them. I am still trying 
to understand Git well enough to be able to use it.


One problem is that the POM files created still include the -SNAPSHOT 
tag of their dependencies (so hadoop-mapreduce declares a dependency on 
-SNAPSHOT of -hdfs and -core), even if you give everything version 
numbers. I work around this in my system by not checking the POM 
files into SCM for the other bits of my build (the ones that take the 
version-tagged Hadoop releases I make).





Re: Sun JVM 1.6.0u18

2010-03-01 Thread Steve Loughran

Todd Lipcon wrote:

On Thu, Feb 25, 2010 at 11:09 AM, Scott Carey wrote:





I have found some notes that suggest that "-XX:-ReduceInitialCardMarks"
will work around some known crash problems with 6u18, but that may be
unrelated.



Yep, I think that is probably a likely workaround as well. For now I'm
recommending downgrade to our clients, rather than introducing cryptic XX
flags :)



Lots of bug reports show up once you search for ReduceInitialCardMarks.

Looks like a feature has been turned on :
http://bugs.sun.com/view_bug.do?bug_id=6889757

and now it is in wide-beta-test

http://bugs.sun.com/view_bug.do?bug_id=698
http://permalink.gmane.org/gmane.comp.lang.scala/19228

Looks like the root cause is a new garbage collector, one that is still 
settling down. The ReduceInitialCardMarks flag tunes the GC, but it 
is the GC itself that is possibly playing up, or it is an old GC plus some 
new features. Either way: trouble.


-steve


Re: Is it possible to run multiple mapreduce jobs from within the same application

2010-02-24 Thread Steve Loughran

Raymond Jennings III wrote:

In other words:  I have a situation where I want to feed the output from the first iteration of my mapreduce 
job to a second iteration and so on.  I have a "for" loop in my main method to setup the job 
parameters and to run it through all iterations but on about the third run the Hadoop processes lose their 
association with the 'jps' command and then weird things start happening.  I remember reading somewhere about 
"chaining" - is that what is needed?  I'm not sure what causes jps to not report the hadoop 
processes even though they are still active as can be seen with the "ps" command.  Thanks.  (This 
is on version 0.20.1)


  


Yes, here is something that does this, performing a PageRank-style ranking:


http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/extras/citerank/src/org/smartfrog/services/hadoop/benchmark/citerank/CiteRank.java?revision=7728&view=markup


Re: why not zookeeper for the namenode

2010-02-23 Thread Steve Loughran

Eli Collins wrote:

From what I read, I thought, that bookkeeper would be the ideal enhancement
for the namenode, to make it distributed and therefor finaly highly available.


Being distributed doesn't imply high availability. Availability is
about minimizing downtime. For example, a primary that can fail over
to a secondary (and back) may be more available than a distributed
system that needs to be restarted when it's software or dependencies
are upgraded. A distributed system that can only handle x ops/second
may be less available than a non-distributed system that can handle 2x
ops/second. A large distributed system may be less available than n
smaller systems, depending on consistency requirements. Implementation
and management complexity often result in more downtime. Etc.
Decreasing the time it takes to restart and upgrade an HDFS cluster
would significantly improve it's availability for many users (there
are jiras for these).



There's another kind of availability, "engineering availability". What we have 
today is nice in that the HDFS engineering skills are distributed among 
a number of companies, the source is there for anyone else to learn from, 
and rebuilding is fairly straightforward.


Don't dismiss that as unimportant. Engineering availability means that 
if you discover a new problem, you have the ability to patch your copy 
of the code and keep going, while filing a bug report for others to 
deal with. That significantly reduces your downtime on a software 
problem compared to filing a bug report and hoping that a future release 
will have addressed it.


-steve



Re: why not zookeeper for the namenode

2010-02-23 Thread Steve Loughran

E. Sammer wrote:

On 2/22/10 12:53 PM, zlatin.balev...@barclayscapital.com wrote:
My 2 cents: If the NN stores all state behind a javax.Cache façade it 
will be possible to use all kinds of products (commercial, open 
source, facades to ZK) for redundancy, load balancing, etc.


This would be pretty interesting from a deployment / scale point of view 
in that many jcache providers support flushing to disk and the concept 
of distribution and near / far data. Now, all of that said, this also 
removes some certainty from the name node contract, namely:


If jcache was used we:
 - couldn't promise data would be in memory (both a plus and a minus).
 - couldn't promise data is on the same machine.
 - can't make guarantees about consistency (flushes, PITR features, etc.).

It may be too general an abstraction layer for this type of application. 
It's a good avenue to explore, just playing devil's advocate.


Regards.


I know nothing about HA, though I work with people who do. They normally 
start their conversations with "Steve, you idiot, you don't understand 
...". Because HA is all or nothing: either you are HA or you aren't. 
It's also somewhere you need to reach for the mathematics to prove works 
in theory, then test in the field in interesting situations. Even then, 
they have horror stories.


We all know that NNs have limits, but most of those limits are known 
and documented:

 -doesn't like full disks
 -a corrupted edit log doesn't replay
 -if the secondary NN isn't live, the NN will stay up (it's been discussed 
having it somehow react if there isn't any secondary around and you say 
you require one)


There's also an open issue that may be related to UTF-8 translation that 
is confusing restarts, under discussion in -hdfs right now.


What is best is this: everyone has the same code base, one that is 
tested at scale.


If you switch to multiple back ends, then nobody other than the people 
who have the same back end as you will be able to replicate the problem. 
You lose the "Yahoo! and Facebook run on PetaByte sized clusters, so my 
40 TB is noise" argument, replacing it with "Yahoo! and Facebook run on 
one cache back end, I use something else and am on my own when it fails".


I don't want to go there




Re: HDFS vs Giant Direct Attached Arrays

2010-02-22 Thread Steve Loughran

Allen Wittenauer wrote:



On 2/19/10 11:08 AM, "Edward Capriolo"  wrote:

Now, imagine if this was a 6 node hadoop systems with 8 disks a node,
and we had to do a firmware updates. Wow! this would be easy. We could
accomplish this with no system-wide outage, at our leisure. With a
file replication factor of 3 we could hot swap disks, or even safely
fail an entire node with no outage.


... except for the NN and JT nodes. :)



or DNS.

That plane crash in Palo Alto last week took out some of our wifi user 
authentication stuff I needed in the UK to get my laptop on the network 
in a meeting room, thus forcing me to pay attention instead.


Remember, there's always a SPOF.

-steve


Re: JavaDocs for DistCp (or similar)

2010-02-18 Thread Steve Loughran

Tsz Wo (Nicholas), Sze wrote:

Oops, DistCp.main(..) calls System.exit(..) at the end.  So it would also 
terminate your Java program.  It probably is not desirable.  You may still use 
similar codes as the ones in DistCp.main(..) as shown below.  However, they are 
not stable APIs.


//DistCp.main
  public static void main(String[] args) throws Exception {
JobConf job = new JobConf(DistCp.class);
DistCp distcp = new DistCp(job);
int res = ToolRunner.run(distcp, args);
System.exit(res);
  }


Sorry, just replied saying roughly the same thing. Adding a formal API 
would be useful, as DistCp's implementation of Tool.run assumes that 
System.err is the right place to log.





Re: JavaDocs for DistCp (or similar)

2010-02-18 Thread Steve Loughran

Tsz Wo (Nicholas), Sze wrote:

Hi Balu,

Unfortunately, DistCp does not have a public Java API.  One simple way is to 
invoke DistCp.main(args) in your java program, where args is an array of the 
string arguments you would pass in the command line.


That's a method with System.exit() in it, so either you run it under a 
security manager (to trap the exit) or your app terminates without warning.


Better to create your own Configuration instance, then a new DistCp object:

Configuration conf = new Configuration();
DistCp distcp = new DistCp(conf);
int res = ToolRunner.run(distcp, args);  // args as you would pass on the command line

It's still going to log at System.out/System.err instead of a log API, 
but your JVM should stay around without you having to jump through 
security manager hoops




Re: Issue with Hadoop cluster on Amazon ec2

2010-02-17 Thread Steve Loughran

viral shah wrote:

Hi,

We have deployed hadoop cluster on EC2, hadoop version 0.20.1.
We are having couple of data nodes.
We want to get some files from the data node which is there on the amazon
ec2 instance to our local instance using java application, which in turn use
SequentialFile.reader to read file.
The problem is amazon uses private IP for host communication, but to connect
form the environment other than amazon we will be using public IP.
So when we try to connect to the data nodes via name node, it will report
data node's private IP and using the same we are not able to reach the data
node.


That's a feature to stop you accidentally exporting your entire HDFS 
filesystem to the rest of the world.



Is there any way we can set name node to send data nodes public NAT IP not
the internal IP, or any other work around is there to overcome this problem.


Push the data up to the S3 filestore first, and have the job sequence start 
from S3 and finish there too.




Re: Why is $JAVA_HOME/lib/tools.jar in the classpath?

2010-02-16 Thread Steve Loughran

Thomas Koch wrote:

Hi,

I'm working on the Debian package for hadoop (the first version is already in 
the new queue for Debian unstable).
Now I stumpled about $JAVA_HOME/lib/tools.jar in the classpath. Since Debian 
supports different JAVA runtimes, it's not that easy to know, which one the 
user currently uses and therefor I'd would make things easier if this jar 
would not be necessary.
From searching and inspecting the SVN history I got the impression, that this 
is an ancient legacy that's not necessary (anymore)?




I don't think Hadoop core/hdfs/mapred needs it. The only place where it 
would be needed is the JSP -> Java -> binary compilation work, but as the JSPs 
are precompiled you can probably get away without it. Just add tests for all 
the JSPs to make sure they work.



-steve


Re: datanode can not connect to the namenode

2010-02-15 Thread Steve Loughran

Marc Sturlese wrote:

Hey there, I have a hadoop cluster built with 2 servers. One node (A) contains
the namenode, a datanode, the jobtracker and a tasktracker.
The other node (B) just has a datanode and a tasktracker.
I set up hdfs correctly with ./start-hdfs.sh

When I try to set up MapReduce with ./start-mapred.sh the TaskTracker of node
(B) can not connect to the namenode. The log will keep throwing:

INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
mynamenode/192.168.0.13:8021. Already tried 8 time(s)

I think maybe something is missing in /etc/hosts file or this hadoop
property is not set correctly:

<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50010</value>
  <description>The address where the datanode server will listen to.
  If the port is 0 then the server will start on a free port.</description>
</property>


I try on the namenode:
telnet localhost 8021
telnet 192.168.0.10 8021
Both cases I get:
telnet: Unable to connect to remote host: Connection refused



That's the Namenode that isn't there, which should be on 192.168.0.13 at 
8021


You can see what ports really are open on the namenode by identifying 
the process (jps -v will do that), then running netstat -a -p | grep PID, 
where PID is the process ID of the namenode.


If you can telnet to port 8021 when logged in to the NN, but not 
remotely, it's your firewall or routing interfering


Re: What framework Hadoop uses for daemonizing?

2010-02-10 Thread Steve Loughran

Thomas Koch wrote:

I'm working on a hadoop package for Debian, which also includes init
scripts
using the daemon program (Debian package "daemon") from
http://www.libslack.org/daemon

Can these scripts be used on other distributions, like Red Hat? Or it's a
Debian only daemon?
I'm not familiar enough with Red Hat do answer this. The first thing that 
comes to my mind is, whether Red Hat has the same lsb_* scripts that I do 
source.


RHEL doesn't, the  daemon scripts for SmartFrog have to look for both 
RHEL and lsb; lsb is what you get on SuSE though.


From:
http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/release/scripts/etc/rc.d/init.d/smartfrogd?revision=8160&view=markup


if [ -f /lib/lsb/init-functions ]; then
  . /lib/lsb/init-functions
  STOP_SUPPORTED=0
  alias START_DAEMON=start_daemon
#  alias STOP_DAEMON=echo "ignoring stop daemon request"
  alias STATUS=MyStatus
  alias LOG_SUCCESS=log_success_msg
  alias LOG_FAILURE=log_failure_msg
  alias LOG_WARNING=log_warning_msg
elif [ -f /etc/init.d/functions ]; then
  . /etc/init.d/functions
  STOP_SUPPORTED=1
  alias START_DAEMON=start
  alias STOP_DAEMON=stop
  alias STATUS=status
  alias LOG_SUCCESS=success
  alias LOG_FAILURE=failure
  alias LOG_WARNING=passed
else
  echo "Error: your platform is not supported by $0" > /dev/stderr
  exit 1
fi

looks like I actually set a flag saying whether stopping the daemon 
works reliably, hmmm.



Another handy bit of code:

java -version >/dev/null
if [ $? -ne 0 ]; then
  echo "Error: $0 cannot find java. Either it is not installed or the PATH is incomplete" > /dev/stderr
  exit 6
fi

Testing all of this is fun, I scp my RPMs to virtualized machines then 
walk them through the lifecycle.


Re: Hadoop on a Virtualized O/S vs. the Real O/S

2010-02-09 Thread Steve Loughran

Stephen Watt wrote:

Hi Folks

I need to be able to certify that Hadoop works on various operating 
systems. I do this by running it through a series of tests. As 
I'm sure you can empathize, obtaining all the machines for each test run 
can sometimes be tricky. It would be easier for me if I could spin up 
several instances of a virtual image of the desired O/S, but to do this, I 
need to know if there are any risks I'm running using that approach.


Is there any reason why Hadoop might work differently on a virtual O/S as 
opposed to running on an actual O/S ? Since just about everything is done 
through the JVM and SSH I don't foresee any issues and I don't believe 
we're doing anything weird with device drivers or have any kernel module 
dependencies.


Kind regards
Steve Watt


I run Hadoop on VMs

- performance can be below raw IO rates, but that's predictable
- if you bring up a private network then you have DNS/rDNS problems. 
Hadoop is happy if everything knows who it is and DNS does too. 
Otherwise: edit the hosts tables
- the big enemy on VMs is unexpected swapping out and clock drift, 
screws up anything that assumes time moves forward at roughly the same 
rate everywhere. Zookeeper assumes this, as do most distributed 
co-ordination systems. If you keep VM load low, one Virtual CPU per 
physical one, and don't overallocate physical memory, most of these 
problems go away
-set the CPU affinity for the VM so it is always bonded to the same CPU, 
using taskset or the equivalent. Minimises cache misses and other problems






Re: Maven and Mini MR Cluster

2010-02-04 Thread Steve Loughran

Michael Basnight wrote:

Ya with the hadoop_home stuff i was grasping at straws. My mini MR Cluster has 
a valid classpath i assume, since my entire test runs (thru 3 mapreduce jobs 
via the localrunner) before it gets to the mini MR cluster portion. Is it 
possible to print out the classpath thru the JVMManager or anything else like 
that for debugging purposes?



probably, though I don't know what.



Re: Maven and Mini MR Cluster

2010-02-04 Thread Steve Loughran

Michael Basnight wrote:
I'm using maven to run all my unit tests, and I have a unit test that creates a mini MR cluster. When I create this cluster, I get ClassDefNotFound errors for the core hadoop libs (Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapred.Child). When I run the same test without creating the mini cluster, well.. it works fine.

My HADOOP_HOME is set to the same version as my mvn repo, and points to a valid installation of hadoop. When I validate the classpath through maven (dependency:build-classpath), it says that the core libs are on the classpath as well (sourced from my .m2 repository). I just can't figure out why hadoop's mini cluster can't find those jars. Running hadoop 0.20.0.


Any suggestions?


The MiniMR cluster does everything in memory and doesn't look at 
HADOOP_HOME, which is only for the shell scripts.


It sounds like you need hadoop-mapreduce on your classpath. The Child class 
is the entry point used when creating new task JVMs, and it is the classpath 
of that forked JVM (not the one the MiniMRCluster was created in) that 
isn't right.


Re: help for hadoop begginer

2010-02-02 Thread Steve Loughran

nijil wrote:

I have read the basic stuff about Hadoop.. err, I have a few doubts... mind
you, I am a beginner.

1: So is Hadoop a file system only?

2: Can HBase be used instead of other databases on other platforms (e.g. Java)?

3: What is MapReduce exactly and how is it related to Hadoop? (I mean, is it only
about parallel computing? I don't understand how much parallel computing
is possible in a Hadoop cloud system which is used for webhosting.) I require
some help on the topic of clubbing "Cloud Computing, Hadoop and
Webhosting".. please, this is really important.



4: Are HBase and Hypertable similar or is there a big difference?

5: Can someone provide a MapReduce implementation example other than one
related to search engines?

6: How are MapReduce and Hadoop related?

7: Can I learn Hadoop with a "Cloudera's Distribution for Hadoop" VMware
image?

8: How is database synchronization done in HBase? I believe HBase is a
distributed database.

9: Can someone provide contact details for further help, if you don't
mind... :)


When does your homework have to be done by? Is there someone we could 
email it to direct so we'd get the credit the person answering the 
questions deserves?


-steve


Re: Can I submit jobs to a hadoop cluster without using hadoop executable

2010-01-25 Thread Steve Loughran

vishalsant wrote:

JobConf.setJar(..) might be the way , but that class is deprecated and no
method in the Job has a corresponding addition.


1. Configuration.set("mapreduce.job.jar", JarName) is what you should 
use to set this. You can find the jar by using 
JobConf.findContainingJar(Class)


2. To submit jobs, you should have the Hadoop JAR(s) on your classpath, 
commons-logging and any other dependencies. For that I have

  commons-httpclient, commons-net, commons-codec, xmlenc, commons-cli,

3. JobClient.submitJob() shows you how to submit jobs; it wraps 
Job.submit() with some extra (optional) setup
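Putting those pieces together, a minimal sketch of submitting a job from plain 
Java (0.20-era org.apache.hadoop.mapred API assumed; the host names, paths and 
the identity map/reduce job are illustrative only):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.IdentityMapper;
  import org.apache.hadoop.mapred.lib.IdentityReducer;

  public class RemoteSubmit {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(RemoteSubmit.class);
      // point at the cluster instead of relying on the hadoop launcher script's config
      conf.set("fs.default.name", "hdfs://namenode:8020");  // hypothetical host
      conf.set("mapred.job.tracker", "jobtracker:8021");    // hypothetical host
      // the jar holding your classes, located from the classpath (assumes it is in a jar)
      conf.set("mapreduce.job.jar", JobConf.findContainingJar(RemoteSubmit.class));
      conf.setMapperClass(IdentityMapper.class);
      conf.setReducerClass(IdentityReducer.class);
      FileInputFormat.setInputPaths(conf, new Path("/user/me/input"));
      FileOutputFormat.setOutputPath(conf, new Path("/user/me/output"));
      JobClient.runJob(conf);  // submits the job and waits for it to complete
    }
  }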








Re: Google has obtained the patent over mapreduce

2010-01-21 Thread Steve Loughran

Edward Capriolo wrote:

I was just mentioning that in the SCO Linux suit, companies that were
using Linux as a fileserver or ftp server were targeted, not only
companies that developed packaged Linux.

Relying on a corporation to be benevolent is a risk in itself. A
change in management could cause a change in policy.

A decision maker might avoid Hadoop as it is more risky to deploy now,
in particular if that company was in some way competitive with Google.


SCO sued people who had bought Unix source code licenses and threatened 
end-users of linux over copyright. No patent lawsuits, just doomed 
copyright T&Cs.


Generally, companies that want to work with open source don't waste 
time trying to enforce patents because it's a losing battle. You lose a 
lot of goodwill, and when you consider that the Android stack is built 
on truckloads of Apache and other open source Java code, the loss of 
that goodwill can be quite significant. Add in that Google is a platinum 
sponsor of Apache, and you can see the conflicts of interest that would 
develop.


I have no idea what they will do with the MR patent, but note that the 
ASF license says "Sue someone over some apache code you have a patent 
for and you lose the right to use that apache code yourself". While 
Google don't use Hadoop internally, they have been using it for their 
academic testbed.


Finally, I don't think the patent applies outside the US, and if they 
published before filing, it's harder to get an EU/UK patent. So host your 
code here in Europe and not only are you free from this patent concern, 
your customers benefit from your requirement to follow EU data 
protection laws. Everyone wins.


-Steve


Re: Hadoop and X11 related error

2010-01-19 Thread Steve Loughran

Tarandeep Singh wrote:




I am using Frame class and it can't be run in headless mode.

Suggestion of Brien is useful (running x virtual frame buffer) and I guess I
am going to do the same on my servers but right now I am testing on my
laptop which is running X server.

The problem here is, the test program runs over ssh without any problem but
when I run the map reduce program I keep getting error. Both the standalone
program and MR program are run as the same user.

I followed Todd's suggestion and checked the value of XAUTHORITY environment
variable. It turns out, the value of this variable is not set when I do ssh.
So I am trying to see if I can set its value and if the MR program runs
after that. But if this is the problem then the standalone program should
also not run.



Logged in as the user at the console, open access completely:

xhost +

This should stop the xauthority stuff mattering, leaving only the 
DISPLAY env variable as the binding issue.


Re: Hadoop and X11 related error

2010-01-18 Thread Steve Loughran

Tarandeep Singh wrote:

Hi,

I am running a MR job that requires usage of some java.awt.* classes, that
can't be run in headless mode.

Right now, I am running Hadoop in a single node cluster (my laptop) which
has X11 server running. I have set up my ssh server and client to do X11
forwarding.

I ran the following java program to ensure that X11 forwarding is working-


The problem here is that you need to tell AWT to run headless instead of 
expecting an X11 server to be on hand. That means setting 
java.awt.headless=true in the processes that are using AWT to do bitmap 
rendering.


See 
http://java.sun.com/developer/technicalArticles/J2SE/Desktop/headless/ 
for details.
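A minimal sketch of the headless setup (property names as used in 0.20-era 
Hadoop; the -Xmx value is only there so the usual heap setting isn't lost when 
the option string is overridden):

  // JobConf comes from org.apache.hadoop.mapred
  // on the client side, for any AWT work done locally
  System.setProperty("java.awt.headless", "true");
  // for the forked task JVMs, pass the flag through the job configuration
  JobConf conf = new JobConf();
  conf.set("mapred.child.java.opts", "-Xmx200m -Djava.awt.headless=true");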




Re: debian package, ivy problems

2010-01-11 Thread Steve Loughran

Thomas Koch wrote:

Hi,

I started the debian packaging at 
http://git.debian.org/?p=users/thkoch-guest/hadoop.git


I wrote my own ivysettings.xml[1] so that ivy won't go online, but rather uses 
what's already installed as debian packages. However ivy spits some errors at 
me like:



You know there's a trick you can do with Ant: run it with the 
build.sysclasspath=only property (e.g. ant -Dbuild.sysclasspath=only), then 
you can leave Ivy to do its online thing but completely ignore the values it 
resolves. This is how Apache Gump does its thing. It lets Ivy do whatever it 
wants to, but Ivy doesn't get used for classpath setup:

http://ant.apache.org/manual/sysclasspath.html

Ivy would be used to set up the lists of files to copy into tar files, 
but if you grab just the Hadoop JARs by name, you don't have to care 
about that either. Just set the build up with the classpath set to 
everything you want Hadoop to see, and run the build with 
build.sysclasspath=only.


commons-logging#commons-logging;work...@debian-maven-repo: configuration not 
found in commons-logging#commons-logging;work...@debian-maven-repo: 'master'. 
It was required from org.apache.hadoop#Hadoop;work...@jona commons-logging


It does find commons-logging and puts it in the cache. There are only two 
dependencies that work (commons-cli#commons-cli;1.2, oro#oro;2.0.8), which have 
the conf attribute "common->default". The failing dependencies have 
"common->master".



default = "jar plus dependencies"; master = "jar only".

1. What version of commons-logging are you looking to pull in? v1.1.1's POM 
dependencies look a bit more complex.

2. What is your current ivy file?


Re: debian package of hadoop

2010-01-11 Thread Steve Loughran

Isabel Drost wrote:

On Monday 04 January 2010 13:37:48 Steve Loughran wrote:

Jordà Polo wrote:

I have been thinking about an official Hadoop Debian package for a while
too.

If you want "official" as in can say "Apache Hadoop" on it, then it will
need to be managed and released as an apache project. That means
somewhere in ASF SVN. If you want to cut your own, please give it a
different name to avoid problems later.


Huh? I am lost and confused here: As far as I understood Thomas is trying to 
create a Debian package which then goes into the Debian distribution 
(possibly sid at the moment).


I was confused by the term "official". The Hadoop team could certainly 
release Hadoop RPMs and deb files, which I think is a good idea, just 
because we should recognise that these are the primary ways you install 
Hadoop in big clusters. Having RPM/deb files as part of your build 
process keeps you focused, though it means that Hudson instances not 
running on Linux have to skip some stages (i.e. you need a Linux CI 
server/VM too, plus another one for your install testing).


Same was done e.g. with Lucene, httpd, Tomcat etc. All of these packages are 
maintained by Debian people and not pushed by Apache guys. Still the packages 
are named tomcat5.5, apache2.2-common, liblucene-java. So it seems possible 
to name official Debian packages similar to the upstream Apache project w/o 
much problems.


OK



Re: debian package of hadoop

2010-01-04 Thread Steve Loughran

Thomas Koch wrote:

Hi Jordà,


The main issue that prevents the inclusion of the current Cloudera
package into Debian is that it depends on Sun's Java. I think it would
be interesting, at least for an official Debian package, to depend on
OpenJDK in order to make it possible to distribute it in "main" instead
of "contrib".
The build-depends line can easily be changed as long as hadoop will build with 
openjdk. The binary will depend on java5-runtime-headless which is provided by 
any java runtime. So the user of the package is free to choose either Sun or 
openjdk.



Java 6+ only. It will build on OpenJDK or JRockit; the Hadoop team 
merely chooses to ignore all bug reports that you can't recreate on the 
official JDKs. You are still free to fix them yourself. You must also 
know that your JVM hasn't been tested at scale, unless you have the 
scale to compare with the big datacentres.


What use cases are you thinking of here?

1) developer coding against the hadoop Java and C APIs
2) Someone setting up a small 1-5 machine cluster
3) large production datacentre of hundreds of worker nodes
4) transient virtualised worker nodes


For (3) and (4) the challenge is getting the right configuration out 
there, where configuration =

hadoop XML files
log4j settings
rack awareness scripts
and suchlike

For virtualised clusters you set up one node then ask the infrastructure 
for 100 instances; for physical ones you just need to get the right 
files out everywhere. Packaging them up and pushing it out as a .deb or 
RPM is one option -the Cloudera one- and is better than trying to do it by 
hand -but it is only one option.


Re: Program is running well in pseudo distributed mode on Hadoop-0.18.3, but it is not running in distributed mode on 4 nodes(each running Redhat linux 9)

2010-01-04 Thread Steve Loughran

Ravi wrote:

Hi,
  I have designed a mapreduce algorithm for all pairs shortest paths 
problem. As a part of the implementation of this algorithm, I have 
written the following mapreduce job. It is running well and producing 
desired output in pseudo distributed mode. I have used a machine with 
ubuntu 8.04 and hadoop-0.18.3 to run the job in pseudo distributed mode. 
When I tried to run the same program on a cluster of 4 machines(each 
running Redhat linux 9) with the same version of hadoop(hadooop-0.18.3), 
the program is not giving any errors but its not giving any output as 
well(The output file is blank). This is the first time I am facing this 
kind of problem.
I am attaching the jar file of the program and sample inputs: out1 and 
out2 as well.(The program need to read input from these two files)
I have searched the archive but didn't find any mail mentioning this 
problem. I have googled, but it was of no use.

I am not able to find out what am I missing in the code.


First, the bad news: nobody is going to debug your program for you. It's 
your program; you get to learn about distributed debugging. We all have 
our own programs and their bugs to deal with, and if it's a problem with 
your physical cluster, then nobody but you is in a position to fix it.


Now, the good news: the skills you learn on this simple app scale well 
to bigger clusters and more complex programs. Accordingly, it is 
absolutely essential that you learn this process now, while your 
problem is still small.


http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms

* Tell Hadoop to save all failed outputs with keep.failed.task.files
* See what gets retained on the worker machines, and their logs
* Log at log4j's debug level in your code, and run the nodes with your 
classes set to log at debug level (leave the rest at info, for now). 
Leave the logging settings in, they may come in useful later; just check 
with log.isDebugEnabled() before constructing strings or doing other work 
to create the log entries (see the sketch below).
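A minimal sketch of the guarded logging (commons-logging, as Hadoop itself 
uses; the mapper class is hypothetical):

  import org.apache.commons.logging.Log;
  import org.apache.commons.logging.LogFactory;

  public class MyMapper {  // stands in for your real Mapper implementation
    private static final Log LOG = LogFactory.getLog(MyMapper.class);

    void process(String key, String value) {
      if (LOG.isDebugEnabled()) {
        // the string concatenation only happens when debug logging is enabled
        LOG.debug("processing key=" + key + " value=" + value);
      }
      // ... actual map work ...
    }
  }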


> Should I be using hadoop-0.20?
>

It's not likely to magically make your problem go away, if that is what 
you were wondering.


Re: debian package of hadoop

2010-01-04 Thread Steve Loughran

Jordà Polo wrote:

On Wed, Dec 30, 2009 at 07:53:43PM +0100, Thomas Koch wrote:

today I tried to run the cloudera debian dist on a 4 machine cluster. I still
have some itches, see my list below. Some of them may require a fix in the
packaging.
Therefor I thought that it may be time to start an official debian package of
hadoop with a public GIT repository so that everybody can participate.
Would cloudera support this? I'd package hadoop 0.20 and apply all the
cloudera patches (managed with topgit[1]).
At this point I'd like to have your opinion whether it would be wise to have
versioned binary packages like hadoop-18, hadoop-20 or just plain hadoop for
the Debian package?


Hi Thomas,

I have been thinking about an official Hadoop Debian package for a while
too.


If you want "official" as in can say "Apache Hadoop" on it, then it will 
need to be managed and released as an apache project. That means 
somewhere in ASF SVN. If you want to cut your own, please give it a 
different name to avoid problems later.



The main issue that prevents the inclusion of the current Cloudera
package into Debian is that it depends on Sun's Java. I think it would
be interesting, at least for an official Debian package, to depend on
OpenJDK in order to make it possible to distribute it in "main" instead
of "contrib".


+1 to more work on packaging; I'd go so far as to push for a separate 
"deployment" subproject which would be downstream of everything, 
including HBase and other layers.


I view .deb and .RPM releases as stuff you would push out to clusters, 
maybe with custom config files for everything else. Having the ability 
to create your own packages on demand would appear to be something that 
people need (disclaimer, I do create my own RPMs)


I would go for the package to not bother mentioning which Java it 
depends on, as that lets you run on any Java version, jrockit included. 
Or drive the .deb creation process such that you can decide at release 
time what the options are for any specific target cluster.




Also, note that in order to fit into Debian's package autobuilding
system, some scripts will probably require some tweaking. For instance,
by default Hadoop downloads dependencies at build time using ivy, but
Debian packages should use already existing packages. Incidentally,
Hadoop depends on some libraries that aren't available in Debian yet,
such as xmlenc, so there is even more work to do.


Well, we'll just have to ignore the debian autobuilding process then, 
won't we?


There are some hooks in Ivy and Ant to give local machine artifacts 
priority over other stuff, but it's not ideal. Let's just say there are 
differences in opinion between some of the linux packaging people and 
others as to what is the correct way to manage dependencies. I'm in the 
"everything is specified under SCM" camp; others are in the "build 
against what you find" world.


I cut my RPMs by:
* pushing the .rpm spec template through property expansion; this creates 
an RPM containing all the version markers set up right, driven by my 
build's properties files.
* not declaring dependencies on anything, Java or any other JAR like 
log4j. This ensures my code runs with the JARs I told it to, not 
anything else. It also gives me the option to sign all the JARs, which 
the normal Linux packaging doesn't like.
* releasing the tar of everything needed to sign the JARs and create the 
RPMs as a redistributable. This gives anyone else the option to create 
their own RPMs too. You don't need to move the entire build/release 
process to source RPMs or .debs for this, any more than the Ant or log4j 
packages get built/released this way.
* scp-ing the packages to a VMware or VirtualBox image of each supported 
platform, ssh-ing in and running the rpm uninstall/install commands, then 
walking the scripts through their lifecycle. You were planning on testing 
the upgrade process, weren't you?


-steve



(Anyway, I'm interested in the package, so let me know if you need some
help and want to set up a group on alioth or something.)


A lot of the fun here is not going to be setting up the package files (



Re: Best CFM Engine for Hadoop

2009-12-10 Thread Steve Loughran

Edward Capriolo wrote:

On Thu, Dec 10, 2009 at 9:59 AM, Steve Loughran  wrote:




Let me kick it off with: not everything fits every environment, your
mileage may vary.


RPMs are not actually that bad for getting stuff out, especially if you can
do PXE/kickstart stuff and bring up machines from scratch. One problem: the
need to rebuild and push out RPMs for every change, if you push out
configuration that way.

True, you have to make the choice of whether you want to roll your own RPM for
your configuration files, or manage those on the side. Making an RPM
called our-hadoop-config.rpm is actually not complicated compared to,
say, learning how some other configuration-pushing system works, and
might take less time.



that's a good point; knowing how to put together RPMs is a useful skill. 
Also, you can ignore the update problem by encouraging clean machine 
builds from time to time, at the very least uninstall/reinstall


FWIW, we do make our own RPMs which contain the various JARs and OS 
integration scripts, but we push out the configuration differently. 
Keeps things more agile, but leaves you with that problem of "what is 
your HA solution for maintaining configuration state in your cluster". 
There's a lot to be said for filesystems




Re: Best CFM Engine for Hadoop

2009-12-10 Thread Steve Loughran

Edward Capriolo wrote:
system to ignore this file.


So now that I am done complaining, what do I think you should do?

1 clearly document your install process
2 make your install process fully scriptable
--or--
3 roll your own rpms (or debs, tar etc) for everything not in someone else's RPM
4 run a nightly backup for each server class
5 revision your config files
6 (optionally) use tripwires/MD5s only to check for unauthorized changes

Anyway, my long-short point: get something that works the way you want
it to. Look out for systems that offer you "new" and "exciting" ways
to do things that only take 10 seconds, like editing /etc/fstab or
installing an RPM.


RPMs are not actually that bad for getting stuff out, especially if you 
can do PXE/kickstart stuff and bring up machines from scratch. One 
problem: the need to rebuild and push out RPMs for every change, if you 
push out configuration that way.

Other problems:
 * It's possible for different RPMs to claim ownership of things; much 
confusion arises.
 * The RPM dependency model doesn't work that well with Java. I say 
that as someone who has outstanding disputes with the JPackage scheme, 
and who also knows that the Maven/Ivy dependency model is flawed too 
(how do you declare in any of these tools that you want "an XML parser 
with XSD validation" without saying which one?).
 * Spec files are painful to work with, and so is their build and test 
process. You do have a test process, right?
 * The way RPMs upgrade is brain dead; they install the new stuff and then 
decide whether or not to uninstall the old stuff, which makes it very hard to 
do some upgrades that change directory structure.


-Steve



Re: Start Hadoop env using JAVA or HADOOP APIs (InProcess)

2009-12-07 Thread Steve Loughran

samuellawrence wrote:

Hai,

I have to start the HADOOP environment using java code (inprocess). I would
like to use the APIs to start it.  


Could anyone please give me snippet or a link.


Hi
1. I've been starting/stopping Hadoop with SmartFrog, in JVM. Email me 
direct and I will point you at some of the code, though I have branched 
Hadoop (temporarily) to make it easier.


2. The MiniDFS and MiniMR clusters used in testing actually do this 
internally (see the sketch after this list)


3. As a result of (1) I know what the troublespots are if you are 
trying to run Hadoop in the same JVM as any other code:
 * changes to the java security stuff in 0.21; incompatible with 
security managers

 * still a fair few singletons in the services
 * JSP under Jetty can be quirky and not always restart. This is not a 
hadoop-level bug, but something deep in Jasper, something probably 
related to JSP classloaders.
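
Going back to (2): those test clusters can be driven from your own code 
along these lines -a rough sketch against the 0.20-era test classes 
(they live in the test JAR, not hadoop-core), and the constructor 
signatures have moved between releases, so treat the exact arguments as 
illustrative:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hdfs.MiniDFSCluster;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MiniMRCluster;

  public class InProcessCluster {
    public static void main(String[] args) throws Exception {
      // bring up a 2-datanode HDFS and a 2-tasktracker MR cluster in this JVM
      Configuration conf = new Configuration();
      MiniDFSCluster dfs = new MiniDFSCluster(conf, 2, true, null);
      MiniMRCluster mr = new MiniMRCluster(2,
          dfs.getFileSystem().getUri().toString(), 1);

      // submit work against the JobConf the mini cluster hands back
      JobConf jobConf = mr.createJobConf();

      // ...do something with jobConf, then tear everything down
      mr.shutdown();
      dfs.shutdown();
    }
  }

Bear in mind they were written for unit tests, not production hosting.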


You can do this stuff in production, but you should consider having a 
separate VM for each service, and terminating the process when you are 
done with the specific node type. It's safer that way


-steve



Re: Web Interface Not Working

2009-12-02 Thread Steve Loughran

Mark Vigeant wrote:

Todd,

I followed your suggestion, shut down everything, restarted it, and the UI is 
still not there. Jps shows NN and JT working though.



Web UI is precompiled JSP on jetty; the rest of the system doesn't need 
it, and if the JSP JARs aren't on the classpath, Jetty won't behave.

 * make sure that you have only one version of Jetty on your classpath
 * make sure you only have one set of JSP JARs on the CP
 * make sure the jetty jars are all consistent (not mixing versions)
 * check that the various servlets are live (the TT and DNs have them). 
No servlets: jetty is down.


I think you can tell Jetty to log at more detail, worth doing if you are 
trying to track down problems.


Re: why does not hdfs read ahead ?

2009-11-25 Thread Steve Loughran

Michael Thomas wrote:

Hey guys,

During the SC09 exercise, our data transfer tool was using the FUSE 
interface to HDFS.  As Brian said, we were also reading 16 files in 
parallel.  This seemed to be the optimal number, beyond which the 
aggregate read rate did not improve.


We have worked scheduled to modify our data transfer tool to use the 
native hadoop java APIs, as well as running some additional tests 
offline to see if the HDFS-FUSE interface is the bottleneck as we suspect.


Regards,

--Mike


Was this all local data?

In Russ Perry's little paper "High Speed Raster Image Streaming For 
Digital Presses Using the Hadoop File System", he got 4Gb/s over the LAN 
by having a client app decide which datanode to pull each block from, 
rather than having the NN tell it which node to ask for which block.


"Measured stream rates approaching 4Gb/s were achieved which is close to 
the required rate for streaming pages containing rich designs to a 
digital press. This required only a minor extension to the Hadoop client 
to allow file blocks to be read in parallel from the Hadoop data nodes."


http://www.hpl.hp.com/techreports/2009/HPL-2009-345.html



Re: Building Hadoop from Source ?

2009-11-24 Thread Steve Loughran

Siddu wrote:

On Thu, Nov 12, 2009 at 11:50 PM, Stephen Watt  wrote:


Hi Sid

Check out the "Building" section in this link -
http://wiki.apache.org/hadoop/HowToRelease . Its pretty straight forward.
If you choose to not remove the test targets expect the build to take
upwards of 2 hours as it runs through all the unit tests.



Hi all,

I Have checked out the source code from

URL: https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20
Repository Root: https://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 763504
Node Kind: directory
Schedule: normal
Last Changed Author: nigel
Last Changed Rev: 763504
Last Changed Date: 2009-04-09 09:20:11 +0530 (Thu, 09 Apr 2009)

and i tried to build it using ant command

any idea why its stopping at resolving dependencies ?

ivy-resolve-common:
[ivy:resolve] :: resolving dependencies ::
org.apache.hadoop#Hadoop;work...@sushanth-laptop
[ivy:resolve] confs: [common]



Probably can't retrieve the dependency. Either your laptop is offline, 
or Ant isn't set up to go through any proxy you have


Re: Fw: Alternative distributed filesystem.

2009-11-19 Thread Steve Loughran

Reshu Jain wrote:

Hi
I wanted to propose IBM's Global Parallel File System™ (GPFS™ ) as the 
distributed filesystem. GPFS™ is well known for its unmatched 
scalability and
leadership in file system performance, and it is now IBM’s premier 
storage virtualization solution. More information at 
http://www-03.ibm.com/systems/clusters/software/gpfs/
GPFS™ is POSIX compliant and designed for general workloads. The 
filesystem performance is designed for random access behavior on I/O 
operations


With a requirement for RAID5 and an infiniband SAN, which increases cost/GB, 
I've not yet seen a GPFS™ filestore in the PB range.


We at IBM have been running Hadoop on GPFS™ and have seen comparable 
performance number to HDFS. We are in the process of putting the GPFS 
plugin into Hadoop code. In which case you will also be able to run 
Map-Reduce jobs natively on GPFS™ along with other applications on the 
same cluster."


I'm not going to criticise GPFS™ for what it does -deliver POSIX at 
infiniband rates- but in the Ananthanarayanan et al paper


1. They had to turn off prefetch which will degrade other apps using the 
filestore


2. They ran Hadoop™ on HDFS with a replication factor of 2; this 
increases the probability of non-local work and hence the ethernet 
becomes more of a bottleneck. A replication factor of 3 is not just 
recommended for data availability, but for workload distribution.




The following papers talk about the performance comparison between HDFS 
and GPFS™ and changes made to GPFS™ to support map reduce data loads.


Rajagopal Ananthanarayanan, Karan Gupta, Prashant Pandey, Himabindu 
Pucha, Prasenjit Sarkar, Mansi Shah, Renu Tewari: Cloud analytics: Do we 
really need to reinvent the storage stack? In Workshop on Hot Topics 
in Cloud Computing (HotCloud '09), June 15, San Diego, CA



There's also Lustre and Red Hat GFS. I haven't played with either; don't 
know how well Hadoop works on them, or what changes you'd need to make to 
provide locality data to the JobTracker, which is the main feature 
needed to do effective scheduling


-Steve



Re: names or ips in rack awareness script?

2009-11-19 Thread Steve Loughran

Michael Thomas wrote:
IPs are passed to the rack awareness script.  We use 'dig' to do the 
reverse lookup to find the hostname, as we also embed the rack id in the 
worker node hostnames.




It might be nice to have some example scripts up on the wiki, to give 
people a good starting place


Re: showing hadoop status UI

2009-11-18 Thread Steve Loughran

Mark N wrote:

I want to show the status of M/R jobs on a user interface; should I read the
default hadoop counters to display some kind of
map/reduce task status?


I could read the status of map/reduce tasks using JobClient (hadoop
default counters). I can then have a java webservice exposing these
functions so that
other modules (such as C#, VB.NET) can access and show the status on a UI

is this a correct approach ?




1. The HTML pages themselves are a user interface. They could be cleaned 
up, made more secure, etc, but anything you do there would benefit 
everyone. It would also be much easier to test than any rich client, as 
we can use HtmlUnit to stress the site.


2. there's some JSP output of status as XML forthcoming, grab the 
trunk's code and take a look


3. There are things like Hadoop Studio which provide a GUI, 
http://www.hadoopstudio.org/ ; and some Ant tasks in a Hadoop distro. 
You are much better off using someone elses code than writing your own.


You aren't going to be able to talk from .NET to Hadoop right now; I've 
discussed having a long-haul route to Hadoop, "Mombasa": 
http://www.slideshare.net/steve_l/long-haul-hadoop , but not implemented 
anything I'd recommend anyone other than myself and my colleagues to 
use, as it's pretty unstable.


I do think a good long-haul API for job submission would be nice, one 
that also works with higher-level job queues, like Cascading, Pig, other 
Hadoop workflows. I also think we should steer clear of WS-*, because 
its wrong, although that will mean that you won't be able to generate 
VB.net or C# stubs straight from WSDL.


Summary: use the web GUI and help improve it if you can; try using 
things like Hadoop Studio if it is not enough. If you want to help build 
a long-haul API it would be good, but it's going to involve a lot of 
effort and you won't see benefits for a while.
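
(If you do end up polling job state from Java in the meantime, the 
JobClient side looks roughly like this -a sketch against the 0.20 mapred 
API, untested, so check the method names against your release:

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.JobStatus;
  import org.apache.hadoop.mapred.RunningJob;

  public class JobPoller {
    public static void main(String[] args) throws Exception {
      JobClient client = new JobClient(new JobConf()); // reads the usual config files
      for (JobStatus status : client.getAllJobs()) {
        RunningJob job = client.getJob(status.getJobID());
        if (job != null) {                             // the job may have been retired
          System.out.println(job.getJobName()
              + " map=" + job.mapProgress()
              + " reduce=" + job.reduceProgress());
        }
      }
    }
  }

The counters come off RunningJob.getCounters() if you want more than the 
two progress figures.)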


-steve


Re: client.Client: failed to interact with node......ERROR

2009-11-17 Thread Steve Loughran

Johannens Zillmann wrote:

Hi Steve,

in the meantime Yair posted logs with hadoop debug log level.
--
09/11/17 01:31:59 DEBUG ipc.Client: IPC Client (47) connection to 
qa-hadoop005.ascitest.net/10.12.2.205:2 from root: starting, having 
connections 2
09/11/17 01:31:59 DEBUG ipc.Client: closing ipc connection to 
qa-hadoop005.ascitest.net/10.12.2.205:2: null

java.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:375)
...
--

Does this give anyone an idea?


Probably a versioning problem: the two machines aren't in sync. Hadoop 
IPC is very brittle against change, the price of speed and compactness.


Re: client.Client: failed to interact with node......ERROR

2009-11-17 Thread Steve Loughran

Johannens Zillmann wrote:

Hi there,

I directed Yair to this list because the exception makes me think it could 
be a problem of using hadoop ipc on its own versus using hadoop ipc in a servlet 
container like tomcat. Thought maybe a problem with static variables.

To explain, katta uses plain hadoop ipc for search communication.



hadoop IPC works in other envs fairly well, though server side someone's 
been doing things with java security that is incompatible with any 
other java security environment, which probably includes locked down 
tomcat. That's in the datanodes and namenode though, not the client


Re: mapred.local.dir options

2009-11-16 Thread Steve Loughran

Edward Capriolo wrote:

Steve,

I ran into something with this parameter that is troublesome.

If I remember correctly mapred.local.dir is used by both TaskTracker and
JobTracker.

This was a subtle problem for me because I am not able to share my
hadoop-site.xml between these nodes.

For example, right now I am sharing my configuration so I did not have
to branch it. This makes it impossible for me to do:

-Dmapred.job.tracker=local

Because each datanode does not have the same directory structure as my
JobTracker. Should mapred.local.dir be split into two separate
variables?


it takes a list, so you could list dirs that are valid on different machines:

mapred.local.dir=/jobtracker/local, /tt/local

or play with symlinks on the machines to make their filesystems look 
more consistent


mapred.local.dir options

2009-11-16 Thread Steve Loughran


I see that the mapred.local.dir is served up round robin, as with the 
dfs.data.dir values. But there's no awareness of the possibility that 
the same disk partition is used for mapred local data and for datanode 
blocks.


What do people do here?

* keep their fingers crossed that if the MR job creates too much data, 
it doesn't interfere with the datanodes

* use separate partition(s) for mapred.local.dir
* Set really high mapred.local.dir.minspacestart and 
mapred.local.dir.minspacekill values, 10s of GB, so MR jobs can't 
normally come close to causing the partitions to run out of space
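
That third option would look something like this in the tasktracker 
config -values are in bytes and purely illustrative, so pick numbers 
that match your disks:

  mapred.local.dir.minspacestart=21474836480
  mapred.local.dir.minspacekill=10737418240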


I'll put whatever comes down as best practice up on the 
http://wiki.apache.org/hadoop/DiskSetup page


Re: Creating a new configuration

2009-11-16 Thread Steve Loughran

Jim Twensky wrote:

The documentation on configuration states:


Unless explicitly turned off, Hadoop by default specifies two
resources, loaded in-order from the classpath:

   1. core-default.xml : Read-only defaults for hadoop.
   2. core-site.xml: Site-specific configuration for a given hadoop
installation.


Does this mean when creating a configuration, we have to expilicitly
add the resources like:

conf.addResource("hdfs-default.xml");
conf.addResource("hdfs-site.xml");
conf.addResource("mapred-default.xml");
conf.addResource("mapred-site.xml");

to make those settings visible to the client program? Skimming through
the source code, I got the impression that hdfs and mapred settings
are not loaded by default and I keep adding above four lines to every
client I implement.


Looking at the source is always handy

The Configuration(boolean loadDefaults) constructor loads all files 
marked as "default" via addDefaultResource(); you should use it to 
register resources which are always to be loaded whenever the 
constructor with loadDefaults=true is called


You can rummage through the source to see where it is called; the 
(deprecated in 0.21+) JobConf registers the mapred files; and the static 
initializer of DistributedFileSystem loads the hdfs ones



Speaking of which, what's the difference between addResource and
addDefaultResource methods? Should I use addDeafultResource methods to
add the above resources instead? The documentation is not clear to me
as to which one is appropriate.


Only use addDefaultResource to add XML files to every configuration 
created from then on. Add a resource to an explicit instance if you are 
only working with that instance.
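
Roughly, the difference looks like this -a quick sketch, nothing more:

  import org.apache.hadoop.conf.Configuration;

  public class ConfDemo {
    public static void main(String[] args) {
      // every Configuration built from now on with loadDefaults=true
      // (which includes the no-args constructor) picks this up from the classpath
      Configuration.addDefaultResource("my-site.xml");

      Configuration conf = new Configuration();   // defaults + my-site.xml
      conf.addResource("job-specific.xml");       // visible to this instance only

      System.out.println(conf.get("some.property"));
    }
  }

Note that addDefaultResource is a static call on the Configuration class 
itself, not on an instance.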




Re: About Hadoop pseudo distribution

2009-11-12 Thread Steve Loughran

kvorion wrote:

Hi All,

I have been trying to set up a hadoop cluster on a number of machines, a few
of which are multicore machines. I have been wondering whether the hadoop
pseudo distribution is something that can help me take advantage of the
multiple cores on my machines. All the tutorials say that the pseudo
distribution mode lets you start each daemon in a separate java process. I
have the following configuration settings for hadoop-site.xml:


You should just give every node with more cores more task tracker slots; 
this will give it more work from the Job Tracker
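
The slot counts are per-tasktracker settings, so on the boxes with more 
cores you'd bump something like this in their config -the numbers are 
purely illustrative:

  mapred.tasktracker.map.tasks.maximum=8
  mapred.tasktracker.reduce.tasks.maximum=4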


Re: NameNode/DataNode & JobTracker/TaskTracker

2009-11-11 Thread Steve Loughran

John Martyniak wrote:

Thanks Todd.

I wasn't sure if that is possible.  But you pointed out an important 
point and that is it is just NN and JT that would run remotely.


So in order to do this would I just install the complete hadoop instance 
on each one.  And then would they be configed as masters?


yes. same config, just run the namenode and jobtracker alone


Or should NameNode and JobTracker run on the same machine?  So there 
would be one master.


they can do that, you may want to separate them later, it depends on the 
size of the filesystem and the NN's memory requirements.


one trick here is to add a couple of DNS entries that initially point to 
the same physical host -namenode and jobtracker; then if you split the 
machines up, change the DNS entries and everyone who reconnects gets the 
new machines -no need to edit every bookmark or config file
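
In config terms that means the workers and clients point at the names 
rather than the physical boxes -something along these lines, with the 
ports being whatever you normally use:

  fs.default.name=hdfs://namenode:8020
  mapred.job.tracker=jobtracker:9001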



So when I start the cluster would I start it from the NN/JT machine.  
Could it also be started from any of the other cluster members.


in a big cluster you would normally use some CM tool or init.d stuff to 
start the processes


All datanodes block until the NN is live; all TTs block for the JT; the 
JT (in 0.21+) blocks waiting for the filesystem.






Re: Linux Flavor

2009-11-03 Thread Steve Loughran

Tom Wheeler wrote:


Based on what I've seen on the list, larger installations tend to use
RedHat Enterprise Linux or one of its clones like CentOS.


One other thing to add is that a large cluster is not the place to learn 
linux or solaris or whatever -it helps to have a working knowledge of 
how to look after an OS before getting 80 machines that need to be kept 
in sync. Then forget every technique you learned to manually keep a 
single OS image up to date and learn about large cluster admin techniques


Re: Linux Flavor

2009-11-03 Thread Steve Loughran

Todd Lipcon wrote:

We generally recommend sticking with whatever Linux is already common inside
your organization. Hadoop itself should run equally well on CentOS 5, RHEL
5, or any reasonably recent Ubuntu/Debian. It will probably be OK on any
other variety of Linux as well (eg SLES), though they are less commonly
used.

The reason that RHEL/CentOS is most common for Hadoop is simply that it's
most common for large production Linux deployments in general, and
organizations are hesitant to add a new flavor for no benefit.

Personally I prefer running on reasonably recent Ubuntu/Debian since you get
some new kernel features like per-process IO accounting (for iotop), etc.


I am busy debugging why my laptop doesn't boot since upgrading to ubuntu 
9.10 at the weekend; they've done some odd things to the filesystem to 
produce faster boots that I'm not convinced work.


Re: Regd. Hadoop Implementation

2009-10-29 Thread Steve Loughran

shwitzu wrote:

Thanks for Responding,

I read about HDFS and understood how it works and I also installed hadoop in
my windows using cygwin and tried a sample driver code and made sure it
works.

But my concern is, given the problem statement how should I proceed

Could you please give me some clue/ pseudo code or a design.



I would start the design process the way you would with any other large 
project


-understand the problem
-get an understanding of the available solution space -and their limitations
-come up with one or more possible solutions
-identify the risky bits of the system, the assumptions you may have, 
the requirements you have of other things, the bits you dont' really 
understand
-prototype something that tests those assumptions, acts as a first demo 
of what is possible -or an immediate

-start with the basic tests and automated deployment
-evolve

Plus all the scheduling stuff that goes with it

Asking a mailing list for pseudo code or a design is doomed. Really. 
This is a major distributed application and you need to be thinking 
about it at scale, and you need to understand both the user needs and 
the capabilities of the underlying technologies. Nobody else on this 
list understands the needs, and there is no way for you to be sure that 
any of us understand the technologies. Which means there is no way you 
can trust any of us to produce a design that works -even if anyone was 
prepared to sit down and do your work for you.


sorry, but that's not how open source communities tend to work. Help 
with problems, bugs, yes. Design your application -no. You are on your 
own there.


-steve


Re: editing etc hosts files of a cluster

2009-10-21 Thread Steve Loughran

David B. Ritch wrote:

Most of the communication and name lookups within a cluster refer to
other nodes within that same cluster.  It is usually not a big deal to
put all the systems from a cluster in a single hosts file, and rsync it
around the cluster.  (Consider using prsync, which comes with pssh,
http://www.theether.org/pssh/, or your favorite cluster management
software.)
Editing each individually clearly doesn't scale; but editing it once and
replicating it does.

Is a large hosts file less efficient than nscd or a caching DNS server
for nodes within the cluster?



Pro
 * removes the DNS server as a SPOF
 * works on clusters without DNS servers (virtual ones, for example)
 * lets you set up private hostnames ("namenode", "jobtracker") that 
don't change

 * lets you keep the cluster config under SCM

Con
 * harder to push out changes
 * weird errors when your cluster is inconsistent


We could do a lot in Hadoop in detecting and reporting DNS problems; 
contributions here would be very welcome. They are a dog to test though.




Re: editing etc hosts files of a cluster

2009-10-21 Thread Steve Loughran

Allen Wittenauer wrote:

A bit more specific:

At Yahoo!, we had either every server as a DNS slave or a DNS caching
server.  


In the case of LinkedIn, we're running Solaris so nscd is significantly
better than its Linux counterpart.  However, we still seem to be blowing out
the cache too much.  So we'll likely switch to DNS caching servers here as
well. 


the standard hadoop scripts don't tune DNS caching in the JVM, so Hadoop 
doesn't notice DNS entries changing; that adds extra complexity to the 
DNS-lookup-failure class of bugs -the situation where the TT and forked 
jobs see different IP addresses for the same hosts
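
(For reference: on the Sun JVM the positive-lookup cache can be bounded 
via networkaddress.cache.ttl in java.security, or the 
sun.net.inetaddr.ttl system property -e.g. something like 
-Dsun.net.inetaddr.ttl=60 in HADOOP_OPTS. Treat that as a sketch rather 
than a tested recommendation; it isn't something the stock scripts set 
for you.)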


Re: Hardware Setup

2009-10-15 Thread Steve Loughran

Brian Bockelman wrote:

Hey Alex,

In order to lower cost, you'll probably want to order the worker nodes 
without hard drives then buy them separately.  HDFS provides a 
software-level RAID, so most of the reasonings behind buying hard drives 
from Dell/HP are irrelevant - you are just paying an extra $400 per hard 
drive.  I know Dell sells the R410 which has 4 SATA bays; I'm sure Steve 
knows an HP model that has something similar.


I will start the official disclaimer "I make no recommendations about 
hardware" here, so as not to get into trouble. Talk to you reseller or 
account team.


You can get servers with lots of drives in them; DL180 and SL170z are 
names that spring to mind.

Big issues to consider
 * server:CPU ratio
 * power budget
 * rack weight
 * do you ever plan to stick in more CPUs? Some systems take this, 
others don't.

 * Intel vs AMD
 * How much ECC RAM can you afford. And yes, it must be ECC.

Server disks are higher RPM and specced for more hours than "consumer" 
disks; I don't know what that means in terms of lifespan, but the RPM 
translates into bandwidth off the disk.




However, BE VERY CAREFUL when you do this.  From experience, a certain 
large manufacturer (I don't know about Dell/HP) will refuse to ship (or 
sell separately) hard drive trays if you order their machine without 
hard drives.  When this happened to us, we were not able to return the 
machines because they were custom orders.  Eventually, we had to get 
someone to go to the machine shop and build 72 hard drive trays for us.


that is what physics PhD students are for; at least they didn't get a 
lifetime's radiation dose for this job




Worst. Experience. Ever.

So, ALWAYS ASK and make sure that you can buy empty hard drive trays for 
that specific model (or at least that it ships with them).


Brian




On Oct 15, 2009, at 10:48 AM, Alex Newman wrote:


 So my company is looking at only using dell or hp for our
hadoop cluster and a sun thumper to backup the data. The prices are
ok, after a 40% discount, but realistically I am paying twice as much
as if I went to silicon mechanics, and with a much slower machine. It
seems as though the big expense are the disks. Even with a 40%
discount 550$ per 1tb disk seems crazy expensive. Also, they are
pushing me to build a smaller cluster (6 nodes) and I am pushing back
for nodes half the size but having twice as many. So how much of a
performance difference can I expect btwn 12 nodes with 1 xeon 5 series
running at 2.26 ghz 8 gigs of ram with 4 1 tb disks and a 6 node
cluster with 2 xeon 5 series running at 2.26 16 gigs of ram with 8 1
tb disks. Both setups will also have 2 very small sata drives in raid
1 for the OS. I will be doing some stuff with hadoop and a lot of
stuff with HBase. What are the considerations with HDFS performance
with a low number of nodes,etc.





It's an interesting Q as to what is better, fewer nodes with more 
storage/CPU or more, smaller nodes.


Bigger servers
 * more chance of running code near the data
 * less data moved over the LAN at shuffle time
 * RAM consumption can be more agile across tasks.
 * increased chance of disk failure on a node; hadoop handles that very 
badly right now (pre 0.20 -datanode goes offline)


Smaller servers
 * easier to place data redundantly across machines
 * less RAM taken up by other people's jobs
 * more nodes stay up when a disk fails (less important on 0.20 onwards)
 * when a node goes down, less data to re-replicate across the other 
machines


1. I would like to hear other people's opinions,

2. The gridmix 2 benchmarking stuff tries to create synthetic benchmarks 
from your real data runs. Try that, collect some data, then go to your 
suppliers.


-Steve

COI disclaimer signature:
---
Hewlett-Packard Limited
Registered Office: Cain Road, Bracknell, Berks RG12 1HN
Registered No: 690597 England


Re: detecting stalled daemons?

2009-10-15 Thread Steve Loughran

Edward Capriolo wrote:


I know there is a Jira open to add life cycle methods to each hadoop
component that can be polled for progress. I dont know the # off hand.



HDFS-326 https://issues.apache.org/jira/browse/HDFS-326 the code has its 
own branch.


This is still something I'm working on: the code works, all the tests 
work, but there are some quirks with JobTracker startup -now that it 
blocks waiting for the filesystem to come up- that I'm not happy with; I 
need to add some new tests/mechanisms to shut down a service while it is 
still starting up, which includes interrupting the JT and TT.


You can get RPMs with all this stuff packaged up for use from 
http://smartfrog.org/ , with the caveat that it's still fairly unstable.


I am currently work on the other side of the equation, integration with 
multiple cloud infrastructures, with all the fun testing issues that follow:

http://www.1060.org/blogxter/entry?publicid=12CE2B62F71239349F3E9903EAE9D1F0


* The simplest liveness test for any of the workers right now is to hit 
their HTTP pages; it's the classic "happy" test. We can and should extend 
this with more self-tests, some equivalent of Axis's happy.jsp. The nice 
thing about these is they integrate well with all the existing web page 
monitoring tools, though I should warn that the same tooling that tracks 
and reports the health of a four-way app server doesn't really scale to 
keeping an eye on 3000 task trackers. It's not the monitoring, but the 
reporting.


* Detecting failures of TTs and DNs is kind of tricky too; it's really 
the namenode and jobtracker that know best. We need to get some 
reporting in there so that when either of the masters think that one of 
their workers is playing up, they report it to whatever plugin wants to 
know.


* Handling failures of VMs is very different from physical machines. You 
just kill the VM and restart a new one. We don't need all the 
blacklisting stuff, just some infrastructure operations and various 
notifications to the ops team.


-steve







Re: NameNode high availability

2009-10-05 Thread Steve Loughran

Isabel Drost wrote:

On Mon, 05 Oct 2009 10:28:58 +0100
Steve Loughran  wrote:

2. Even LGPL and GPL say no need to contribute back if you dont 
distribute the code


Sorry in advance about the nitpicking: IANAL - but AFAIK even LGPL and
GPL do not force you to contribute back. The only thing that GPL does,
is forcing you to distribute the source code (including potential
modifications from your side) along with the binary you created*.



Good point. You only have to publish the source to the customers who 
receive the binaries, not get the changes back in to the codebase.


If you run GPL code on your own servers, you aren't distributing it, so 
the requirements to publish any source don't even kick in.


Now, the ASF license doesn't even require you to publish your changes. 
Even so, because you end up taking on all maintenance costs -including 
testing- it's not something I'd recommend. You might think because you 
have put in six months worth of effort that your code is somehow better 
-and it may be- but the value of that effort will evaporate unless you 
keep investing effort merging the changes in with the code, and there is 
a risk that other people who want the same feature will do something 
different.


In the specific case of Hadoop, it is a very fast moving codebase; as 
someone who does keep his own branch and updates it every few weeks, I 
can say the code split was pretty traumatic all round. Things have stopped moving 
now, but it will have hurt everyone who branched. Whoever wrote an HA 
namenode will not only have to deal with those changes, but the ongoing 
extensions to the NN to improve recovery -the backup namenode, and other 
implications.


I'm steering clear of all that, but I am trying to make it easier to 
manage and configure the various services, so that you can bring them up 
in different ways quite rapidly. Ideally I'd like every worker node to 
handle the failure and migration of the master nodes far more gracefully 
than today.


-steve


Re: Hey Cloudera can you help us In beating Google Yahoo Facebook?

2009-10-05 Thread Steve Loughran

Smith Stan wrote:

Hey Cloudera genius guys .


Sorry, not cloudera. I speak for myself.


I read this

Via Cloudera, Hadoop is currently used by most of the giants in the
space including Google, Yahoo, Facebook (we wrote about Facebook’s use
of Cloudera here), Amazon, AOL, Baidu and more.


I would be doubtful that any on that list use the cloudera distro, 
because once you manage a cluster to the extent you create your own RPMs 
for PXE-preboot and kickstart install then you know what you are doing 
and will be worrying more about the power budget of your datacentre -as 
measured in megawatts- and whether your off-site replication plan is 
copying data to other facilities on different earthquake fault lines 
than about how hadoop-site.xml works.




On.
http://www.techcrunch.com/2009/10/01/hadoop-clusters-get-a-monitoring-client-with-cloudera-desktop/

if this is true can you guys help us beat Y G and F.


This is not much different from saying these companies all use TCP/IP, 
Http, MySQL and Linux, therefore a Linux server running apache and 
mysqld will help you to beat them.


Hadoop is a tool for very large datasets, works best if you can group 
and scan them independently.


* If you do not know what you are doing, it will not help
* if you do not have a sufficiently large dataset, it is not worth the 
effort

 * if you haven't outgrown an RDBMS, stick with the database
* Cloudera are offering to help with running/using hadoop, but they 
aren't going to code your datamining algorithms for you.


see also: http://teddziuba.com/2008/04/im-going-to-scale-my-foot-up-y.html

-Steve


Re: NameNode high availability

2009-10-05 Thread Steve Loughran

Stas Oskin wrote:

Hi.
Intresting - aren't they supposed to contribute back, as Hadoop is open
source?

Regards.

2009/10/3 Otis Gospodnetic 


Related (but not helping the immediate question).  China Telecom developed
something they call HyperDFS.  They modified Hadoop and made it possible to
run a cluster of NNs, thus eliminating the SPOF.

I don't have the details - the presenter at Hadoop World (last round of
sessions, 2nd floor) mentioned that.  Didn't give a clear answer when asked
about contributing it back.

 Otis



1. Apache license says no need to contribute back
2. Even LGPL and GPL say no need to contribute back if you don't 
distribute the code


so no, they don't.

By not distributing their patches, they say "we agree to take on all 
maintenance and testing costs forever"


Their choice.


Re: NameNode high availability

2009-10-02 Thread Steve Loughran

Stas Oskin wrote:

Hi.

The HA service (heartbeat) is running on Dom0, and when the primary
node is down, it basically just starts the VM on the other node. So
there not supposed to be any time issues.

Can you explain a bit more about your approach, how to automate it for example?


* You need to have something -a "resource manager"- keeping an eye on the 
NN from somewhere. Needless to say, that needs to be fairly HA too.


* your NN image has to be ready to go

* when the deployed NN goes away, bring up a new machine with the same 
image, hostname *and IP Address*. You can't always pull the latter off, 
it depends on the infrastructure. Without that, you'd need to bring up 
all the nodes with DNS caching set to a short time and update a DNS entry.


This isn't real HA, it's recovery.


Re: NameNode high availability

2009-10-02 Thread Steve Loughran

Stas Oskin wrote:

Hi.

Could you share the way in which it didn't quite work? Would be valuable

information for the community.



The idea is to have a Xen machine dedicated to NN, and maybe to SNN, which
would be running over DRBD, as described here:
http://www.drbd.org/users-guide/ch-xen.html

The VM will be monitored by heart-beat, which would restart it on another
node when it fails.

I wanted to go that way as I thought it's perfect in case of small cluster,
as then the node can be re-used for other tasks.
Once the cluster grows reasonably, the VM could be migrated to dedicated
machine in live fashion - with minimum downtime.

Problem is, that it didn't work as expected. The Xen over DRBD is just not
reliable, as described. The most basic operation of live domain migration
works only in 50% of cases. Most often the domain migration leaves the DRBD
in read-only status, meaning the domain can't be cleanly shut down - only
killed. This often leads in turn to NN meta-data corruption.


It's probably a quirk of virtualisation -all those clocks and things 
cause trouble for any HA protocol running round the cluster. I would 
not blame Xen, as VMWare and virtualbox are also tricky.


As you have a virtual infrastructure, why not have an image of the 1ary 
NN, ready to bring up on demand when the NN goes down, pointed at a copy 
of the NN datasets?


Re: Advice on new Datacenter Hadoop Cluster?

2009-10-01 Thread Steve Loughran

Brian Bockelman wrote:


* When one disk goes out, the datanode shuts down - meaning that 48 
disks go out.  This is to be fixed in 0.21.0, I think.


That's right, though the NN doesn't report it, and I think once offline, 
that disk stays offline.


There's been discussion on a new JIRA issue regarding hotswap support: 
can I pull a disk out, plug in a new one and expect it to be possible to 
have the replacement brought in? You don't want to shut down 48TB just 
to replace one disk.


To be done properly, you really need a way to tell HDFS what you are 
doing, to make sure that any underreplicated data is pulled off that HDD 
onto the remaining disks, pause, kill the tasktracker. Once the new disk 
is inserted and mounted, it needs to be repopulated with blocks


-steve


Re: Advice on new Datacenter Hadoop Cluster?

2009-10-01 Thread Steve Loughran

Ryan Smith wrote:

I have a question that i feel i should ask on this thread.  Lets say you
want to build a cluster where you will be doing very little map/reduce,
storage and replication of data only on hdfs.  What would the hardware
requirements be?  No quad core? less ram?



Servers with more HDD per CPU, and less RAM. CPUs are a big slice not 
just of capital, but of your power budget. If you are running a big 
datacentre, you will care about that electricity bill.


Assuming you go for 1U with 6 HDD in a 1U box, you could have 6 or 12 TB 
per U, then perhaps a 2-core or 4-core server with "enough" ECC RAM.


* with less M/R work, you could allocate most of that TB to work, leave 
a few hundred GB for OS and logs


* you'd better estimate external load; if the cluster is storing data 
then total network bandwidth will be 3X the data ingress (for 
replication = 3), read costs are that of the data itself. Also, 5 
threads on 3 different machines handling the write and forward process.


* I don't know how much load the datanode JVM would take with, say 11 TB 
of managed storage underneath; that's memory and CPU time.


Is anyone out there running big datanodes? What do they see?

-steve



Re: Advice on new Datacenter Hadoop Cluster?

2009-10-01 Thread Steve Loughran

Kevin Sweeney wrote:

I really appreciate everyone's input. We've been going back and forth on the
server size issue here. There are a few reasons we shot for the $1k price,
one because we wanted to be able to compare our datacenter costs vs. the
cloud costs. Another is that we have spec'd out a fast Intel node with
over-the-counter parts. We have a hard time justifying the dual-processor
costs and really don't see the need for the big server extras like
out-of-band management and redundancy. This is our proposed config, feel
free to criticize :)
Supermicro 512L-260 Chassis $90
Supermicro X8SIL  $160
Heatsink  $22
Intel 3460 Xeon  $350
Samsung 7200 RPM SATA2   2x$85
2GB Non-ECC DIMM  4x$65

This totals $1052. Doesn't this seem like a reasonable setup? Isn't the
purpose of a hadoop cluster to build cheap,fast, replaceable nodes?


Disclaimer 1: I work for a server vendor so may be biased. I will 
attempt to avoid this by not pointing you at HP DL180 or SL170z servers.


Disclaimer 2: I probably don't know what I'm talking about. As far as 
Hadoop concerned, I'm not sure anyone knows what is "the right" 
configuration.


* I'd consider ECC RAM. On a large cluster, over time, errors occur -you 
either notice them or propagate the effects.


* Worry about power, cooling and rack weight.

* Include network costs, power budget. That's your own switch costs, 
plus bandwidth in and out.


* There are some good arguments in favour of fewer, higher end machines 
over many smaller ones.  Less network traffic, often a higher density.


The cloud-hosted vs owned question is an interesting one; I suspect the 
spreadsheet there is pretty complex.


* Estimate how much data you will want to store over time. On S3, those 
costs ramp up fast; in your own rack you can maybe plan to stick in in 
an extra 2TB HDD a year from now (space, power, cooling and weight 
permitting), paying next year's prices for next year's capacity.


* Virtual machine management costs are different from physical 
management costs, especially if you don't invest time upfront on 
automating your datacentre software provisioning (custom RPMs, PXE 
preboot, kickstart, etc). With VMMs you can almost hand-manage an image 
(naughty, but possible), as long as you have a single image or two to 
push out. Even then, I'd automate, but at a higher level, creating 
images on demand as load/availability sees fit.


-Steve




Re: Running Hadoop on cluster with NFS booted systems

2009-09-30 Thread Steve Loughran

Brian Bockelman wrote:


On Sep 30, 2009, at 4:24 AM, Steve Loughran wrote:


Todd Lipcon wrote:
Yep, this is a common problem. The fix that Brian outlined helps a 
lot, but
if you are *really* strapped for random bits, you'll still block. 
This is

because even if you've set the random source, it still uses the real
/dev/random to grab a seed for the prng, at least on my system.


Is there anyway to test/timeout for this on startup and respond?



The amount of available entropy is recorded in this file:

/proc/sys/kernel/random/entropy_avail

That's the number of bytes available in the entropy pool.  From what I 
can see, 200 is considered a low number.  It appears that the issue is 
deep within Java's security stack.  I'm not sure how easy it is to turn 
it into non-blocking-I/O.  If you've got a nice fat paid contract with 
Sun, you might have a chance...


I could certainly do something to get that value and worry if it is low. 
But where to add more?
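
Getting the value is easy enough -here's a rough sketch, Linux-only, 
with the threshold borrowed from the number above rather than any 
science:

  import java.io.BufferedReader;
  import java.io.FileReader;

  public class EntropyCheck {
    public static void main(String[] args) throws Exception {
      // read the kernel's current entropy estimate
      BufferedReader r = new BufferedReader(
          new FileReader("/proc/sys/kernel/random/entropy_avail"));
      int avail = Integer.parseInt(r.readLine().trim());
      r.close();
      if (avail < 200) {
        System.err.println("Warning: entropy pool is low (" + avail
            + "); SecureRandom seeding may block");
      }
    }
  }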


Brian, you are the physicist, do you have any strongly random numbers 
for us to use?


-steve


Re: ask help for hsql conflict problem.

2009-09-30 Thread Steve Loughran




On Tue, Sep 29, 2009 at 4:58 PM, Jianwu Wang  wrote:


Hi there,

  When I have hadoop running (version 0.20.0, Pseudo-Distributed Mode), I
can not start my own java application. The exception complains that
'java.sql.SQLException: failed to connect to url
"jdbc:hsqldb:hsql://localhost/hsqldb". I have to stop hadoop to start my own
java application. Both my application and hadoop use hsqldb. Does anyone
know why and can help me out of this problem?  Or tell me where to look for
hadoop hsqldb connection implementation, like which kind of hsql server mode
is used in hadoop and what is the default url of hsqldb for hadoop? I looked
into org.apache.hadoop.mapred.lib.db package but didn't find any clue.


HSQL is a single-threaded database built into the JDBC driver; >1 app 
cannot use the same database, and indeed >1 thread cannot use it, as every 
JDBC call will block. Use a real database.


Re: Running Hadoop on cluster with NFS booted systems

2009-09-30 Thread Steve Loughran

Todd Lipcon wrote:

Yep, this is a common problem. The fix that Brian outlined helps a lot, but
if you are *really* strapped for random bits, you'll still block. This is
because even if you've set the random source, it still uses the real
/dev/random to grab a seed for the prng, at least on my system.



Is there any way to test/timeout for this on startup and respond?

At the very least, a new JIRA issue should be opened for this with the 
stack trace and workaround, so that people have something to search on


Re: How 'commodity' is 'commodity'

2009-09-30 Thread Steve Loughran

Edward Capriolo wrote:


There are many ways to look at it, but (1) if you are just running Task
Trackers and not a DataNode on your workstations you have 0 data
locality. That is a bad thing because after all hadoop wants to move
the processing close to the data.  If the disks backing the DataNode
and TaskTracker are not fast and multi-threaded that node is not going
to perform well.


It puts more load on the (few) disks and on the network, but if you have 
something CPU intensive -by which I mean something that does a fair amount of work for 
every byte read/written- then adding extra compute-only nodes can be handy.



Now it does sound like your workstations have more processing power
then my test cluster so you might have better results.

Personally, I would probably try Hadoop on windows or colinux instead
of VMWARE. VMWare has to emulate disk drives, kernels, interrupts.
IMHO that overhead was too much, I do not know if anyone has any
numbers on it.


local disk IO is the thing that hurts, because of the emulation and 
because the virtual HDD may not be laid out linearly on the host HDD. 
Network traffic is not quite as bad. With the latest x86 parts, much of 
the virtualisation overhead is eliminated, at least for in-RAM 
app-and-OS work.


What that VM does give is lower sysadmin costs, you don't need to 
install and maintain hadoop-on-windows for everyone, just a shared VM 
that can be updated regularly


In this paper
http://www.cca08.org/papers/Poster10-Simone-Leo.pdf
They mention XEN overhead seems to be 5%. I would think that VMWare
virtualization would perform worse but try it yourself and let me
know!


VMWare on linux does put more system load on than virtualbox, though 
RHEL5.x and Centos5.x when hosted on virtualbox have some timers that are 
continually keeping things busy.


Re: How 'commodity' is 'commodity'

2009-09-29 Thread Steve Loughran

Edward Capriolo wrote:


In hadoop terms commodity means "Not super computer". If you look
around most large deployments have DataNodes with dual quad core
processors 8+GB ram and numerous disks, that is hardly the PC you find
under your desk.


I have 4 cores and 6GB RAM, but only one HDD on the desk. That ram gets 
sucked up by the memory hogs: IDE, firefox and VMWare. The fact that 
VMWare uses less RAM to host an entire OS image than firefox shows that 
you can get away with virtualised work, or that firefox is overweight.


  PID PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18862 20  0 2139m 1.4g  48m S    0 25.0  46:57.81 java
12822 20  0 1050m 476m  18m S    3  8.1  53:34.39 firefox
13037 20  0  949m 386m 365m S   11  6.5 206:34.70 vmware-vmx
20932 20  0  688m 374m  11m S    0  6.4   2:05.18 java


For example, we had a dev cluster with a very modest setup: 5 HP DL
145s, dual core, 4 GB RAM, 2 SATA disks.

I did not do any elaborate testing, but I found that:

One DL180G5 (2x quad core) (8GB ram) with 8 SATA disks crushed the 5
node cluster. So it is possible but it might be more administrative
work than it is worth.


Interesting. Is this CPU or IO intensive work?


Re: How 'commodity' is 'commodity'

2009-09-29 Thread Steve Loughran


"commodity" really means x86 parts, non-RAID storage, no 
infiniband-connected storage array, no esoteric OS -just Linux- and 
commodity gigabit ether, nothing fancy like 10GBE except on a 
heavy-utilised backbone :) With those kind of configurations, you reduce 
your capital costs, leaving you more money to spend on the electricity 
bill. I'd still go for RAID and/or NFS-mounted  RAID for bits of the 
namenode/2ary namenode if you care about the data.


Taeho Kang wrote:

If your "commodity" pc's don't have a whole lot of storage space, then you
would have to run your HDFS datanodes elsewhere. In that case, a lot of data
traffic will occur (e.g. sending data from datanodes to where data
processing occurs), meaning map reduce performance will be slowed down. It's
always good to have the actual data on the same machine where the processing
will occur, or there will be extra network i/o involved.

If you decide to host datanodes on pc's, then you also have to be able to
protect the data. (e.g. make sure people don't accidentally delete data
blocks.)

Well, there are lots and lots of possibilities, and I would like to hear how
your plan goes, too!


I would go for storing data off the desktop machines, and just using 
them as compute nodes -tasktrackers. This reduces the impact of them 
going offline without warning but lets them do useful work. This will 
bump up their bandwidth needs though.


This still leaves you with the problem of configuring the hadoop cluster 
for all these machines, especially if they are different. To work around 
that, why not create a VirtualBox or VMWare OS image containing the 
hadoop binaries and configuration files? Everyone who runs the OS image 
joins the cluster, but as soon as they pause it, that tasktracker goes away.


When run Virtualized, HDD and network IO is slower, but if you are only 
connecting to network storage, that network throttling could be useful, 
it will cut back on LAN bandwidth. CPU performance can often be 
comparable, so if your code is CPU-intensive, this can work


Re: Limiting the total number of tasks per task tracker

2009-09-25 Thread Steve Loughran

Oliver Senn wrote:

Hi,

Thanks for your answer.

I used these parameters. But they seem to limit only the number of 
parallel maps and parallel reduces separately. They do not prevent the 
scheduler from schedule one map and one reduce on the same task tracker 
in parallel.


But that's the problem I'm trying to solve. Having at most one task 
running on a task tracker at any time (never one map and one reduce 
together on one task tracker).




You could do your own scheduler, if you really can't handle the things 
in parallel


Re: 3D Cluster Performance Visualization

2009-09-25 Thread Steve Loughran

Brian Bockelman wrote:
;) Unfortunately, I'm going to go out on a limb and guess that we don't 
want to add OpenGL to the dependency list for the namenode...  The viz 
application actually doesn't depend on the namenode, it uses the datanodes.


Here's the source:
svn://t2.unl.edu/brian/HadoopViz/trunk

The server portion is a bit hardcoded to our site (simply a python 
server); the client application is pretty cross-platform.  I actually 
compile and display the application on my Mac.


Here's how it works:

1) Client issues read() request
2) Datanode services it.  Logs it with log4j
3) One of the log4j appenders is syslog pointing at a separate server
4) Separate log server receives UDP packets; one packet per read()
5) Log server parses packets and decides whether they are within the 
cluster or going to the internet
  - Currently a Pentium 4 throw-away machine; handles up to 4-5k packets 
per second before it starts dropping
6) Each client opens a TCP stream to the server and receives the 
transfer type, source, and dest, then renders appropriately


It's pretty danged close to real-time; the time the client issues the 
read() request to seeing something plotted is on the order of 1 second.


I'd really like to see this on a big (Yahoo, Facebook, any takers?) 
cluster.


Brian




Ok, so this is really an example of a datacentre back-end for Log4J, 
pushing out UDP packets to something else in the datacentre.  A nice 
side-line to the classic hadoop management displays. Add something about 
jobs executing and you are laughing. Do it all in Java3D and you even 
have cross-platformness.
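
For anyone who wants to copy step 3 of that setup, the log4j side is 
roughly this -a sketch using log4j 1.2's SyslogAppender, where the host 
name is a placeholder and the logger you attach it to will depend on 
your Hadoop version and which events you want:

  log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
  log4j.appender.SYSLOG.SyslogHost=vizhost.example.org
  log4j.appender.SYSLOG.Facility=LOCAL1
  log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
  log4j.appender.SYSLOG.layout.ConversionPattern=%p %c: %m%n
  # attach it to whichever datanode logger carries the read events
  log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode=INFO,SYSLOG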


Re: local node Quotas (for an R&D cluster)

2009-09-25 Thread Steve Loughran

Paul Smith wrote:


On 25/09/2009, at 8:55 PM, Steve Loughran wrote:



I'd love to see more direct Log4J/Hadoop integration, such as a 
standardised log4j-in-hadoop format that was easily readable, included 
stack traces on exceptions, etc, and came with some sample mapreducer 
or pig scripts to analyse.



I have been mulling over just that sort of thing.  Some sort of 
HadoopAppender that outputs files in the SequenceFile format and 
periodically submits to a DFS node.  Hey,   this is something I _can_ 
contribute!  I might start another thread.  Thanks.


Paul


Have a look at the Chukwa work, think how you could use that style of 
distributed aggregation -and what analysis you could do with it. One 
thing to consider is there is no reason why you can't use Chukwa on HDFS 
to monitor non-hadoop applications -anything that runs in the datacentre 
is a source of log data, whether it is Tomcat or Jetty, or the back end 
code.


We welcome your contribution. I even volunteer to commit the code, 
if it doesn't get vetoed by everyone else.


One big difference between Hadoop and, say,  Log4J, is that Facebook and 
Yahoo! know that Hadoop is a Line-of-Business application. If any patch 
were to lose their data, they would cease to exist. So they worry about 
patches in a way that no other ASF project I've ever come across does. 
There's no correction of spelling mistakes in variable names here, not 
without a proper patch and review.  That makes it less agile than many 
other apache projects, but what it does ensure is that the overall 
system is good to use, that you really can trust your many TB of data to 
the filesystem. Nothing else in apache-land has this reponsibility. 
HTTPD has security reponsibilities, but note even that project has to 
worry about preserving 14PB of file system data during an upgrade.


-steve


Re: local node Quotas (for an R&D cluster)

2009-09-25 Thread Steve Loughran

Paul Smith wrote:


On 25/09/2009, at 3:57 PM, Allen Wittenauer wrote:





On 9/24/09 7:38 PM, "Paul Smith"  wrote:


"I think this could be one of these "If we build it, they will come"
issues. most of the Hadoop committers are working in large scale
homogenous environments (lucky them).


It should be noted that Joydeep was (and still is, AFAIK) at Facebook 
and I

was at Yahoo! at the time.  So that should tell you something. ;)



eek. Our plan is doomed.

I guess a secondary question goes to you as to how one could possibly 
manage large clusters with differing space requirements?  What does this 
feature mean to you with your background?  Here's little ole me thinking 
"this is probably such a dumb question to ask all these smart people who 
work at large sites".



Brian Bockelman's "Using Hadoop as a Grid Storage Element" talks about 
an interesting setup here:

http://www.iop.org/EJ/article/1742-6596/180/1/012047/jpconf9_180_012047.pdf

Some nodes: 80GB, others 48TB. That's heterogeneous, and where %-free 
parameters don't cut it.




I guess I come from a place whereby things grow incrementally, we can't 
possibly afford to go buy a dozen racks of servers in one hit, so we 
would add resource incrementally, and provision their usage as needed, 
so this feature gives us some power (even though initially we'd probably 
go with the standard use cases).  As I write that perhaps other large 
clusters have been formed the same way. "We started with a dozen, and 
then somehow managed to get to 100".  Those early servers in the cluster 
are likely no longer the same configuration, but if one is prepared to 
give away the whole node still, the standard "fill it up till it runs 
out of space" still holds, the reserved space metric is there to allow 
enough space for MR programs to spill to disk.


Hadoop loves homogeneity and stability, it's in the task slots 
allocation, the scheduling, as well as the HDFS. I'd like to see more 
support for "evolved" clusters, and for clusters where short-lived TT's 
can be brought in nearby, TTs with no HDFS storage at all, and which can 
be decomissioned cleanly (Jt stops scheduling work), or taken down by 
the infrastructure with no  warning whatsoever.


But before I get involved in those issues, I need to
* get my lifecycle patch in for easy shutdown and restart with new 
configurations


* more dynamic configuration support. That means more explicit config 
sources than just static xml files, and more understanding in the nodes 
that config may change. For example, when the JT spins waiting for the 
namenode to come up it should reread the namenode URL every iteration, 
in case it has changed; the Task Trackers and Datanodes should do the same.


Your needs are related to mine and many others, it's just everyone has 
different priorities.




I'm still new to Hadoop and learning the terminology, so it's still not 
100% clear where non-DFS, but -still-Hadoop disk usage comes into play 
(is there a good page in this? I have the Definitive book, but that 
doesn't go into this sort of detail).  I'm a log4j committer, so I'd 
love to think I can give back to Hadoop, but I still expect this is 
outside my capabilities at this stage.


I'd love to see more direct Log4J/Hadoop integration, such as a 
standardised log4j-in-hadoop format that was easily readable, included 
stack traces on exceptions, etc, and came with some sample mapreducer or 
pig scripts to analyse.


-Steve


Re: HADOOP-4539 question

2009-09-21 Thread Steve Loughran

Edward Capriolo wrote:



Just for reference. Linux HA and some other tools deal with the split
brain decisions by requiring a quorum. A quorum involves having a
third party or having more then 50% of the nodes agree.

An issue with linux-ha and hadoop is that linux-ha is only
supported/tested on clusters of up to 16 nodes. 


Usually odd numbers; stops a 50%-50% split.


That is not a hard
limit, but no one claims to have done it on 1000 or so nodes.


If the voting algorithm requires communication with every node then 
there is an implicit limit.




You
could just install linux HA on a random sampling of 10 nodes across
your network. That would in theory create an effective quorum.






There are other HA approaches that do not involve DRBD. One is to store
your name node table on a SAN or an NFS server. Terracotta is another
option that you might want to look at. But no, at the moment there is
no fail-over built into hadoop.


Storing the only copy of the NN data on NFS would make the NFS server 
an SPOF, and you still need to solve the problems of:
-detecting NN failure and deciding who else is in charge
-making another node the NN by giving it the same hostname/IPAddr as the 
one that went down.


That is what the linux HA stuff promises

-steve


Re: Can not stop hadoop cluster ?

2009-09-21 Thread Steve Loughran

Jeff Zhang wrote:

My cluster has running for several months.


Nice.


Is this a bug of hadoop? I think hadoop is supposed to run for long time.


I'm doing work in HDFS-326 on making it easier to start/stop the various 
hadoop services; once the lifecycle stuff is in I'll worry more about 
the remote management and configuration aspects.



And will I lose data if I manually kill the process ?


-If you let all the work run out before  killing the jobtracker, then no 
work in progress will be lost


for the filesystem I'd say "probably not". Have you been running a 
secondary namenode? If so, restart time will be less than if you were not.


If it is just the PID files missing, you could recreate them by
-working out the process ID (via jps -v )
-creating the files by hand
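
e.g., if jps said the JobTracker was pid 12345, and assuming the default 
HADOOP_PID_DIR of /tmp and an identity string of "hadoop" -both of which 
may differ on your install- that would be something like:

  echo 12345 > /tmp/hadoop-hadoop-jobtracker.pid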

-steve




Re: Hadoop on Windows

2009-09-17 Thread Steve Loughran

Bill Habermaas wrote:

It's interesting that Hadoop, being written entirely in Java, has such a
spotty reputation running on different platforms. I had to patch it to run
on AIX and need cygwin (gack!) so it will run on Windows. I'm surprised
nobody has thought about removing its use of bash to run system commands
(which is NOT especially portable). Now that Hadoop comes only in a
Java 1.6 flavor, why can't it figure out disk space using the native java
runtime instead of executing the DF command under bash? Of course it runs
other system commands as well which in my opinion isn't too cool. 
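
(For reference, Java 6's java.io.File does expose disk-space queries; a 
minimal sketch of a pure-Java probe -this is not what Hadoop 0.20 
actually does, which still shells out to df/du:)

  import java.io.File;

  // Pure-Java disk space probe using the Java 6 File API.
  public class DiskSpaceProbe {
      public static void main(String[] args) {
          File dir = new File(args.length > 0 ? args[0] : ".");
          System.out.println("total  bytes: " + dir.getTotalSpace());
          System.out.println("free   bytes: " + dir.getFreeSpace());
          System.out.println("usable bytes: " + dir.getUsableSpace());
      }
  }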




It is run at scale on big linux systems, and they are the ones that 
encounter problems with 16GB heaps and exec(), various other JVM quirks 
that lead the developers to say Linux + Sun JVM only. You are free to 
use other operating systems and even JVMs (I've used JRockit with some 
minor logging problems in test runs), but you get to encounter the 
problems. You can and should submit patches back, but if you diverge 
from the approved standard, you get to retest at scale, because nobody 
else is going to do it for you.


Supporting different unix versions is much easier than supporting 
windows+linux/unix, especially if you are trying to do high availability 
stuff, integrate with management tools, etc. I think it would be nice if 
Hadoop would build and run standalone on Windows without cygwin, but 
beyond that, a more ruthless "Unix-ish only" policy would be harsh yet 
would make it easier to manage problems.


Even in a Linux-only world, you are left with the "which distro?" 
question -were there to be official apache Hadoop RPMs and .deb files, 
there'd be discussions about which platforms to support. RHEL+Centos 5.X 
would be the obvious choice, but what else?


-steve


Re: Stretched HDFS cluster

2009-09-17 Thread Steve Loughran

Edward Capriolo wrote:




On a somewhat related topic I was showing a co-worker a Hadoop setup
and he asked, "What if we got a bunch of laptops on the
internet like the playstation 'Folding @ Home'" of course these are
widely different distributed models.

I have been thinking about this. Assume:
Throw the data security out the window, and assume everyone is playing fair.
Assume we have systems with a semi-dedicated IP, like my cable
internet, with no inbound/outbound restrictions.
Assume every computer is its own RACK.
Unlike a LAN's very low latency, assume that latency is more like 40 ms.
Assume we tune up replication to say 10 or higher to deal with drop-ons/drop-offs.

Could we run a 100 node cluster? If not, what is stopping it from working?

My next question. For fun, does anyone want to try? We could setup
IPTABLES/firewall allowing hadoop traffic from IP's in the experiment.
I have two nodes in Chicago, US ISP to donate. Even if we get 10 nodes
that would be interesting as a benchmark.



you could see about getting planet-lab time for something like this, 
though the way they allocate partially virtualized slices, you can't be 
100% sure that you get the ports you ask for. This will complicate binding.


Re: Hadoop on Windows

2009-09-17 Thread Steve Loughran

brien colwell wrote:

Our cygwin/windows nodes are picky about the machines they work on. On
some they are unreliable. On some they work perfectly.

We've had two main issues with cygwin nodes.

Hadoop resolves paths in strange ways, so for example /dir is
interpreted as c:/dir not %cygwin_home%/dir. For SSH to a cygwin node,
/dir is interpreted as %cygwin_home%/dir. So our maintenance scripts
have to make a distinction between cygwin and linux to adjust for
Hadoop's path behavior.



That's exactly how any Java File instance works on Windows: 
new File("/dir") maps to c:/dir.


As the Ant team say in their docs
"We get lots of support calls from Cygwin users. Either it is incredibly 
popular, or it is trouble. If you do use it, remember that Java is a 
Windows application, so Ant is running in a Windows process, not a 
Cygwin one. This will save us having to mark your bug reports as invalid. "


Re: Stretched HDFS cluster

2009-09-16 Thread Steve Loughran

Touretsky, Gregory wrote:

Hi,

Does anyone have an experience running HDFS cluster stretched over 
high-latency WAN connections?
Any specific concerns/options/recommendations?
I'm trying to setup the HDFS cluster with the nodes located in the US, Israel 
and India - considering it as a potential solution for cross-site data 
sharing...



I would back up todd here and say "don't do it -yet". I think there are 
some minor placeholders in the rack hierarchy to have an explicit notion 
of different sites, but nobody has done the work yet. Cross datacentre 
data balancing and work scheduling is complex, and all the code in 
Hadoop, zookeeper, etc, is built on the assumption that latency is low, 
all machines clocks are going forward at roughly the same rate, the 
network is fairly reliable, routers are unlikely to corrupt data, etc.


Now, if you do want to do >1 site, it would be a profound and useful 
development -I'd expect the MR scheduler, or even the Pig/Hive code 
generators, to take datacentre locality into account, doing as much 
work per site as possible. The problem of block distribution changes 
too, as you would want 1 copy of each block in the other datacentre. 
Even then, I'd start with sites in a single city, on a MAE or other link 
where bandwidth matters less. Note that (as discussed below) on the MAN 
scale things can start to go wrong in ways that are unlikely in a 
datacentre, and it's those failures that will burn you
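
If anyone did want to experiment, the obvious first hook is the topology 
mapping. A sketch of a site-aware DNSToSwitchMapping (assuming the 
0.20-era single-method interface; the hostname-to-site convention below 
is entirely made up):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.net.DNSToSwitchMapping;

  // Sketch only: encode the datacentre as an extra level in the network path,
  // e.g. /us-east/rack12, so placement code could one day reason about sites.
  public class SiteAwareMapping implements DNSToSwitchMapping {
      public List<String> resolve(List<String> names) {
          List<String> paths = new ArrayList<String>(names.size());
          for (String name : names) {
              // assumed convention: node42.rack12.us-east.example.com
              String[] parts = name.split("\\.");
              if (parts.length >= 3) {
                  paths.add("/" + parts[2] + "/" + parts[1]);
              } else {
                  paths.add("/default-site/default-rack");
              }
          }
          return paths;
      }
  }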


worth reading
http://status.aws.amazon.com/s3-20080720.html
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html

-Steve


Re: Best practices with large-memory jobs

2009-09-16 Thread Steve Loughran

Chris Dyer wrote:

my task logs I see the message:
"attempt to override final parameter: mapred.child.ulimit;  Ignoring."
which doesn't exactly inspire confidence that I'm on the right path.

Chances are the param has been marked final in the task tracker's running
config which will prevent you overriding the value with a job specific
configuration.

Do you have any idea how one unmarks such a thing?  Do I just need to
edit the configuration file for the task tracker?


Depending upon how many tasks per node, that may not be enough. Streaming
jobs eat a crapton (I'm pretty sure that is an SI unit) of memory.  If you

Is there any particular reason for the excessive memory use?  I
realize this is Java, but it's just sloshing data down to my
processes...



Java6u14+ lets you run with "compressed pointers"; everyone is still 
playing with that but it does appear to reduce 64-bit memory use. If 
you were using 32 bit JVMs, stay with them, as even with compressed 
pointers, 64 bit JVMs use more memory per object instance.



How does one change the number of map slots per
node?  I'm a hadoop configuration newbie (which is why I was
originally excited about the Cloudera EC2 scripts...)


From the code in front of my IDE

   maxMapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
   maxReduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);


Those are conf values you have to tune.


Re: hadoop 0.20.0 jobtracker.info could only be replicated to 0 nodes

2009-09-10 Thread Steve Loughran

gcr44 wrote:

Thanks for the response.

I have already tried moving JobTracker to several different ports always
with the same result.


Chandraprakash Bhagtani wrote:

You can try running JobTracker on some other port. This port might be in
use.

--
Thanks & Regards,
Chandra Prakash Bhagtani,

On Thu, Sep 10, 2009 at 2:58 AM, gcr44  wrote:


All,

I'm setting up my first hadoop full cluster.  I did the cygwin thing and
everything works.  I'm having problems with the cluster.

The cluster is five nodes of matched hardware running Ubuntu 8.04.  I
believe I have ssh working properly. The master node is named hbase1, but
I'm not doing anything with hbase.

I run start-dfs.sh, Jps shows NameNode running, and the logs are free of
error.  The data nodes, however, appear to be complaining.  "Retrying
connect to server: hbase1"

I run start-mapred.sh, Jps shows NameNode and JobTracker running.

The namenode log says, "jobtracker.info could only be replicated to 0
nodes,
instead of 1".

The jobtracker log says two things of significance:

1. "It might be because the JobTracker failed to read/write system files
(hdfs://hbase1:3/hdfs/mapred/system/jobtracker.info /
hdfs://hbase1:3/hdfs/mapred/system/jobtracker.info.recover) or the
system  file hdfs://hbase1:3/hdfs/mapred/system/jobtracker.info is
missing!"


don't worry about the job tracker until you have HDFS -that is, namenodes 
and datanodes- up and running. Do you have any datanodes up? Because 
complaints about not enough replication and missing files mean the 
filesystem isn't live yet
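
The quickest checks are "hadoop dfsadmin -report" or the namenode web UI; 
if you'd rather do it from code, a sketch (0.20-era client API, and the 
URI here is hypothetical -use your own fs.default.name):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.hdfs.DistributedFileSystem;
  import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

  // Ask the namenode how many datanodes have registered with it.
  public class DatanodeCount {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
          FileSystem fs = FileSystem.get(conf);
          DatanodeInfo[] nodes = ((DistributedFileSystem) fs).getDataNodeStats();
          System.out.println("datanodes registered: " + nodes.length);
      }
  }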


Re: measuring memory usage

2009-09-10 Thread Steve Loughran

Arvind Sharma wrote:
hmmm... I had seen some exceptions (don't remember which one) on MacOS. There was a missing JSR-223 engine on my machine. 


Not sure why you would see this error on a Linux distribution





From: Ted Yu 
To: common-user@hadoop.apache.org
Sent: Wednesday, September 9, 2009 9:05:57 AM
Subject: Re: measuring memory usage

Linux vh20.dev.com 2.6.18-53.el5 #1 SMP Mon Nov 12 02:22:48 EST 2007
i686 i686 i386 GNU/Linux

/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0


Hadoop isn't supported on anything other than sun-jdk; openjdk is 
slightly different, though less so than the other runtimes.  Get the sun 
packages and see if it goes away. Also, 1.6.0.0 is a few security 
patches ago; 1.6u15 or 1.6u16 is what you should be looking for


Re: Pregel

2009-09-08 Thread Steve Loughran

Ted Dunning wrote:

You would be entirely welcome in Mahout.   Graph based algorithms are key
for lots of kinds of interesting learning and would be a fabulous thing to
have in a comprehensive substrate.

I personally would also be very interested in learning more about what
sorts of things Pregel is doing.  It is relatively easy to build simple
graph algorithms on top of Map-reduce, but these algorithms typically
require a map-reduce iteration to propagate information.  Good algorithms
for that architecture have exponential propagation so that you don't need a
huge number of iterations.  It smelled like Pregel was doing something much
more interesting.



Exactly, it is not pushing bits of the graph around. Instead it has 
partitioned the graph to different machines, and is pushing the work out 
to the relevant bits of the graph, a sort of GraphReduce. I believe, not 
having seen the code myself :)
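
Purely to illustrate the model being guessed at here (this is not 
Google's API, just a sketch of the vertex-centric, superstep style where 
compute() runs wherever the vertex's partition lives and only messages 
cross the wire):

  // Illustrative only -not Pregel's real API. One compute() call per vertex
  // per superstep; messages sent in superstep N are delivered in superstep N+1.
  interface Vertex<M> {
      String id();
      void compute(Iterable<M> incomingMessages);
      void sendMessageTo(String targetVertexId, M message);
      void voteToHalt();
  }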


Re: Datanode high memory usage

2009-09-02 Thread Steve Loughran

Stas Oskin wrote:

Hi.


It would be nice if Java 6 had a way of switching compressed pointers on by
default -the way JRockit 64 bit did. Right now you have to edit every shell
script to start up every program,  hadoop included.  Maybe when jdk7 ships
it will do this by default.



Does it give any memory benefits on Datanode as well, in addition to
Namenode?


All we have so far is in  https://issues.apache.org/jira/browse/HDFS-559

If you are willing to work out the memory savings in the datanode, with 
different block sizes and file counts, it would be welcome.




 Also, just how much RAM is gained by this setting?



Again, experimentation is needed. All we can say is that the 
datastructures get smaller, but not as small as on 32-bit systems, as 
every instance is aligned to 8-byte/64-bit boundaries.


Re: Datanode high memory usage

2009-09-02 Thread Steve Loughran

Allen Wittenauer wrote:

On 9/2/09 3:49 AM, "Stas Oskin"  wrote:

It's a Sun JVM setting, not something Hadoop will control.  You'd have to
turn it on in hadoop-env.sh.



Question is whether Hadoop will include this as standard, if it indeed has 
such benefits.


We can't do this, because if you then try to bring up Hadoop on a JVM without 
this option (currently, all OS/X JVMs), your java program will not start.




Hadoop doesn't have a -standard- here, it has a -default-.  JVM settings are
one of those things that should just automatically be expected to be
adjusted on a per installation basis.  It is pretty much impossible to get
it correct for everyone.  [Thanks Java. :( ]



It would be nice if Java 6 had a way of switching compressed pointers on 
by default -the way JRockit 64 bit did. Right now you have to edit every 
shell script to start up every program,  hadoop included.  Maybe when 
jdk7 ships it will do this by default.


Re: Cloudera Video - Hadoop build on eclipse

2009-09-01 Thread Steve Loughran

ashish pareek wrote:

Hello Bharath,

   Earlier even I faced the same problem. I think you are
accessing the internet through a proxy. So try using a direct broadband connection.
Hope this will solve your problem.


or set Ant's proxy up
http://ant.apache.org/manual/proxy.html




Ashish Pareek

On Fri, Aug 28, 2009 at 4:46 PM, bharath vissapragada <
bharathvissapragada1...@gmail.com> wrote:


Hi all,

I am trying to build hadoop on eclipse with the help of the Cloudera Video on
its site. I have successfully checked out from the hadoop svn. Then the
problem is I am not able to build the file "build.xml" using ant build. I am
getting the error

ivy-download:
 [get] Getting:

http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.0.0-rc2/ivy-2.0.0-rc2.jar
 [get] To: /home/rip/workspace/hadoop-trunk/ivy/ivy-2.0.0-rc2.jar
[get] Error getting
http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.0.0-rc2/ivy-2.0.0-rc2.jar
to /home/rip/workspace/hadoop-trunk/ivy/ivy-2.0.0-rc2.jar

BUILD FAILED
/home/rip/workspace/hadoop-trunk/build.xml:1174: java.net.ConnectException:
Connection timed out

Total time: 3 minutes 10 seconds





Re: NN memory consumption on 0.20/0.21 with compressed pointers/

2009-08-25 Thread Steve Loughran

Brian Bockelman wrote:


On Aug 24, 2009, at 5:42 AM, Steve Loughran wrote:


Raghu Angadi wrote:

Suresh had made a spreadsheet for memory consumption.. will check.
A large portion of NN memory is taken by references. I would expect 
memory savings to be very substantial (same as going from 64bit to 
32bit), could be on the order of 40%.
The last I heard from Sun was that compressed pointers will be in 
very near future JVM (certainly JDK 1.6_x). It can use compressed 
pointers up to 32GB of heap.


It's in JDK 1.6u14. Looking at the source and reading the specs 
implies there are savings, but we need to experiment to see. I now know 
how to do sizeof() in java (in the instrumentation API), so these 
experiments are possible




Hey Steve,

I'm a bit dumb with Java (so this might be something you already know), 
but last week I discovered the "jhat" tool.  You can dump the Java heap 
with JMX to a file, then use jhat to build a little webserver that 
allows you to explore your heap.


One page it provides is a table histogram of the instance counts and # 
of bytes per class.


I've worked out how to do sizeof, and got my first results during last 
night's London HUG event;

all the data is going into https://issues.apache.org/jira/browse/HDFS-559




This helped me a lot when I was trying to track memory leaks in libhdfs.


I would expect runtime overhead on NN would be minimal in practice.



I think there's a small extra deref cost, but it's very minimal; a 
logical shift left (to undo the 8-byte alignment), possibly also an 
addition. Both of which run at CPU-speeds, not main memory bus rates


One interesting tidbit (from my memory of a presentation 2 months ago... 
I might have the numbers wrong, but the general message is the same):


On petascale-level computers, the application codes' CPU instructions 
are about 10% floating point (that is, in scientific applications, there 
are less floating point instructions than in most floating point 
benchmarks).  Of the remaining instructions, about 1/3 are 
memory-related and 2/3 are integer.  Of the integer instructions, 40% 
are computing memory locations.
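
(Working those remembered figures through does land at roughly 50%; a 
throwaway sanity check, using the numbers exactly as quoted:)

  // Sanity-checking the quoted instruction mix.
  public class InstructionMix {
      public static void main(String[] args) {
          double fp = 0.10;                        // floating point share
          double rest = 1.0 - fp;                  // 0.90 non-FP
          double memoryOps = rest / 3.0;           // ~0.30 of all instructions
          double integerOps = rest * 2.0 / 3.0;    // ~0.60 of all instructions
          double addressCalc = integerOps * 0.40;  // ~0.24 computing addresses
          System.out.println(memoryOps + addressCalc);  // ~0.54, i.e. roughly half
      }
  }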


cool. I wonder what percentage is vtable/function lookup in OO code 
versus data retrieval? After all, every time you read or write an 
instance field in a class instance, there is the this+offset maths 
before the actual retrieval, though there is a fair amount of CPU 
support for such offset operations




So, on the biggest DOE computers, about 50% of the CPU time is spent on 
memory-related computations.  I found this pretty mind-boggling when I 
learned this.  It seems to me that the "central" part of the computer is 
becoming the bus, not the CPU.


welcome to the new bottlenecks.





Re: NN memory consumption on 0.20/0.21 with compressed pointers/

2009-08-24 Thread Steve Loughran

Scott Carey wrote:


The implementation in JRE 6u14 uses a shift for all heap sizes; the
optimization to remove that for heaps less than 4GB is not in the hotspot
version there (but will be later).


OK. I've been using JRockit 64 bit for a while, and it did a check on 
every pointer to see if it was real or relative, then an add, so I 
suspect its computation was more complex. The sun approach seems better



The size advantage is there either way.

I have not tested an app myself that was not faster using
-XX:+UseCompressedOops on a 64 bit JVM.
The extra bit shifting is overshadowed by how much faster and less frequent
GC is with a smaller dataset.


Excellent.

You get better cache efficiency too -fewer cache misses- and you save on 
memory bandwidth


Re: NN memory consumption on 0.20/0.21 with compressed pointers/

2009-08-24 Thread Steve Loughran

Raghu Angadi wrote:


Suresh had made a spreadsheet for memory consumption.. will check.

A large portion of NN memory is taken by references. I would expect 
memory savings to be very substantial (same as going from 64bit to 
32bit), could be on the order of 40%.


The last I heard from Sun was that compressed pointers will be in very 
near future JVM (certainly JDK 1.6_x). It can use compressed pointers 
up to 32GB of heap.


It's in JDK 1.6u14. Looking at the source and reading the specs implies 
there are savings, but we need to experiment to see. I now know how to do 
sizeof() in java (in the instrumentation API), so these experiments are 
possible
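
For anyone wanting to repeat the experiment, a minimal sketch of the 
instrumentation-API approach (class and jar names here are made up; the 
jar's manifest has to name the agent class in Premain-Class, and you 
pass it with -javaagent):

  import java.lang.instrument.Instrumentation;

  // Minimal -javaagent giving shallow object sizes via Instrumentation.
  // Referenced objects are not followed; walk the object graph yourself
  // if you want deep sizes.
  public class SizeOfAgent {
      private static volatile Instrumentation inst;

      public static void premain(String agentArgs, Instrumentation instrumentation) {
          inst = instrumentation;
      }

      public static long sizeOf(Object o) {
          return inst.getObjectSize(o);
      }
  }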




I would expect runtime overhead on NN would be minimal in practice.



I think there's a small extra deref cost, but it's very minimal; a 
logical shift left (to undo the 8-byte alignment), possibly also an 
addition. Both of which run at CPU-speeds, not main memory bus rates

