Re: [help]how to stop HDFS

2011-11-30 Thread Steve Loughran

On 30/11/11 04:29, Nitin Khandelwal wrote:

Thanks,
I missed the sbin directory, was using the normal bin directory.
Thanks,
Nitin

On 30 November 2011 09:54, Harsh J <ha...@cloudera.com> wrote:


Like I wrote earlier, it's in the $HADOOP_HOME/sbin directory, not the
regular bin/ directory.

On Wed, Nov 30, 2011 at 9:52 AM, Nitin Khandelwal
nitin.khandel...@germinait.com  wrote:

I am using Hadoop 0.23.0
There is no hadoop-daemon.sh in bin directory..



I found the 0.23 scripts to be hard to set up and get working:

https://issues.apache.org/jira/browse/HADOOP-7838
https://issues.apache.org/jira/browse/MAPREDUCE-3430
https://issues.apache.org/jira/browse/MAPREDUCE-3432

I'd like to see what Bigtop will offer in this area, as their test 
process will involve installing onto system images and walking through 
the scripts. The basic Hadoop tars assume your system is well configured 
and that you know how to do this -and debug problems.


Re: How is network distance for nodes calculated

2011-11-23 Thread Steve Loughran

On 22/11/11 21:04, Edmon Begoli wrote:

I am reading Hadoop Definitive Guide 2nd Edition and I am struggling
to figure out Hadoop's exact
formula for network distance calculation (page 64/65). (I
have my guesses, but I would like to know the exact formula)


It's implemented in org.apache.hadoop.net.NetworkTopology.

It measures the number of network hops to get there: 2 = n1 -> switch1 -> n2,
and so on.
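Roughly, the distance is the number of hops from each node up to their 
closest common ancestor in the topology tree. A quick sketch against the 
real class (the rack paths here are made up):

  import org.apache.hadoop.net.NetworkTopology;
  import org.apache.hadoop.net.NodeBase;

  public class DistanceDemo {
    public static void main(String[] args) {
      NetworkTopology topology = new NetworkTopology();
      NodeBase n1 = new NodeBase("/rack1/node1");
      NodeBase n2 = new NodeBase("/rack1/node2");
      NodeBase n3 = new NodeBase("/rack2/node3");
      topology.add(n1);
      topology.add(n2);
      topology.add(n3);
      // hops up to the common ancestor from each side, added together
      System.out.println(topology.getDistance(n1, n1)); // 0: same node
      System.out.println(topology.getDistance(n1, n2)); // 2: same rack
      System.out.println(topology.getDistance(n1, n3)); // 4: different racks
    }
  }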


Re: Adding a new platform support to Hadoop

2011-11-17 Thread Steve Loughran

On 17/11/11 15:02, Amir Sanjar wrote:

Is there any specific development, build, and packaging guidelines to add
support for a new hardware platform, in this case PPC64, to hadoop?

Best Regards
Amir Sanjar

Linux System Management Architect and Lead
IBM Senior Software Engineer
Phone# 512-286-8393
Fax#  512-838-8858



this is something to take up on the -dev lists, not the user lists, 
especially  common-...@hadoop.apache.org


One problem with any platform is the native code: nobody but you is 
going to build or test it. The only JVM currently recommended is the Sun 
JVM, so again, you will get to test there. This means you are going to 
have to be active testing releases against your target platform. 
Otherwise it will languish in the not really meant to be used in 
production category of things.


The apache releases (which are meant to be source distributions anyway; 
the binary artifacts are just an extra), but you will need to work with 
the dev team to make sure the native libraries build properly


Re: Cannot access JobTracker GUI (port 50030) via web browser while running on Amazon EC2

2011-10-25 Thread Steve Loughran

On 24/10/11 23:46, Mark question wrote:

Thank you, I'll try it.
Mark

On Mon, Oct 24, 2011 at 1:50 PM, Sameer Farooqui <cassandral...@gmail.com> wrote:


Mark,

We figured it out. It's an issue with RedHat's IPTables. You have to open
up those ports:


vim /etc/sysconfig/iptables


Of course, if you open up the cluster ports to everyone, that means 
everyone else with an IPv4 address can get at them. SSH tunnelling is a 
better tactic.




Re: execute hadoop job from remote web application

2011-10-20 Thread Steve Loughran

On 18/10/11 17:56, Harsh J wrote:

Oleg,

It will pack up the jar that contains the class specified by
setJarByClass into its submission jar and send it up. That's the
function of that particular API method. So, your deduction is almost
right there :)

On Tue, Oct 18, 2011 at 10:20 PM, Oleg Ruchovets <oruchov...@gmail.com> wrote:

So you mean that in case I am going to submit job remotely and
my_hadoop_job.jar
will be in class path of my web application it will submit job with
my_hadoop_job.jar to
remote hadoop machine (cluster)?




There's also the problem of waiting for your work to finish. If you want 
to see something complicated that does everything but JAR upload, I have 
some code here that listens for events coming out of the job and so 
builds up a history of what is happening. It also does better preflight 
checking of source and dest data directories


http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/mapreduce/submitter/SubmitterImpl.java


Re: automatic node discovery

2011-10-18 Thread Steve Loughran

On 18/10/11 10:48, Petru Dimulescu wrote:

Hello,

I wonder how you guys see the problem of automatic node discovery:
having, for instance, a couple of hadoops, with no configuration
explicitly set whatsoever, simply discover each other and work together,
like Gridgain does: just fire up two instances of the product, on the
same machine or on different machines in the same LAN, and they will use
multicast or whatever to discover each other


You can use techniques like Bonjour to have Hadoop services register 
themselves in DNS and locate each other that way, but things only need to 
discover the NN and JT and report in.


 and to be a part of a

self-discovered topology.



Topology inference is an interesting problem. Something purely for 
diagnostics could be useful.



Of course, if you have special network requirements you should be able
to specify undiscoverable nodes by IP or name but often grids are
installed on LANs and it should really be simpler.


In a production system I'd have a private switch and isolate things for 
bandwidth and security; this is why auto-configuration is generally 
neglected. If it were to be added, it would go via ZooKeeper, leaving 
only the ZooKeeper discovery problem. You can't rely on DNS or multicast 
IP here, as they don't always work in virtualised environments.
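For illustration only -this isn't anything in the codebase- the discovery 
half of that would be tiny: the NN registers an ephemeral znode carrying 
its address, and workers read it. A minimal sketch, ignoring 
connection-state handling; the quorum address, znode path and NN address 
are all made up:

  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.WatchedEvent;
  import org.apache.zookeeper.Watcher;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class NamenodeDiscovery {
    public static void main(String[] args) throws Exception {
      Watcher noop = new Watcher() {
        public void process(WatchedEvent event) { /* ignore session events */ }
      };
      ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, noop);
      // NN side: advertise the RPC address; the ephemeral node vanishes
      // automatically if the NN's session dies
      zk.create("/namenode", "nn-host:8020".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      // worker side: look it up
      byte[] addr = zk.getData("/namenode", false, null);
      System.out.println("namenode is at " + new String(addr));
    }
  }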




Namenodes are a bit different, they should use safer machines, I'm
basically talking about datanodes here, but still I wonder how hard can
it be to have self-assigned namenodes, maybe replicated automatically on
several machines, unless one specific namenode is explicitly set via xml
configuration.


I wouldn't touch dynamic namenodes; you really need fixed NNs and 2NNs, 
and as automatic NN replication isn't there, it's a non-issue.


With fixed NN and JT entries in the DNS table, anything can come up in 
the LAN and talk to them unless you set up the master nodes with lists 
of things you trust.




Also, the ssh passwordless thing is so awkward. If you have a network of
hadoop that mutually discover each other there is really no need for
this passwordless ssh requirement. This is more of a system
administrator aspect, if sysadmins want to automatically deploy or start
a program on 5000 machines they often have the tools & skills to do that,
it should not be a requirement.


It's not a requirement; there are other ways to deploy. Large clusters 
tend to use cluster management tooling that keeps the OS images 
consistent, or you can use more devops-centric tooling (including Apache 
Whirr) to roll things out.


Re: execute hadoop job from remote web application

2011-10-18 Thread Steve Loughran

On 18/10/11 11:40, Oleg Ruchovets wrote:

Hi, what is the way to execute a hadoop job on a remote cluster? I want to
execute my hadoop job from a remote web application, but I didn't find any
hadoop client (remote API) to do it.

Please advise.
Oleg



The Job class lets you build up and submit jobs from any Java process 
that has RPC access to the JobTracker.
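Something like this, using the new (org.apache.hadoop.mapreduce) API; the 
hostnames, ports and paths are placeholders, and the property names are 
the 0.20-era ones:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class RemoteSubmit {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // point the client at the remote cluster
      conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
      conf.set("mapred.job.tracker", "jobtracker.example.com:8021");

      Job job = new Job(conf, "identity-copy");
      job.setJarByClass(RemoteSubmit.class);  // ships the jar containing this class
      job.setMapperClass(Mapper.class);       // identity mapper
      job.setReducerClass(Reducer.class);     // identity reducer
      job.setOutputKeyClass(LongWritable.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, new Path("/data/in"));
      FileOutputFormat.setOutputPath(job, new Path("/data/out"));

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }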


Re: hadoop knowledge gaining

2011-10-10 Thread Steve Loughran

On 07/10/11 15:25, Jignesh Patel wrote:

Guys,
I am able to deploy the first program word count using hadoop. I am interested in 
exploring more about hadoop and Hbase and don't know which is the best way to 
grasp both of them.

I have hadoop in action but it has older api.


Actually the API covered in the 2nd edition is pretty much the one in 
widest use. The newer API is better, but is only complete in Hadoop 
0.21 and later, which aren't yet in wide use.



I do also have Hbase definitive guide which I have not started exploring.


Think of a problem, get some data, go through the books. Statistics 
and data mining are what you really need to learn, more than just the 
Hadoop APIs.


-steve




Re: FileSystem closed

2011-09-30 Thread Steve Loughran

On 29/09/2011 18:02, Joey Echeverria wrote:

Do you close your FileSystem instances at all? IIRC, the FileSystem
instance you use is a singleton and if you close it once, it's closed
for everybody. My guess is you close it in your cleanup method and you
have JVM reuse turned on.



I've hit this in the past. In 0.21+ you can ask for a new instance 
explicitly.


For 0.20.20x, set fs.hdfs.impl.disable.cache to true in the conf, and 
new instances don't get cached.
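For example -the cluster address is a placeholder:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;

  public class FsInstances {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      URI hdfs = URI.create("hdfs://namenode:8020");

      // 0.21+: a private instance; closing it doesn't close anyone else's
      FileSystem privateFs = FileSystem.newInstance(hdfs, conf);
      privateFs.close();

      // 0.20.20x: disable the cache so get() hands back fresh instances
      conf.setBoolean("fs.hdfs.impl.disable.cache", true);
      FileSystem fs = FileSystem.get(hdfs, conf);
      fs.close();
    }
  }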




Re: Hadoop performance benchmarking with TestDFSIO

2011-09-29 Thread Steve Loughran

On 28/09/11 22:45, Sameer Farooqui wrote:

Hi everyone,

I'm looking for some recommendations for how to get our Hadoop cluster to do
faster I/O.

Currently, our lab cluster is 8 worker nodes and 1 master node (with
NameNode and JobTracker).

Each worker node has:
- 48 GB RAM
- 16 processors (Intel Xeon E5630 @ 2.53 GHz)
- 1 Gb Ethernet connection


Due to company policy, we have to keep the HDFS storage on a disk array. Our
SAS connected array is capable of 6 Gb (768 MB) for each of the 8 hosts. So,
theoretically, we should be able to get a max of 6 GB simultaneous reads
across the 8 nodes if we benchmark it.


You're missing the point of Hadoop there; you will end up getting the 
bandwidth of the HDD most likely to fail next, copy replication is 
overkill, and you will reach limits on scale, both technical (SAN 
scalability) and financial.




Our disk array is presenting each of the 8 nodes with a 21 TB LUN. The LUN
is RAID-5 across 12 disks on the array. That LUN is partitioned on the
server into 6 different devices like this:






The file system type is ext3.


set noatime



So, when we run TestDFSIO, here are the results:

*++ Write ++*

hadoop jar /usr/lib/hadoop/hadoop-test-0.20.2-CDH3B4.jar TestDFSIO -write -nrFiles 80 -fileSize 1

11/09/27 18:54:53 INFO fs.TestDFSIO: - TestDFSIO - : write
11/09/27 18:54:53 INFO fs.TestDFSIO: Date & time: Tue Sep 27
18:54:53 EDT 2011
11/09/27 18:54:53 INFO fs.TestDFSIO:Number of files: 80
11/09/27 18:54:53 INFO fs.TestDFSIO: Total MBytes processed: 80
11/09/27 18:54:53 INFO fs.TestDFSIO:  Throughput mb/sec: 8.2742240008678
11/09/27 18:54:53 INFO fs.TestDFSIO: Average IO rate mb/sec:
8.288116455078125
11/09/27 18:54:53 INFO fs.TestDFSIO:  IO rate std deviation:
0.3435565217052116
11/09/27 18:54:53 INFO fs.TestDFSIO: Test exec time sec: 1427.856

So, throughput across all 8 nodes is 8.27 * 80 = 661 MB per second.


*++ Read ++*

hadoop jar /usr/lib/hadoop/hadoop-test-0.20.2-CDH3B4.jar TestDFSIO -read -nrFiles 80 -fileSize 1

11/09/27 19:43:12 INFO fs.TestDFSIO: - TestDFSIO - : read
11/09/27 19:43:12 INFO fs.TestDFSIO: Date & time: Tue Sep 27
19:43:12 EDT 2011
11/09/27 19:43:12 INFO fs.TestDFSIO:Number of files: 80
11/09/27 19:43:12 INFO fs.TestDFSIO: Total MBytes processed: 80
11/09/27 19:43:12 INFO fs.TestDFSIO:  Throughput mb/sec:
5.854318503905489
11/09/27 19:43:12 INFO fs.TestDFSIO: Average IO rate mb/sec:
5.96372652053833
11/09/27 19:43:12 INFO fs.TestDFSIO:  IO rate std deviation:
0.9885505979030621
11/09/27 19:43:12 INFO fs.TestDFSIO: Test exec time sec: 2055.465


So, throughput across all 8 nodes is 5.85 * 80 = 468 MB per second.


*Question 1:* Why are the reads and writes so much slower than expected? Any
suggestions about what can be changed? I understand that RAID-5 backed disks
are an unorthodox configuration for HDFS, but has anybody successfully done
this? If so, what kind of results did you see?






Also, we detached the 8 nodes from the disk array and connected each of them
to 6 local hard drives for testing (w/ ext4 file system). Then we ran the
same read TestDFSIO and saw this:

11/09/26 20:24:09 INFO fs.TestDFSIO: - TestDFSIO - : read
11/09/26 20:24:09 INFO fs.TestDFSIO: Date & time: Mon Sep 26
20:24:09 EDT 2011
11/09/26 20:24:09 INFO fs.TestDFSIO:Number of files: 80
11/09/26 20:24:09 INFO fs.TestDFSIO: Total MBytes processed: 80
11/09/26 20:24:09 INFO fs.TestDFSIO:  Throughput mb/sec:
13.065623285187982
11/09/26 20:24:09 INFO fs.TestDFSIO: Average IO rate mb/sec:
15.160531997680664
11/09/26 20:24:09 INFO fs.TestDFSIO:  IO rate std deviation:
8.000530562022949
11/09/26 20:24:09 INFO fs.TestDFSIO: Test exec time sec: 1123.447


So, with local disks, reads are about 1 GB per second across the 8 nodes.
Much faster!


Much lower cost per TB too. Orders of magnitude lower.



With 6 local disks, writes performed the same though:

11/09/26 19:49:58 INFO fs.TestDFSIO: - TestDFSIO - : write
11/09/26 19:49:58 INFO fs.TestDFSIO: Date & time: Mon Sep 26
19:49:58 EDT 2011
11/09/26 19:49:58 INFO fs.TestDFSIO:Number of files: 80
11/09/26 19:49:58 INFO fs.TestDFSIO: Total MBytes processed: 80
11/09/26 19:49:58 INFO fs.TestDFSIO:  Throughput mb/sec:
8.573949802610528
11/09/26 19:49:58 INFO fs.TestDFSIO: Average IO rate mb/sec:
8.588902473449707
11/09/26 19:49:58 INFO fs.TestDFSIO:  IO rate std deviation:
0.3639466752546032
11/09/26 19:49:58 INFO fs.TestDFSIO: Test exec time sec: 1383.734


Write throughput across the cluster was 685 MB per second.


Writes get streamed to multiple HDFS nodes for redundancy; you've got 
the bandwidth + network overhead and 3x the data.



Options:
 -stop using HDFS on the SAN; it's the wrong approach. Mount the SAN 
directly and use file:// URLs, and let the SAN do the networking and 
redundancy.
 -buy some local HDDs, at least for all the temp data: logs, overspill

Re: Is SAN storage is a good option for Hadoop ?

2011-09-29 Thread Steve Loughran

On 29/09/11 13:28, Brian Bockelman wrote:


On Sep 29, 2011, at 1:50 AM, praveenesh kumar wrote:


Hi,

I want to know can we use SAN storage for Hadoop cluster setup ?
If yes, what should be the best pratices ?

Is it a good way to do considering the fact the underlining power of Hadoop
is co-locating the processing power (CPU) with the data storage and thus it
must be local storage to be effective.
*But also, is it better to say “local is better” in the situation where I
have a single local 5400 RPM IDE drive, which  would be dramatically slower
than SAN storage striped  across many drives spinning at 10k RPM and
accessed via fiber channel ?*


Hi Praveenesh,

Two things:
1) If the option is a single 5400 RPM IDE drive (you can still buy those?) versus high-end SAN, the 
high-end SAN is going to win.  That's often a false comparison: the question is often "What can 
I buy for $50k?".  In that case (setting aside organizational politics), you can buy more 
spindles in the traditional Hadoop setup than for the SAN.
   - Also, if you're latency limited, you're likely working against yourself.  
The best thing I ever did for my organization was make our software work just 
as well with 100ms latency as with 1ms latency.
2) As Paul pointed out, you have to ask yourself whether the SAN is shared or 
dedicated.  Many SANs don't have the ability to strongly partition workloads 
between users..

Brian



One more: SAN is a SPOF. [Gray05] includes the impact of a SAN outage on 
MS TerraServer, while [Jiang08] provides evidence that entry level 
FibreChannel storage is less reliable than SATA due to interconnects.


Anyone who criticises the NameNode for being a SPOF and relies on a SAN 
instead is missing something obvious.


[Gray05] Empirical Measurements of Disk Failure Rates and Error Rates
[Jiang08] Are disks the dominant contributor for storage failures?


Re: difference between development and production platform???

2011-09-28 Thread Steve Loughran

On 28/09/11 04:19, Hamedani, Masoud wrote:

Special Thanks for your help Arko,

You mean in Hadoop, the NameNode, DataNodes, JobTracker, TaskTrackers and all
the clusters should be deployed on Linux machines???
We have lots of data (on windows OS) and code (written in C#) for data
mining; we want to use Hadoop and make a connection between
our existing systems and programs with it.
As you mentioned, we should move all of our data to Linux systems, and
execute existing C# codes in Linux and only use windows for
development same as before.
Am I right?



What is really meant is "nobody runs Hadoop at scale on Windows".

Specifically:
 -there's an expectation that there is a Unix API you can exec
 -some of the operations (e.g. how programs are exec()'d) are optimised 
for Linux
 -everyone tests on 50+ node clusters on Linux.

Why Linux? Stable, low cost. And you can install it on your 
laptop/desktop and develop there too.



Because everyone uses Linux (or possibly a genuine Unix system like 
Solaris), problems encountered in real systems get found on Linux and 
fixed.


If you want to run a production Hadoop cluster on Windows, you are free 
to do so. Just be aware that you may be the first person to do so at 
scale, so you get to find problems first, you get to file the bugs -and 
because you are the only person with these problems and the ability to 
replicate them- you get to fix them.


Nobody is going to say "oh, this patch is for Windows only use, we will 
reject it" -at least provided it doesn't have adverse effects on 
Linux/Unix. It's just that nobody else publicly runs Hadoop on Windows. 
A key first step will be cross-compiling all the native code to Windows, 
which on 0.23+ also means Protocol Buffers. Enjoy.


Where you will find problems is that even on Win64, Hadoop can't 
directly load or run C# apps or anything else written to compile against 
their managed runtime (I forget its name). You will have to bridge via 
streaming, and take a performance hit.


You could also try running the C# code under Mono on Linux; it may or 
may not work. Again, you get to find out and fix the problems -this time 
with the Mono project.


-Steve


Re: hadoop question using VMWARE

2011-09-28 Thread Steve Loughran

On 28/09/11 08:37, N Keywal wrote:

For example:
- It's adding two layers (windows & linux), that can both fail, especially
under heavy workload (and hadoop is built to use all the resources
available). They will need to be managed as well (software upgrades,
hardware support...), it's an extra cost.
- These two layers will use randomly the different resources (HDD,
CPU,network) making issues and performance analysis more complicated.
- there will be a real performance impact. It depends on what you do, and
how Windows & vmware are configured, but on my non-optimized laptop I lose
more than 50%. VMWare claims 15% max, but it's without Windows (using direct
ESX)



Where you take a big hit is in disk IO, as what your OS thinks is a disk 
with sequentially stored files is just a single file in the host OS that 
may be scattered round the real HDD. Disk IO goes through too many 
layers. It's often faster to NFS mount the real HDD.


For compute intensive work, the performance hit isn't so bad, at least 
provided you don't swap.



- Last time I checked (a few months ago), vmware was not able to use all the
core & memory of medium sized servers.


Same with VirtualBox, which I like because it is lighter weight.

I use VMs because the infrastructure provides it; things like ElasticMR 
from AWS also offer it. Your code may be slower, but what you get is the 
ability to bring up clusters on a pay-per-hour basis, and the ability to 
vary the #of machines based on the workload/execution plan. If you can 
compensate for the IO hit by renting four more servers, you may still 
come out ahead.


http://www.slideshare.net/steve_l/farming-hadoop-inthecloud


Re: Environment consideration for a research on scheduling

2011-09-26 Thread Steve Loughran

On 23/09/11 16:09, GOEKE, MATTHEW (AG/1000) wrote:

If you are starting from scratch with no prior Hadoop install experience I 
would configure stand-alone, migrate to pseudo distributed and then to fully 
distributed verifying functionality at each step by doing a simple word count 
run. Also, if you don't mind using the CDH distribution then SCM / their rpms 
will greatly simplify both the bin installs as well as the user creation.

Your VM route will most likely work but I can imagine the amount of hiccups 
during migration from that to the real cluster will not make it worth your time.

Matt

-Original Message-
From: Merto Mertek [mailto:masmer...@gmail.com]
Sent: Friday, September 23, 2011 10:00 AM
To: common-user@hadoop.apache.org
Subject: Environment consideration for a research on scheduling

Hi,
in the first phase we are planning to establish a small cluster with few
commodity computer (each 1GB, 200GB,..). Cluster would run ubuntu server
10.10 and a hadoop build from the branch 0.20.204 (I had some issues with
version 0.20.203 with missing libraries:
http://hadoop-common.472056.n3.nabble.com/Development-enviroment-problems-eclipse-hadoop-0-20-203-td3186022.html#a3188567).
Would you suggest any other version?


I wouldn't rush to put Ubuntu 10.x on; it makes a good desktop, but RHEL 
and CentOS are the platforms of choice on the server side.





In the second phase we are planning to analyse, test and modify some of
hadoop schedulers.


The main schedulers used by Y! and FB are fairly tuned for their 
workloads, and not apparently something you'd want to play with. There 
is at least one other scheduler in the contribs/ dir to play with.


The other thing about scheduling is that you may have a faster 
development cycle if, instead of working on a real cluster, you simulate 
it at multiples of real time, using stats collected from your own 
workload by way of the gridmix2 tools. I've never done scheduling work, 
but I think there's some stuff there to do that. If not, it's a possible 
contribution.


Be aware that the changes in 0.23+ will change resource scheduling; that 
may be a better place to do development, with a plan to deploy in 2012. 
Oh, and get on the mapreduce lists, especially the -dev list, to discuss issues.




The information contained in this email may be subject to the export control 
laws and regulations of the United States, potentially
including but not limited to the Export Administration Regulations (EAR) and 
sanctions regulations issued by the U.S. Department of
Treasury, Office of Foreign Asset Controls (OFAC).  As a recipient of this 
information you are obligated to comply with all
applicable U.S. export laws and regulations.



I have no idea what that means, but I am not convinced that reading an 
email forces me to comply with a different country's rules.


Re: Can we replace namenode machine with some other machine ?

2011-09-22 Thread Steve Loughran

On 22/09/11 05:42, praveenesh kumar wrote:

Hi all,

Can we replace our namenode machine later with some other machine. ?
Actually I got a new  server machine in my cluster and now I want to make
this machine as my new namenode and jobtracker node ?
Also Does Namenode/JobTracker machine's configuration needs to be better
than datanodes/tasktracker's ??



1. I'd give it lots of RAM - holding data about many files, avoiding 
swapping, etc.


2. I'd make sure the disks are RAID5, with some NFS-mounted FS that the 
secondary namenode can talk to. This avoids the risk of losing the index, 
which, if it happens, renders your filesystem worthless. If I was really 
paranoid I'd have twin RAID controllers with separate connections to 
disk arrays in separate racks, as [Jiang2008] shows that interconnect 
problems on disk arrays can be more common than HDD failures.


3. If your central switches are at 10 GbE, consider getting a 10 GbE NIC 
and hooking it up directly -this stops the network being the bottleneck, 
though it does mean the server can have a lot more packets hitting it, 
putting more load on it.


4. Leave space for a second CPU and time for GC tuning.


JTs are less important; they need RAM but use HDFS for storage. If your 
cluster is small, the NN and JT can run on the same machine. If you do 
this, set up DNS with two hostnames pointing to the same network address. 
Then if you ever split them off, everyone whose bookmark says 
http://jobtracker won't notice.


Either way: the NN and the JT are the machines whose availability you 
care about. The rest is just a source of statistics you can look at later.


-Steve



[Jiang2008] Are disks the dominant contributor for storage failures?: A 
comprehensive study of storage subsystem failure characteristics. ACM 
Transactions on Storage.




Re: Can we replace namenode machine with some other machine ?

2011-09-22 Thread Steve Loughran

On 22/09/11 17:13, Michael Segel wrote:


I agree w Steve except on one thing...

RAID 5 Bad. RAID 10 (1+0) good.

Sorry this goes back to my RDBMs days where RAID 5 will kill your performance 
and worse...



Sorry, I should have said RAID >=5. The main thing is you don't want the 
NN data lost. Ever.




Re: risks of using Hadoop

2011-09-21 Thread Steve Loughran

On 20/09/11 22:52, Michael Segel wrote:


PS... There's this junction box in your machine room that has this very large 
on/off switch. If pulled down, it will cut power to your cluster and you will 
lose everything. Now would you consider this a risk? Sure. But is it something 
you should really lose sleep over? Do you understand that there are risks and 
there are improbable risks?


We follow the @devops_borat Ops book and have a post-it note on the 
switch saying "not a light switch".


Re: risks of using Hadoop

2011-09-21 Thread Steve Loughran

On 21/09/11 11:30, Dieter Plaetinck wrote:

On Wed, 21 Sep 2011 11:21:01 +0100
Steve Loughran <ste...@apache.org>  wrote:


On 20/09/11 22:52, Michael Segel wrote:


PS... There's this junction box in your machine room that has this
very large on/off switch. If pulled down, it will cut power to your
cluster and you will lose everything. Now would you consider this a
risk? Sure. But is it something you should really lose sleep over?
Do you understand that there are risks and there are improbable
risks?


We follow the @devops_borat Ops book and have a post-it note on the
switch saying "not a light switch"


:D


Also we have a backup 4-port 1Gbe linksys router for when the main 
switch fails. The biggest issue these days is that since we switched the 
backplane to Ethernet over Powerline a power outage leads to network 
partitioning even when the racks have UPS.



see also http://twitter.com/#!/DEVOPS_BORAT


Re: risks of using Hadoop

2011-09-19 Thread Steve Loughran

On 18/09/11 02:32, Tom Deutsch wrote:

Not trying to give you a hard time Brian - we just have different 
users/customers/expectations on us.


Tom, I suggest you read "Apache goes realtime at Facebook" and consider 
how you could adopt those features -and how to contribute them back to 
the ASF. Certainly I'd like to see their subcluster placement policy in 
the codebase.



For anyone doing batch work, I'd take the NN outage problem as an 
intermittent event that happens less often than OS upgrades -it's just 
something you should expect and test before your system goes live -make 
sure your secondary NN is working and you know how to handle a restart.


Regarding the original discussion, 10-15 nodes is enough machines that 
the loss of one or two should be sustainable; with smaller clusters you 
get less benefit from replication (as each failing server loses a higher 
percentage of the blocks), but the probability of a server failure is 
much lower.


You can fit everything into a single rack with a ToR switch running at 1 
gigabit through the rack, or 10 gigabit if the servers have it on the 
mainboards and you can afford the difference, as it may mitigate some of 
the impact of server loss. Do think about expansion here; at least have 
enough ports for the entire rack, and the option of multiple 10 GbE 
interconnects to any other racks you may add later. Single-switch 
clusters don't need any rack topology scripts, so you can skip one 
bit of setup.


As everyone says, you need to worry about namenode failure. You could 
put the secondary namenode on the same machine as the job tracker, and 
have them both write to NFS mounted filesystems. The trick in a small 
cluster is to use some (more than one) of the workers' disk space as 
those NFS mount points.



Risks
 -security: you may want to isolate the cluster from the rest of your 
intranet
 -security: if I could run code on your cluster I could probably get at 
various server ports and read what I wanted. As all MR jobs are running 
code in the cluster, you have to trust people coding at the Java layer. 
 If it's Pig or Hive jobs, life is simpler.
 -data integrity. Get ECC memory, monitor disks aggressively and take 
them offline if you think they are playing up. Run SATA VERIFY commands 
against the machines (in the sg3_utils package).
 -DoS to the rest of the Intranet. Bad code on a medium to large 
cluster can overload the rest of your network simply by making too many 
DNS requests, let alone lots of remote HTTP operations. This should not 
be a risk for smaller clusters.
 -Developers writing code that doesn't scale. You don't have to worry 
about this in a small cluster, but as you scale you will find use of 
choke points (JT counters, shared remote filestores) may cause problems. 
Even excessive logging can be trouble.


-New feature for Ops: more monitoring to learn about. While the NN 
uptime matters, the worker nodes are less important. Don't have the team 
sprint to deal with a lost worker node. That said, for a small cluster 
I'd have a couple of replacement disks around, as the loss of disk would 
have more impact on total capacity.



I've been looking at Hadoop Data Integrity, and now have a todo list 
based on my findings

http://www.slideshare.net/steve_l/did-you-reallywantthatdata

Because your cluster is small, you won't overload your NN even with 
small blocks, or the JT with jobs finishing too fast for it to keep up 
with, so you can use smaller blocks, which should improve data integrity.
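For example, a client-side conf that writes with smaller blocks; 
dfs.block.size is the 0.20-era property name, and 32MB here is just an 
arbitrary choice:

  import org.apache.hadoop.conf.Configuration;

  public class SmallBlocks {
    public static Configuration smallBlockConf() {
      Configuration conf = new Configuration();
      // files created through this conf get 32MB blocks
      conf.setLong("dfs.block.size", 32L * 1024 * 1024);
      return conf;
    }
  }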


Otherwise, the main risk, as people note, is unrealistic expectations. 
Hadoop is not a replacement for a database with ACID transaction 
requirements; even reads are slower than indexed tables. What it is good 
for is very-low-cost storage of large amounts of low-value data, and as 
a platform for the layers above.


-Steve


Re: risks of using Hadoop

2011-09-19 Thread Steve Loughran

On 18/09/11 03:37, Michael Segel wrote:





2) Data Loss.
You can mitigate this as well. Do I need to go through all of the options and 
DR/BCP planning? Sure there's always a chance that you have some Luser who does 
something brain dead. This is true of all databases and systems. (I know I can 
probably recount some of IBM's Informix and DB2 having data loss issues. But 
that's a topic for another time. ;-)



That raises one more point. Once your cluster grows it's hard to back it 
up except to other Hadoop clusters. If you want to survive loss-of-site 
events (power, communications) then you'll need to exchange copies of 
the high-value data between physically remote clusters. But you may not 
need to replicate at 3x remotely, because it's only backup data.



-steve



Re: Hadoop with Netapp

2011-09-01 Thread Steve Loughran

On 25/08/11 08:20, Sagar Shukla wrote:

Hi Hakan,

 Please find my comments inline in blue :



-Original Message-
From: Hakan (c)lter [mailto:hakanil...@gmail.com]
Sent: Thursday, August 25, 2011 12:28 PM
To: common-user@hadoop.apache.org
Subject: Hadoop with Netapp



Hi everyone,



We are going to create a new Hadoop cluster in our company; I have to get some 
advice from you:



1. Has anyone stored whole Hadoop data not on local disks but on Netapp 
or another storage system? Do we have to store data on local disks, and if so is it 
because of performance issues?



sagar: Yes, we were using SAN LUNs for storing Hadoop data. SAN works faster 
than NAS in terms of performance while writing the data to the storage. Also SAN LUNs 
can be auto-mounted while booting up the system.


Silly question: why? SANs are SPOFs (Gray & van Ingen, MS, 2005; SAN 
responsible for 11% of TerraServer downtime).


Was it because you had the rack and wanted to run Hadoop, or did you 
want a more agile cluster? Because it's going to increase your cost of 
storage dramatically, which means you pay more per TB, or end up with 
less TB of storage. I wouldn't go this way for a dedicated Hadoop 
cluster. For a multi-use cluster, it's a different story







2. What do you think about running Hadoop nodes in virtual (VMware) servers?



sagar: If high speed computing is not a requirement for you then Hadoop nodes 
in VM environment could be a good option, but one other slight drawback is when the 
VM crashes recovery of the in-memory data would be gone. Hadoop takes care of some 
amount of failover, but there is some amount of risk involved and requires good HA 
building capabilities.



I do it for dev and test work, and for isolated clusters in a shared 
environment.


-for CPU bound stuff, it actually works quite well, as there's no 
significant overhead


-for HDD access -reading from the FS, writing to the FS and storing 
transient spill data- you take a tangible performance hit. That's OK if 
you can afford to wait or rent a few extra CPUs -and your block size is 
such that those extra servers can help out -which may be in the map 
phase more than the reduce phase.



Some Hadoop-ish projects -Stratosphere from TU Berlin in particular- are 
designed for VM infrastructure, so they come up with execution plans that 
use VMs efficiently.


-steve


Re: Turn off all Hadoop logs?

2011-09-01 Thread Steve Loughran

On 29/08/11 20:31, Frank Astier wrote:

Is it possible to turn off all the Hadoop logs simultaneously? In my unit 
tests, I don’t want to see the myriad “INFO” logs spewed out by various Hadoop 
components. I’m using:

   ((Log4JLogger) DataNode.LOG).getLogger().setLevel(Level.OFF);
 ((Log4JLogger) LeaseManager.LOG).getLogger().setLevel(Level.OFF);
 ((Log4JLogger) FSNamesystem.LOG).getLogger().setLevel(Level.OFF);
 ((Log4JLogger) DFSClient.LOG).getLogger().setLevel(Level.OFF);
 ((Log4JLogger) Storage.LOG).getLogger().setLevel(Level.OFF);

But I’m still missing some loggers...



You need a log4j.properties file on the classpath that doesn't log so 
much. I do this by:

 -removing /log4j.properties from the Hadoop JARs in our (private) jar 
repository

 -having custom log4j.properties files in the test/ source trees

You could also start JUnit with the right system property to point it 
at a custom log4j file. I forget what that property is.
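If you'd rather do it in code than juggle properties files -and everything 
is routed through log4j- the bluntest option is to switch off the root 
logger in your test setup:

  import org.apache.log4j.Level;
  import org.apache.log4j.Logger;

  public class QuietTests {
    public static void silenceEverything() {
      // turns off every log4j logger in the JVM, not just the Hadoop ones
      Logger.getRootLogger().setLevel(Level.OFF);
    }
  }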




Re: Namenode Scalability

2011-08-17 Thread Steve Loughran

On 17/08/11 08:48, Dieter Plaetinck wrote:

Hi,

On Wed, 10 Aug 2011 13:26:18 -0500
Michel Segel <michael_se...@hotmail.com>  wrote:


This sounds like a homework assignment than a real world problem.


Why? just wondering.


The question proposed a data rate comparable with Yahoo, Google and 
Facebook -yet it was ingress rather than egress, which was even more 
unusual. You'd have to be doing a web-scale search engine to need that 
data rate -and if you were doing that you'd need to know a lot more about 
how Hadoop works (i.e. the limited role of the NN). You'd also have to 
address the entire network infrastructure, the costs of the work on 
your external system, DNS load, power budget. Oh, and the fact that 
unless you were processing and discarding those PB/day at the rate of 
ingress, you'd need to add a new Hadoop cluster at a rate of 1 
cluster/month, which is not only expensive, I don't think datacentre 
construction rates could handle it, even if your server vendor had set 
up a construction/test pipeline to ship down an assembled and tested 
containerised cluster every few weeks (which we can do, incidentally :)





I guess people don't race cars against trains or have two trains
traveling in different directions anymore... :-)


huh?


Different Homework questions.


Re: hadoop cluster mode not starting up

2011-08-16 Thread Steve Loughran

On 16/08/11 11:02, A Df wrote:

Hello All:

I used a combination of tutorials to setup hadoop but most seems to be using 
either an old version of hadoop or only using 2 machines for the cluster which 
isn't really a cluster. Does anyone know of a good tutorial which setups 
multiple nodes for a cluster? I already looked at the Apache website but it 
does not give sample values for the conf files. Also each set of tutorials seem 
to have a different set of parameters which they indicate should be changed so 
now its a bit confusing. For example, my configuration sets a dedicate 
namenode, secondary namenode and 8 slave nodes but when I run the start command 
it gives an error. Should I install hadoop to my user directory or on the root? 
I have it in my directory but all the nodes have a central file system as 
opposed to distributed so whatever I do on one node in my user folder it affect 
all the others so how do i set the paths to ensure that it uses a distributed 
system?

For the errors below, I checked the directories and the files are there. I am 
not sure what went wrong or how to set the conf to not have a central file 
system. Thank you.

Error message
CODE
w1153435@n51:~/hadoop-0.20.2_cluster> bin/start-dfs.sh
bin/start-dfs.sh: line 28: 
/w1153435/hadoop-0.20.2_cluster/bin/hadoop-config.sh: No such file or directory
bin/start-dfs.sh: line 50: 
/w1153435/hadoop-0.20.2_cluster/bin/hadoop-daemon.sh: No such file or directory
bin/start-dfs.sh: line 51: 
/w1153435/hadoop-0.20.2_cluster/bin/hadoop-daemons.sh: No such file or directory
bin/start-dfs.sh: line 52: 
/w1153435/hadoop-0.20.2_cluster/bin/hadoop-daemons.sh: No such file or directory
CODE


there's no such file or directory as 
/w1153435/hadoop-0.20.2_cluster/bin/hadoop-daemons.sh




I had tried running this command below earlier but also got problems:
CODE
w1153435@ngs:~/hadoop-0.20.2_cluster> export 
HADOOP_CONF_DIR=${HADOOP_HOME}/conf
w1153435@ngs:~/hadoop-0.20.2_cluster> export 
HADOOP_SLAVES=${HADOOP_CONF_DIR}/slaves
w1153435@ngs:~/hadoop-0.20.2_cluster> ${HADOOP_HOME}/bin/slaves.sh mkdir -p 
/home/w1153435/hadoop-0.20.2_cluster/tmp/hadoop
-bash: /bin/slaves.sh: No such file or directory
w1153435@ngs:~/hadoop-0.20.2_cluster> export 
HADOOP_HOME=/home/w1153435/hadoop-0.20.2_cluster
w1153435@ngs:~/hadoop-0.20.2_cluster> ${HADOOP_HOME}/bin/slaves.sh mkdir -p 
/home/w1153435/hadoop-0.20.2_cluster/tmp/hadoop
cat: /conf/slaves: No such file or directory
CODE

There's no such file or directory as /conf/slaves because you set 
HADOOP_HOME after setting the other env variables, which are expanded at 
set-time, not run-time.


Re: Help on DFSClient

2011-08-13 Thread Steve Loughran

On 06/08/2011 20:41, jagaran das wrote:

I am keeping a Stream Open and writing through it using a multithreaded 
application.
The application is in a different box and I am connecting to NN remotely.

I was using FileSystem and getting same error and now I am trying DFSClient and 
getting the same error.

When I am running it via simple StandAlone class, it is not throwing any error 
but when i put that in my Application, it is throwing this error.



That's just logging at trace level where the LeaseChecker was created. 
Your app is set up to log at TRACE, and you are getting more diagnostics 
than normal.



06Aug2011 12:29:24,345 DEBUG [listenerContainer-1] (DFSClient.java:1115) - Wait 
for lease checker to terminate
06Aug2011 12:29:24,346 DEBUG 
[LeaseChecker@DFSClient[clientName=DFSClient_280246853, ugi=jagarandas]: 
java.lang.Throwable: for testing
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.toString(DFSClient.java:1181)
at org.apache.hadoop.util.Daemon.<init>(Daemon.java:38)
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.put(DFSClient.java:1094)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:547)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:513)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:497)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:442)
at 
com.apple.ireporter.common.persistence.ConnectionManager.createConnection(ConnectionManager.java:74)
at 
com.apple.ireporter.common.persistence.HDPPersistor.writeToHDP(HDPPersistor.java:95)
at 
com.apple.ireporter.datatransformer.translator.HDFSTranslator.persistData(HDFSTranslator.java:41)
at 
com.apple.ireporter.datatransformer.adapter.TranslatorAdapter.processData(TranslatorAdapter.java:61)
at 
com.apple.ireporter.datatransformer.DefaultMessageListener.persistValidatedData(DefaultMessageListener.java:276)
at 
com.apple.ireporter.datatransformer.DefaultMessageListener.onMessage(DefaultMessageListener.java:93)
at 
org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:506)
at 
org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:463)
at 
org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:435)
at 
org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:322)
at 
org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:260)
at 
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:944)
at 
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:868)
at java.lang.Thread.run(Thread.java:680)




Re: Namenode Scalability

2011-08-13 Thread Steve Loughran

On 10/08/2011 08:58, jagaran das wrote:

In my current project we are planning to stream data to the Namenode (20 node 
cluster).
Data Volume would be around 1 PB per day.
But there are application which can publish data at 1GBPS.


That's Gigabyte/s or Gigabit/s?



Few queries:

1. Can a single Namenode handle such high speed writes? Or it becomes 
unresponsive when GC cycle kicks in.


see below


2. Can we have multiple federated Name nodes  sharing the same slaves and then 
we can distribute the writes accordingly.


that won't solve your problem


3. Can multiple region servers of HBase help us ??


no



Please suggest how we can design the streaming part to handle such scale of 
data.


Data is written to datanodes, not namenodes. The NN is used to set up 
the write chain and then just tracks node health -the data does not go 
through it.

This changes your problem to one of
 -can the NN set up write chains at the speed you want, or do you need 
to throttle back the file creation rate by writing bigger files

 -can the NN handle the (file x block count) volumes you expect
 -what is the network traffic of the data ingress
 -what is the total bandwidth of the replication traffic combined with 
the data ingress traffic?

 -do you have enough disks for the data
 -do your HDDs have enough bandwidth?
 -do you want to do any work with the data, and what CPU/HDD/net load 
does this generate?

 -what impact will disk & datanode replication traffic have?
 -how much of the backbone will you have to allocate to the rebalancer?

At 1 PB/day, ignoring all network issues, you will reach the current 
documented HDFS limits within four weeks. What are you going to do then, 
or will you have processed it down?


I could imagine some experiments you could conduct against a namenode to 
see what its limits are, but there are lot of datacentre bandwidth and 
computation details you have to worry above and beyond datanode 
performance issues.


Like Michael says, 1 PB/day sounds like a homework project, especially 
if you haven't used hadoop at smaller scale. If it is homework, once 
you've done the work (and submitted it), it'd be nice to see the final 
paper.


If it is something you plan to take live, well, there are lots of issues 
to address, of which the NN is just one of the issues -and one you can 
test in advance. Ramping up the cluster with different loads will teach 
you more about the bottlenecks. Otherwise: there are people who know how 
to run Hadoop at scale, who, in exchange for money, will help you.


-steve


Re: Invalid link http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar during ivy download whiling mumak build.

2011-08-02 Thread Steve Loughran

On 30/07/11 06:30, arun k wrote:

Hi all !
  I have added the following code to build.xml and tried to build : $ant
package.
  I have also tried to remove the entire ivy2 (~/.ivy2/*) directory
and rebuild but couldn't succeed.
  <setproxy proxyhost="192.168.0.90" proxyport="8080"
   proxyuser="ranam" proxypassword="passwd" nonproxyhosts="xyz.svn.com"
/>
I get the error UNRESOLVED DEPENDENCIES.
I have attached the log file.


The artifact is there, so it's a proxy problem.

export ANT_OPTS="-Dhttp.proxyHost=proxy -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy -Dhttps.proxyPort=8080"

These don't set Ant properties, they set JVM options, and they do work for 
Hadoop builds.





Re: The best architecture for EC2/Hadoop interface?

2011-08-02 Thread Steve Loughran

On 02/08/11 05:09, Mark Kerzner wrote:

Hi,

I want to give my users a GUI that would allow them to start Hadoop clusters
and run applications that I will provide on the AMIs. What would be a good
approach to make it simple for the user? Should I write a Java Swing app
that will wrap around the EC2 commands? Should I use some more direct EC2
API? Or should I use a web browser interface?

My idea was to give the user a Java Swing GUI, so that he gives his Amazon
credentials to it, and it would be secure because the application is not
exposed to the outside. Does this approach make sense?


1. I'm not sure that a Java Swing GUI makes sense for anything anymore -if 
it ever did.

2. Have a look at what other people have done first before writing your 
own. Amazon provide something for their derivative of Hadoop, Elastic 
MR, and I suspect Karmasphere and others may provide UIs in front of it too.


The other thing is most big jobs are more than one operation, so you are 
in a workflow world. Things like Cascading, Pig and Oozie help here, and 
if you can bring them up in-cluster, you can get a web UI.


Re: Submitting and running hadoop jobs Programmatically

2011-07-27 Thread Steve Loughran

On 27/07/11 05:55, madhu phatak wrote:

Hi
I am submitting the job as follows

java -cp
  
Nectar-analytics-0.0.1-SNAPSHOT.jar:/home/hadoop/hadoop-for-nectar/hadoop-0.21.0/conf/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_COMMON_HOME/*
com.zinnia.nectar.regression.hadoop.primitive.jobs.SigmaJob input/book.csv
kkk11fffrrw 1


My code to submit jobs (via a declarative configuration) is up online

http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/operations/components/submitter/SubmitterImpl.java?revision=8590&view=markup

It's LGPL, but ask nicely and I'll change the header to Apache.

That code doesn't set up the classpath by pushing out more JARs (I'm 
planning to push out .groovy scripts instead), but it can also poll for 
job completion, take a timeout (useful in small test runs), and do other 
things. I currently mainly use it for testing.
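The poll-with-timeout part is the easy bit; a minimal sketch of the shape 
of it (not the SmartFrog code itself), against the new Job API:

  import org.apache.hadoop.mapreduce.Job;

  public class JobWaiter {
    /** Submit the job, then poll until it finishes or the timeout expires. */
    public static boolean runWithTimeout(Job job, long timeoutMillis)
        throws Exception {
      job.submit();
      long deadline = System.currentTimeMillis() + timeoutMillis;
      while (!job.isComplete()) {
        if (System.currentTimeMillis() > deadline) {
          job.killJob();            // give up and kill the job
          return false;
        }
        Thread.sleep(5000);         // poll every few seconds
      }
      return job.isSuccessful();
    }
  }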




Re: Job progress not showing in Hadoop Tasktracker web interface

2011-07-21 Thread Steve Loughran

On 20/07/11 06:11, Teng, James wrote:

You can't run a hadoop job in eclipse, you have to set up an environment on 
linux system. Maybe you can try to install it on a VMware linux system and run 
the job in pseudo-distributed system.




Actually you can bring up a MiniMRCluster in your JUnit test run (if 
hadoop-core-test is on the classpath) and run simple jobs against that. 
This is the standard way that Hadoop tests itself. It's not that high 
performing, doesn't scale out and can leak threads, but it's ideal for 
basic testing.
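Roughly what the Hadoop tests themselves do -the exact constructors shift 
a little between versions, so treat this as a sketch:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hdfs.MiniDFSCluster;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MiniMRCluster;

  public class MiniClusterSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // in-JVM HDFS with two datanodes
      MiniDFSCluster dfs = new MiniDFSCluster(conf, 2, true, null);
      // in-JVM MR cluster with two tasktrackers, pointed at the mini HDFS
      MiniMRCluster mr =
          new MiniMRCluster(2, dfs.getFileSystem().getUri().toString(), 1);
      try {
        JobConf jobConf = mr.createJobConf();
        // ... configure and run a small job against jobConf here ...
      } finally {
        mr.shutdown();
        dfs.shutdown();
      }
    }
  }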


Re: error of loading logging class

2011-07-21 Thread Steve Loughran

On 20/07/11 07:16, Juwei Shi wrote:

Hi,

We faced a problem of loading logging class when start the name node.  It
seems that hadoop can not find commons-logging-*.jar

We have tried other commons-logging-1.0.4.jar and
commons-logging-api-1.0.4.jar. It does not work!

The following are error logs from starting console:



I'd drop the -api file as it isn't needed, and, as you say, avoid 
duplicate versions. Make sure that log4j is at the same point in the 
class hierarchy too (e.g. in hadoop/lib).


To debug commons logging, tell it to log to stderr. It's useful in 
emergencies:


-Dorg.apache.commons.logging.diagnostics.dest=STDERR


Re: Which release to use?

2011-07-19 Thread Steve Loughran

On 19/07/11 12:44, Rita wrote:

Arun,

I second Joe's comment.
Thanks for giving us a heads up.
I will wait patiently until 0.23 is considered stable.



API-wise, 0.21 is better. I know that as I'm working with 0.20.203 right 
now, and it is a step backwards.


Regarding future releases, the best way to get them stable is to participate 
in release testing on your own infrastructure. Nothing else will find 
the problems unique to your setup of hardware, network and software.




Re: Which release to use?

2011-07-17 Thread Steve Loughran

On 16/07/2011 16:53, Rita wrote:

I am curious about the IBM product BigInishgts. Where can we download it? It
seems we have to register to download it?



I think you have to pay to use it


Re: Cluster Tuning

2011-07-15 Thread Steve Loughran

On 08/07/2011 16:25, Juan P. wrote:

Here's another thought. I realized that the reduce operation in my
map/reduce jobs is a flash. But it goes really slow until the
mappers end. Is there a way to configure the cluster to make the reduce wait
for the map operations to complete? Specially considering my hardware
restraints


Take a look to see if it's usually the same machine that's taking too 
long; test your HDDs to see if there are any signs of problems in the 
SMART messages. Then turn on speculation. It could be that the problem 
with a slow mapper is caused by disk problems or an overloaded server.
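To turn speculation on per job (it can also go in mapred-site.xml for the 
whole cluster), something like:

  import org.apache.hadoop.mapred.JobConf;

  public class Speculation {
    public static void enable(JobConf conf) {
      // same as setting mapred.map.tasks.speculative.execution and
      // mapred.reduce.tasks.speculative.execution to true
      conf.setMapSpeculativeExecution(true);
      conf.setReduceSpeculativeExecution(true);
    }
  }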




Re: Which release to use?

2011-07-15 Thread Steve Loughran

On 15/07/2011 15:58, Michael Segel wrote:


Unfortunately the picture is a bit more confusing.

Yahoo! is now HortonWorks. Their stated goal is to not have their own 
derivative release but to sell commercial support for the official Apache 
release.
So those selling commercial support are:
*Cloudera
*HortonWorks
*MapRTech
*EMC (reselling MapRTech, but had announced their own)
*IBM (not sure what they are selling exactly... still seems like smoke and 
mirrors...)
*DataStax


+ Amazon, indirectly, who do their own derivative work of some release 
of Hadoop (which version is it based on?)


I've used 0.21, which was the first with the new APIs and, with MRUnit, 
has the best test framework. For my small-cluster uses, it worked well. 
(oh, and I didn't care about security)





Re: Which release to use?

2011-07-15 Thread Steve Loughran

On 15/07/2011 18:06, Arun C Murthy wrote:

Apache Hadoop is a volunteer driven, open-source project. The contributors to 
Apache Hadoop, both individuals and folks across a diverse set of 
organizations, are committed to driving the project forward and making timely 
releases - see discussion on hadoop-0.23 with a raft newer features such as 
HDFS Federation, NextGen MapReduce and plans for HA NameNode etc.

As with most successful projects there are several options for commercial 
support to Hadoop or its derivatives.

However, Apache Hadoop has thrived before there was any commercial support 
(I've personally been involved in over 20 releases of Apache Hadoop and 
deployed them while at Yahoo) and I'm sure it will in this new world order.

We, the Apache Hadoop community, are committed to keeping Apache Hadoop 'free', 
providing support to our users and to move it forward at a rapid rate.



Arun makes a good point, which is that the Apache project depends on 
contributions from the community to thrive. That includes:


 -bug reports
 -patches to fix problems
 -more tests
 -documentation improvements: more examples, more on getting started, 
troubleshooting, etc.


If there's something lacking in the codebase, and you think you can fix 
it, please do so. Helping with the documentation is a good start, as it 
can be improved, and you aren't going to break anything.


Once you get into changing the code, you'll end up working with the head 
of whichever branch you are targeting.


The other area everyone can contribute on is testing. Yes, Y! and FB can 
test at scale, and yes, other people can test large clusters too -but 
nobody has a network that looks like yours but you. And Hadoop does care 
about network configurations. Testing beta and release candidate builds 
on your infrastructure helps verify that the final release will work on 
your site, and you don't end up getting all the phone calls about 
something not working.


Re: Hadoop cluster hardware details for big data

2011-07-06 Thread Steve Loughran

On 06/07/11 11:43, Karthik Kumar wrote:

Hi,

Has anyone here used hadoop to process more than 3TB of data? If so we
would like to know how many machines you used in your cluster and
about the hardware configuration. The objective is to know how to
handle huge data in Hadoop cluster.



This is too vague a question. What do you mean by "process"? Scan through 
some logs looking for values? You could do that on a single machine if 
you weren't in a rush and you had enough disks; you'd just be very IO 
bound, and to be honest HDFS needs a minimum number of machines to 
become fault tolerant. Do complex matrix operations that use lots of RAM 
and CPU? You'll need more machines.


If your cluster has a blocksize of 512MB then a 3TB file fits into 
(3*1024*1024)/512 = 6144 blocks, so you can't usefully use more than 6144 
machines on it anyway -that's your theoretical maximum, even if your name 
is Facebook or Yahoo!


What you are looking for is something in between 10 and 6144, the exact 
number driven by
 -how much compute you need to do, and how fast you want it done 
(controls #of CPUs, RAM)

 -how much total HDD storage you anticipate needing
 -whether you want to do leading-edge GPU work (good performance on 
some tasks, but limited work per machine)


You can use benchmarking tools like gridmix3 to get some more data on 
the characteristics of your workload, which you can then take to your 
server supplier to say "this is what we need, what can you offer?" 
Otherwise everyone is just guessing.


Remember also that you can add more racks later, but you will need to 
plan ahead on datacentre space, power and -very importantly- how you are 
going to expand the networking. Life is simplest if everything fits into 
one rack, but if you plan to expand you need to have a roadmap of how to 
connect that rack to some new ones, which means adding fast interconnect 
between different top of rack switches. You also need to worry about how 
to get data in and out fast.



-Steve


Re: Hadoop cluster hardware details for big data

2011-07-06 Thread Steve Loughran

On 06/07/11 11:43, Karthik Kumar wrote:

Hi,

Has anyone here used hadoop to process more than 3TB of data? If so we
would like to know how many machines you used in your cluster and
about the hardware configuration. The objective is to know how to
handle huge data in Hadoop cluster.



Actually, I've just thought of a simpler answer: 40. It's completely 
random, but if said with confidence it's as valid as any other answer to 
your current question.


Re: Hadoop cluster hardware details for big data

2011-07-06 Thread Steve Loughran

On 06/07/11 13:18, Michel Segel wrote:

Wasn't the answer 42?  ;-P



42 = 40 + NN + 2ary NN, assuming the JT runs on the 2ary or on one of the 
worker nodes.



Looking at your calc...
You forgot to factor in the number of slots per node.
So the number is only a fraction. Assume 10 slots per node. (10 because it 
makes the math easier.)


I thought something was wrong. Then I thought of the server revenue and 
decided not to look that hard.


Re: error in reduce task

2011-06-27 Thread Steve Loughran

On 24/06/11 18:16, Niels Boldt wrote:

Hi,

I'm running nutch in pseudo cluster, eg all daemons are running on the same
server. I'm writing to the hadoop list, as it looks like a problem related
to hadoop

Some of my jobs partially fails and in the error log I get output like

2011-06-24 08:45:05,765 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201106231520_0190_r_00_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)

2011-06-24 08:45:05,771 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_201106231520_0190_r_00_0 copy failed:
attempt_201106231520_0190_m_00_0 from worker1
2011-06-24 08:45:05,772 WARN org.apache.hadoop.mapred.ReduceTask:
java.net.UnknownHostException: worker1




The above basically said that my worker is unknown, but I can't really make
any sense of it. Other jobs running before, at the same time or after
completes fine without any error messages and without any changes on the
server. Also other reduce task in the same run has succeded. So it looks
like that my worker sometimes 'disappear' and can't be reached.


If the worker had disappeared off the net, you'd be more likely to see 
a NoRouteToHost



My current theory is that it only happens when there are a couple of jobs
running at the same time. Is that a plausible explanation

Would anybody have some suggestions how I could get more infomation from the
system, or point me in a direction where I should look(I'm also quite new to
hadoop)


I'd assume that one machine in the cluster doesn't have an /etc/hosts 
entry for worker1, or that the DNS server is suffering under load. If you 
can, put the host lists into the /etc/hosts table instead of relying on 
DNS. If you do it on all machines, it avoids having to work out which 
one is playing up. That said, some better logging of which host is 
trying to make the connection would be nice
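
As a quick check, a few lines of Java run on each box will show what 
that JVM actually resolves -the hostname below is a placeholder, and an 
UnknownHostException here is the same failure the reduce is reporting:

  import java.net.InetAddress;

  public class DnsCheck {
    public static void main(String[] args) throws Exception {
      String host = args.length > 0 ? args[0] : "worker1";   // placeholder name
      InetAddress addr = InetAddress.getByName(host);        // forward lookup
      System.out.println(host + " -> " + addr.getHostAddress());
      // the reverse lookup, which Hadoop also leans on
      System.out.println(addr.getHostAddress() + " -> "
          + InetAddress.getByName(addr.getHostAddress()).getCanonicalHostName());
    }
  }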


Re: Upgrading namenode/secondary node hardware

2011-06-17 Thread Steve Loughran

On 16/06/11 14:19, MilleBii wrote:

But if my Filesystem is up & running fine... do I have to worry at all or
will the copy (ftp transfer) of  hdfs will be enough.



I'm not going to make any predictions there as if/when things go wrong

 -you do need to shut down the FS before the move
 -you ought to get the edit logs replayed before the move
 -you may want to try experimenting with copying the namenode data and 
bringing up the namenode (without any datanodes connected to it, so it 
comes up in safe mode), to make sure everything works.


I'd also worry that if you aren't familiar with the edit log, you may 
need to spend some time learning the subtle details of namenode 
journalling, replaying, backup and restoration, and what the secondary 
namenode does. It's easy to bring up a cluster and get overconfident 
that it works, right up to the moment it stops working. Experiment with 
your cluster's and teams' failure handling before you really need it




2011/6/16 Steve Loughranste...@apache.org


On 15/06/11 15:54, MilleBii wrote:


Thx.

#1 don't understand the edit logs remark.



well, that's something you need to work on as its the key to keeping your
cluster working. The edit log is the journal of changes made to a namenode,
which gets streamed to HDD and your secondary Namenode. After a NN restart,
it has to replay all changes since the last checkpoint to get its directory
structure up to date. Lose the edit log and you may as well reformat the
disks.









Re: Upgrading namenode/secondary node hardware

2011-06-16 Thread Steve Loughran

On 15/06/11 15:54, MilleBii wrote:

Thx.

#1 don't understand the edit logs remark.


well, that's something you need to work on as its the key to keeping 
your cluster working. The edit log is the journal of changes made to a 
namenode, which gets streamed to HDD and your secondary Namenode. After 
a NN restart, it has to replay all changes since the last checkpoint to 
get its directory structure up to date. Lose the edit log and you may as 
well reformat the disks.


Re: Upgrading namenode/secondary node hardware

2011-06-15 Thread Steve Loughran

On 14/06/11 22:01, MilleBii wrote:

I want/need to upgrade my namenode/secondary node hardware. Actually also
acts as one of the datanodes.

Could not find any how-to guides.
So what is the process to switch from one hardware to the next.

1. For HDFS data : is it just a matter of copying all the hdfs data from old
server to new server.


yes, put it in the same place on your HA storage and you may not even 
need to reconfigure it. If you didn't shut down the filesystem cleanly, 
you'll need to replay the edit logs.



2. what about the decommissioning procedure of data node, is it necessary in
that case ?


You shouldn't need to. This is no different from handling failover of a 
namenode, which you ought to try from time to time anyway, with two 
common tactics
 -have ready-to-go replacement servers with the same hostname/IP and 
shared storage
 -have ready-to-go replacement servers with different hostnames, then 
with your cluster management tools bounce the workers into a new 
configuration.



3.For MapRed:  need to change the master in cluster configuration files


I'd give the new boxes the same hostnames and IPAddresses as before, and 
nothing else will notice. And I recommend having good cluster management 
tooling anyway, of course.




Re: Hadoop on windows with bat and ant scripts

2011-06-14 Thread Steve Loughran

On 13/06/11 15:27, Bible, Landy wrote:

On 06/13/2011 07:52 AM, Loughran, Steve wrote:


On 06/10/2011 03:23 PM, Bible, Landy wrote:
I'm currently running HDFS on Windows 7 desktops.  I had to create a hadoop.bat 
that provided the same functionality of the shell scripts, and some Java 
Service Wrapper configs to run the DataNodes and NameNode as windows services.  
Once I get my system more functional I plan to do a write up about how I did 
it, but it wasn't too difficult.  I'd also like to see Hadoop become less 
platform dependent.



why? Do you plan to bring up a real Windows server datacenter to test it on?


Not a datacenter, but a large-ish cluster of desktops, yes.


Whether you like it or not, all the big Hadoop clusters run on Linux


I realize that, I use Linux wherever possible, much to the annoyance of my 
Windows only co-workers. However, for my current project, I'm using all the 
Windows 7 and Vista desktops at my site as a storage cluster.   The first idea 
was to run Hadoop on Linux in a VM in the background on each desktop, but that 
seemed like overkill.  The point here is to use the resources we have but 
aren't using, rather than buy new resources.  Academia is funny like that.


I understand. One trick my local university has done is to buy a set of 
servers with HDDs for their HDFS filestore, but also hook them up to 
their grid scheduler (Condor? Torque?) so the existing grid jobs see a 
set of machines for their work, while the Job tracker sees a farm of 
worker nodes with local data. Some more work there on reporting 
busy-state to each job scheduler would be nice, so that the Task 
Trackers would say busy when running grid jobs, and vice-versa





   So far, I've been unable to make MapReduce work correctly.  The services 
run, but things don't work, however I suspect that this is due to DNS not 
working correctly in my environment.



yes, that's part of the "anywhere" you have to fix. Edit the host tables so that 
DNS and reverse DNS appear to work. That's 
c:\windows\system32\drivers\etc\hosts, unless on a win64 box it moves.


Why does Hadoop even care about DNS?   Every node checks in with the NameNode 
and JobTrackers, so they know where they are, why not just go pure IP based and 
forget DNS.   Managing the hosts file is a pain... even when you automate it, 
it just seems unneeded.


there's been some fixes in 0.21 and 0.22, but still there may be a 
tendency to look things up.


https://issues.apache.org/jira/browse/HADOOP-3426
https://issues.apache.org/jira/browse/HADOOP-7104

Hadoop doesn't like coming up on multi-homed servers or having separate 
in-cluster and long-haul hostnames. Yes, this all needs fixing. I think 
the reason it hasn't been fixed is that the big datacentres do have well 
configured networks, caching DNS servers in every worker node, etc, and 
all is well. It's the home networks and the less-consistently set up 
ones (mine, and perhaps yours) where the trouble shows up. We get to 
file the bugs and fix the problems.


Re: Hadoop on windows with bat and ant scripts

2011-06-13 Thread Steve Loughran

On 06/10/2011 03:23 PM, Bible, Landy wrote:

Hi Raja,

I'm currently running HDFS on Windows 7 desktops.  I had to create a hadoop.bat 
that provided the same functionality of the shell scripts, and some Java 
Service Wrapper configs to run the DataNodes and NameNode as windows services.  
Once I get my system more functional I plan to do a write up about how I did 
it, but it wasn't too difficult.  I'd also like to see Hadoop become less 
platform dependent.


why? Do you plan to bring up a real Windows server datacenter to test it 
on?



Java is supposed to be Write Once - Run Anywhere, but a lot of java projects 
seem to forget that.


Java can be x-platform, but you have to consider the problems of testing 
on hundreds of machines, the fact that even Runtime.exec() behaves 
differently on different systems, the networking setup and behaviour of 
windows is very different from Unix, etc.


Whether you like it or not, all the big Hadoop clusters run on Linux, 
not just for the licensing costs, but because it is what Hadoop is 
tested on at those scales, so it becomes self-reinforcing. Same for the 
JVM: Sun's standard JVM, not JRockit or anything else. Again, in a large 
datacenter you will find all the corner cases where that "runs anywhere" 
claim changes to "crashes one task tracker every hour".


OS/X and Windows support is very much there for development, though even 
there I'd recommend switching to a Linux laptop to reduce the surprises 
when you go to the real cluster. Allen W will note that Solaris works 
too, but even then differences between Linux and SunOS caused problems.


By having a de-facto agreement to focus on Linux as the back end, it 
lets the developers

* have a single platform to dev and test on
* worry about RPM and deb installers, not windows install/uninstall quirks.
* share ready-to-use Linux VM images (as Cloudera do) for people to play 
with.
* use the large cluster management tooling that exists for managing big 
Linux clusters (Kickstart, etc).


I think it's important is for the client-side code to work on windows, 
for job submission to be x-platform, but getting server-side code to 
work well on windows is a lot harder than people expect. The OS wasn't 
really written for it, the Java Service Wrappers have their own issues 
(both the Apache one, which is derived from Tomcat, and the other one), 
and it's not something I'd recommend to go near unless you really have 
no choice in the matter. I speak from experience.


Sorry.



  So far, I've been unable to make MapReduce work correctly.  The services run, 
but things don't work, however I suspect that this is due to DNS not working 
correctly in my environment.


yes, that's part of the "anywhere" you have to fix. Edit the host tables 
so that DNS and reverse DNS appear to work. That's 
c:\windows\system32\drivers\etc\hosts, unless on a win64 box it moves.






Re: NameNode heapsize

2011-06-13 Thread Steve Loughran

On 06/10/2011 05:31 PM, si...@ugcv.com wrote:



I would add more RAM for sure but there's hardware limitation. How if
the motherboard
couldn't support more than ... say 128GB ? seems I can't keep adding RAM
to resolve it.

compressed pointers, do u mean turning on jvm compressed reference ?
I didn't try that out before, how's your experience ?


JVMs top out at 64GB, I think, while compressed pointers only work on 
sun VMs when the heap is under 32GB. JRockit has better heap management, 
but as I was the only person to admit to using Hadoop on JRockit, I know 
you'd be on your own if you found problems there.


Unless your cluster is bigger than Facebook's, you have too many small files




Re: Hadoop on windows with bat and ant scripts

2011-06-13 Thread Steve Loughran

On 06/12/2011 03:01 AM, Raja Nagendra Kumar wrote:


Hi,

I see hadoop would need unix (on windows with Cygwin) to run.
It would be much nice if Hadoop gets away from the shell scripts though
appropriate ant scripts or with java Admin Console kind of model. Then it
becomes lighter for development.


The overhead of executing ant vs shell scripts would make things worse 
on Linux clusters. The cost of installing cygwin on developer desktops 
isn't that high.


There's been discussions for a better native library for Hadoop 
operations, but it would be biased towards Unix (POSIX file permissions, 
paths, etc).


Re: Why inter-rack communication in mapreduce slow?

2011-06-07 Thread Steve Loughran

On 06/06/2011 02:40 PM, John Armstrong wrote:

On Mon, 06 Jun 2011 09:34:56 -0400,dar...@ontrenet.com  wrote:

Yeah, that's a good point.




In fact, it almost makes me wonder if an ideal setup is not only to have
each of the main control daemons on their own nodes, but to put THOSE nodes
on their own rack and keep all the data elsewhere.


I'd give them a 10Gbps connection to the main network fabric, as with any 
ingress/egress nodes whose aim in life is to get data into and out of 
the cluster. There's a lot to be said for fast nodes within the 
datacentre but not hosting datanodes, as that way their writes get 
scattered everywhere -which is what you need when loading data into HDFS.


You don't need separate racks for this, just more complicated wiring.

-steve

(disclaimer, my network knowledge generally stops at Connection Refused 
and No Route to Host messages)


Re: Hadoop Cluster Multi-datacenter

2011-06-07 Thread Steve Loughran

On 06/07/2011 06:07 AM, sanjeev.ta...@us.pwc.com wrote:

Hello,

I wanted to know if anyone has any tips or tutorials on howto install the
hadoop cluster on multiple datacenters


Nobody has come out and said they've built a single HDFS filesystem from 
multiple sites, primarily because the inter-site bandwidth/latency will 
be awful and there isn't any support for this in the topology model of 
Hadoop (there are some placeholders though).


You could set up an HDFS filesystem in each datacentre, and use symbolic 
links (or the forthcoming federation) to pull data in. There's no reason 
why you can't start up a job on Datacentre-1 that starts reading some of 
its data from DC-2, after which all the work will be datacentre-local.
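
As a sketch of what I mean -hostnames and ports below are made up- the 
input paths just carry the fully-qualified hdfs:// URI of the remote 
namenode, while the job runs and writes in the local cluster:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;

  public class CrossSiteJob {
    // called while setting up the JobConf for a job submitted in Datacentre-1
    public static void configure(JobConf conf) {
      // read the initial data from the other datacentre's namenode
      FileInputFormat.addInputPath(conf,
          new Path("hdfs://namenode.dc2.example.com:8020/logs/2011/06"));
      // write the output into the local cluster, so later stages stay local
      FileOutputFormat.setOutputPath(conf,
          new Path("hdfs://namenode.dc1.example.com:8020/work/output"));
    }
  }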



Do you need ssh connectivity between the nodes across these data centers?


Depends on how you deploy Hadoop. You only need SSH if you use the 
built-in tooling; if you use large scale cluster management tools then 
it's a non-issue.


Re: NameNode is starting with exceptions whenever its trying to start datanodes

2011-06-07 Thread Steve Loughran

On 06/07/2011 10:50 AM, praveenesh kumar wrote:

The logs say


The ratio of reported blocks 0.9091 has not reached the threshold 0.9990.
Safe mode will be turned off automatically.



not enough datanodes reported in, or they are missing data


Re: Why inter-rack communication in mapreduce slow?

2011-06-06 Thread Steve Loughran

On 06/06/11 08:22, elton sky wrote:

hello everyone,

As I don't have experience with big scale cluster, I cannot figure out why
the inter-rack communication in a mapreduce job is significantly slower
than intra-rack.
I saw cisco catalyst 4900 series switch can reach upto 320Gbps forwarding
capacity. Connected with 48 nodes with 1Gbps ethernet each, it should not be
much contention at the switch, is it?


I don't know enough about these switches; I do hear stories about 
buffering and the like, and I also hear that a lot of switches don't 
always expect all the ports to light up simultaneously.


Outside hadoop, try setting up some simple bandwidth tests to measure 
inter-rack bandwidth: have every node on one rack try and talk to one on 
another at full rate.
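
A crude probe along these lines is enough to get numbers -plain sockets, 
no Hadoop involved; the port and transfer size are arbitrary. Run it as 
"sink" on one node and "source <sink-host>" on another:

  import java.io.InputStream;
  import java.io.OutputStream;
  import java.net.ServerSocket;
  import java.net.Socket;

  public class BandwidthProbe {
    static final int PORT = 40404;          // any free port will do
    static final long TOTAL = 1L << 30;     // push 1 GB

    // receiver: java BandwidthProbe sink
    static void sink() throws Exception {
      ServerSocket server = new ServerSocket(PORT);
      Socket s = server.accept();
      InputStream in = s.getInputStream();
      byte[] buf = new byte[64 * 1024];
      while (in.read(buf) != -1) { /* discard */ }
      s.close();
      server.close();
    }

    // sender: java BandwidthProbe source <sink-host>
    static void source(String host) throws Exception {
      byte[] buf = new byte[64 * 1024];
      long start = System.currentTimeMillis();
      Socket s = new Socket(host, PORT);
      OutputStream out = s.getOutputStream();
      for (long sent = 0; sent < TOTAL; sent += buf.length) {
        out.write(buf);
      }
      out.close();
      s.close();
      double secs = (System.currentTimeMillis() - start) / 1000.0;
      System.out.printf("%.1f MB/s%n", TOTAL / (1024.0 * 1024.0) / secs);
    }

    public static void main(String[] args) throws Exception {
      if ("sink".equals(args[0])) sink(); else source(args[1]);
    }
  }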


Set up every node talking to every other node at least once, to make 
sure there aren't odd problems between two nodes, which can happen if 
one of the NICs is playing up.


Once you are happy that the basic bandwidth between servers is OK, then 
it's time to start worrying about adding Hadoop to the mix


-steve


Re: Starting a Hadoop job outside the cluster

2011-06-06 Thread Steve Loughran

My Job submit code is



http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/operations/components/submitter/

something to run tool classes
http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/operations/components/submitter/ToolRunnerComponentImpl.java?revision=8590&view=markup


something to integrate job submission with some pre-run sanity checks, 
and to optionally wait for the work to finish


http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/operations/components/submitter/SubmitterImpl.java?revision=8590&view=markup

works remotely for short-lived jobs; if you submit something that may 
run over a weekend you don't normally want to block for it
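
Stripped down, the submission side with the classic API is small; a 
rough sketch -class and path names made up, mapper/reducer setup left 
out- showing both the blocking and the fire-and-forget routes:

  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RunningJob;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class SubmitSketch extends Configured implements Tool {
    public int run(String[] args) throws Exception {
      JobConf conf = new JobConf(getConf(), SubmitSketch.class);
      conf.setJobName("remote-submit-sketch");
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      // ...set mapper/reducer/formats here...

      boolean block = args.length > 2 && "-wait".equals(args[2]);
      if (block) {
        return JobClient.runJob(conf).isSuccessful() ? 0 : 1;  // blocks until done
      } else {
        RunningJob job = new JobClient(conf).submitJob(conf);  // returns at once
        System.out.println("submitted " + job.getID());
        return 0;
      }
    }

    public static void main(String[] args) throws Exception {
      System.exit(ToolRunner.run(new SubmitSketch(), args));
    }
  }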




Re: Identifying why a task is taking long on a given hadoop node

2011-06-05 Thread Steve Loughran

On 03/06/2011 12:24, Mayuresh wrote:

Hi,

I am really having a hard time debugging this. I have a hadoop cluster and
one of the maps is taking time. I checked the datanode logs and can see no
activity for around 10 minutes!



The usual cause here is imminent disk failure, as reads start to take 
longer and longer. Look at your SMART disk logs and do some performance 
tests of all the drives.




Re: java.lang.NoClassDefFoundError: com.sun.security.auth.UnixPrincipal

2011-05-27 Thread Steve Loughran

On 05/26/2011 07:45 PM, subhransu wrote:

Hello Geeks,
  I am a new bee to use hadoop and i am currently installed hadoop-0.20.203.0
I am running the sample programs part of this package but getting this error

Any pointer to fix this ???

~/Hadoop/hadoop-0.20.203.0 788  bin/hadoop jar
hadoop-examples-0.20.203.0.jar sort
java.lang.NoClassDefFoundError: com.sun.security.auth.UnixPrincipal
  at
org.apache.hadoop.security.UserGroupInformation.clinit(UserGroupInformation.java:246)
  at java.lang.J9VMInternals.initializeImpl(Native Method)
  at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
  at org.apache.hadoop.mapred.JobClient.init(JobClient.java:449)
  at org.apache.hadoop.mapred.JobClient.init(JobClient.java:437)
  at org.apache.hadoop.examples.Sort.run(Sort.java:82)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.hadoop.examples.Sort.main(Sort.java:187)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at


you're running the IBM JVM.
https://issues.apache.org/jira/browse/HADOOP-7211

Go to the IBM web site and download their slightly-modified version of 
Hadoop that works with their JVM, or switch to the Sun JVM, which is the 
only one that Hadoop is rigorously tested on. Sorry.


-steve




Re: Hadoop and WikiLeaks

2011-05-23 Thread Steve Loughran

On 23/05/11 01:10, Edward Capriolo wrote:



Correct. But it is a place to discuss changing the content of
http://hadoop.apache.org which is what I am advocating.



Todd's going to fix it. I just copied and pasted in the newspaper quote: 
it's not that I wanted to make any statement whatsoever, I just slapped 
in what the paper said. That's one of the English papers that isn't 
allowed to say which soccer player has been playing away from home, or 
reprint any part of the twitter query 
http://twitter.com/#search?q=%23superinjunction . When Britain has 
secrets, see, they're shallow and meaningless.


For anyone who wants to see what the award looks like, here's Owen 
holding it

http://www.flickr.com/photos/steve_l/5560919533/in/set-72157626356732562

apparently people kept coming up to talk to Jakob when he was walking 
round the cube, though once he started to explain about distributed file 
systems and resilient application execution through idempotent 
operations executed near the data they always ran off.


-steve


Re: Exception in thread AWT-EventQueue-0 java.lang.NullPointerException

2011-05-17 Thread Steve Loughran

On 16/05/11 21:12, Lạc Trung wrote:

I'm using Hadoop-0.21.
---
hut.edu.vn



At the top, it's your code, so you get to fix it. The good thing about 
open source is you can go all the way in.


This is what I would do in the same situation:
-Grab the 0.21 source JAR
-add it to your IDE
-have a look at what line it's NPEing on
-work out which variable being null would trigger that. Usually it's 
whatever object has just had a method called on it.

-work out why that's null (usually some config/startup thing was missing).
-If it's a repeatable problem, try adding some checks in advance, dump 
state to the console, etc.

-fix what is probably a setup problem

If it's some other problem (race condition etc), life is harder

-steve


Re: Cluster hard drive ratios

2011-05-06 Thread Steve Loughran

On 05/05/11 19:14, Matthew Foley wrote:

a node (or rack) is going down, don't replicate == DataNode Decommissioning.

This feature is available.  The current usage is to add the hosts to be decommissioned to the 
exclusion file named in dfs.hosts.exclude, then use DFSAdmin to invoke -refreshNodes.  
(Search for decommission in DFSAdmin source code.)  NN will stop using these servers as 
replication targets, and will re-replicate all their replicas to other hosts that are still in 
service.  The count of nodes that are in the process of being decommissioned is reported in the NN 
status web page.



I'm thinking more of "don't overreact to 50 machines going offline by 
rebalancing all copies whose replication count has just dropped by 1 
-at least not until the rack has been offline for 30 minutes".


Re: Cluster hard drive ratios

2011-05-05 Thread Steve Loughran

On 04/05/11 19:59, Matt Goeke wrote:

Mike,

Thanks for the response. It looks like this discussion forked on the CDH
list so I have two different conversations now. Also, you're dead on
that one of the presentations I was referencing was Ravi's.

With your setup I agree that it would have made no sense to go the 2.5
drive route given it would have forced you into the 500-750GB SATA
drives and all it would allow is more spindles but less capacity at a
higher cost. The servers we have been considering are actually the
R710's so dual hexacore with 12 spindles of actual capacity is more of a
1:1 in terms of cores to spindles vs the 2:1 I have been reviewing. My
original issue attempted to focus more around at what point do you
actually see a plateau in write performance of cores:spindles but since
we are headed that direction anyway it looks like it was more to sate
curiosity than driving specifications.


Some people are using this as it gives the best storage density. You can 
also go for single hexacore servers as in a big cluster the savings 
there translate into even more storage. It all depends on the application.



As to your point, I forgot to include the issue of rebalancing in the
original email but you are absolutely right. That was another major
concern especially as we would get closer to filling capacity of a 24TB
box. I think the original plan was bonded GBe but I think our
infrastructure team has told us 10GBe would be standard.



1. If you want to play with bonded GBe then I have some notes I can send 
you -it's harder than you think.
2. I don't know anyone who is running 10 GBe + Hadoop, though I see 
hints that StumbleUpon are doing this with Arista switches. You'd have 
to have a very chatty app or 10GBe on the mainboard to justify it.
3. I do know of installations with 24TB HDD and GBe, yes, the overhead 
of a node failure is higher. But with less nodes, P(failure) may be 
lower. The big fear is loss-of-rack, which can come from ToR switch 
failure or from network config errors. Hadoop isn't partition aware  
will treat a rack outage as the loss of 40+ servers, try to replicate 
all that data, and that's when you're in trouble (look at the AWS EBS 
outage for an example cascade failure).
4. There are JIRA issues for better handling of drive failure, including 
hotswapping and rebalancing data within a single machine.
5. I'd like support for the ability to say "a node is going down, don't 
replicate", and the same for a rack, to ease maintenance.


-Steve


Re: Applications creates bigger output than input?

2011-05-02 Thread Steve Loughran

On 30/04/2011 05:31, elton sky wrote:

Thank you for suggestions:

Weblog analysis, market basket analysis and generating search index.

I guess for these applications we need more reduces than maps, for handling
large intermediate output, isn't it. Besides, the input split for map should
be smaller than usual,  because the workload for spill and merge on map's
local disk is heavy.


any form of rendering can generate very large images

see: http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf




Re: Execution time.

2011-04-27 Thread Steve Loughran

On 26/04/11 14:16, real great.. wrote:

Thanks a lot.I have managed to do it.
And my final year project is on power aware Hadoop. i do realise its against
ethics to get the code that way..:)


Good.

What do you mean by "power aware"?

-awareness of the topology of UPS sources inside a datacentre
-awareness of CPU voltage level/power drain to schedule work where CPUs 
are capable of being most efficiently used, rather than scheduling work 
on a CPU that will have to ramp up to its full voltage and so be least 
efficient?


either would be interesting. You could use the existing rack topology 
scripts for UPS topology, but really there should be two topologies, as 
it's block placement where you need the UPS topology


Re: Cluster hardware question

2011-04-27 Thread Steve Loughran

On 26/04/11 14:55, Xiaobo Gu wrote:

Hi,

  People say a balanced server configration is as following:

  2 4 Core CPU, 24G RAM, 4 1TB SATA Disks

But we have been used to use storages servers with 24 1T SATA Disks,
we are wondering will Hadoop be CPU bounded if this kind of servers
are used. Does anybody have experiences with hadoop running on servers
with so many disks.


Some of the new clusters are running one or two 6 core CPUs with 12*2TB 
3.5" HDDs for storage, as this gives maximum storage density (it fits in 
a 1U). The exact ratio of CPU:RAM:disk depends on the application.


What you get with the big servers is
 -more probability of local access
 -great IO bandwidth, especially if you set up the mapred.temp.dir 
value to include all the drives.
 -fewer servers means fewer network ports on the switches, so you can 
save some money in the network fabric, and in time/effort cabling 
everything up.


What do you lose?
 -in a small cluster, loss of a single machine matters
 -in a large cluster, loss of a single machine can generate up to 24TB 
of replication traffic (more once 3TB HDDs become affordable)
 -in a large cluster, loss of a rack (or switch) can generate a very 
large amount of traffic.


If you were building a large (multi-PB) cluster, this design is good for 
storage density -you could get a petabyte in a couple of racks, though 
the replication costs of a Top of Rack switch failure might push you 
towards 2xToR switches and bonded NICs, which introduce a whole new set 
of problems.


For smaller installations? I don't know.

-Steve


Re: Unsplittable files on HDFS

2011-04-27 Thread Steve Loughran

On 27/04/11 10:48, Niels Basjes wrote:

Hi,

I did the following with a 1.6GB file
hadoop fs -Ddfs.block.size=2147483648 -put
/home/nbasjes/access-2010-11-29.log.gz /user/nbasjes
and I got

Total number of blocks: 1
4189183682512190568:10.10.138.61:50010  
10.10.138.62:50010

Yes, that does the trick. Thank you.

Niels

2011/4/27 Harsh Jha...@cloudera.com:

Hey Niels,

The block size is a per-file property. Would putting/creating these
gzip files on the DFS with a very high block size (such that it
doesn't split across for such files) be a valid solution to your
problem here?



Don't set a block size of 2GB or more; not all the bits of the code that use 
signed 32-bit integers have been eliminated yet.
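
For completeness, the same trick done programmatically; a rough sketch 
where the paths are placeholders and the block size is kept just under 
2GB (and a multiple of the 512-byte checksum chunk):

  import java.io.FileInputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class PutWithBigBlock {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      long blockSize = 0x7FFFFE00L;  // 2GB minus 512 bytes, still a multiple of 512
      FSDataOutputStream out = fs.create(new Path(args[1]), true,
          64 * 1024, fs.getDefaultReplication(), blockSize);
      // stream the local gzip up; anything under the block size stays in one block
      IOUtils.copyBytes(new FileInputStream(args[0]), out, conf, true);
    }
  }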


Re: Seeking Advice on Upgrading a Cluster

2011-04-26 Thread Steve Loughran

On 21/04/11 18:33, Geoffry Roberts wrote:


What will give me the most bang for my buck?

- Should I bring all machines up to 8G of memory? or is 4G good enough?
(8 is the max.)


depends on whether your code is running out of memory


- Should I double up the NICs and use LACP?


I would only recommend this for increasing availability at the expense 
of time spent getting it all to work.




- Should I double up the disks and attempt to flow my I/O from one disk
to the another on the theory that this will minimizing contention?


if your app is bandwidth bound (iotop should tell you this) then yes, 
this will help.



- Should I get another switch?  (I have a 10/100, 24 port Dlink and it's
about 5 years old.)


a gigabit switch is low cost now, I'd do that as one of my actions

Why not do some experiments by going to a smaller cluster and doubling 
up its RAM and HDDs with the parts from your existing 
machines, and see which benefits your code the most?


Re: Fixing a bad HD

2011-04-26 Thread Steve Loughran

On 26/04/11 05:20, Bharath Mundlapudi wrote:

Right, if you have a hardware which supports hot-swappable disk, this might be 
easiest one. But still you will need to restart the datanode to detect this new 
disk. There is an open Jira on this.

-Bharath



That'll be HDFS-664
 https://issues.apache.org/jira/browse/HDFS-664

Nobody is working on this, all contributions welcome


Re: Fixing a bad HD

2011-04-26 Thread Steve Loughran

On 26/04/11 05:20, Bharath Mundlapudi wrote:

Right, if you have a hardware which supports hot-swappable disk, this might be 
easiest one. But still you will need to restart the datanode to detect this new 
disk. There is an open Jira on this.

-Bharath


Correction, there is a patch up there now. If you want to get involved 
in the coding of Hadoop to meet your specific needs, this might be the 
place to start




Re: HOD exception: java.io.IOException: No valid local directories in property: mapred.local.dir

2011-04-12 Thread Steve Loughran

On 11/04/2011 16:48, Boyu Zhang wrote:

Exception in thread main org.apache.hadoop.ipc.RemoteException:
java.io.IOException: No valid local directories in property:
mapred.local.dir


The job tracker can't find any of the local filesystem directories 
listed in the mapred.local.dir property; either the conf file or the 
machine is misconfigured.


Re: Reg HDFS checksum

2011-04-12 Thread Steve Loughran

On 12/04/2011 07:06, Josh Patterson wrote:

If you take a look at:

https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/ExternalHDFSChecksumGenerator.java

you'll see a single process version of what HDFS does under the hood,
albeit in a highly distributed fashion. Whats going on here is that
for every 512 bytes a CRC32 is calc'd and saved at each local datanode
for that block. when the checksum is requested, these CRC32's are
pulled together and MD5 hashed, which is sent to the client process.
The client process then MD5 hashes all of these hashes together to
produce a final hash.

For some context: Our purpose on the openPDC project for this was we
had some legacy software writing to HDFS through a FTP proxy bridge:

https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/HdfsBridge/

Since the openPDC data was ultra critical in that we could not lose
*any* data, and the team wanted to use a simple FTP client lib to
write to HDFS (least amount of work for them, standard libs), we
needed a way to make sure that no corruption occurred during the hop
through the FTP bridge (acted as intermediary to DFSClient, something
could fail, and the file might be slightly truncated, yet hard to
detect this). In the FTP bridge we allowed a custom FTP command to
call the now exposed hdfs-checksum command, and the sending agent
could then compute the hash locally (in the case of the openPDC it was
done in C#), and make sure the file made it there intact. This system
has been in production for over a year now storing and maintaining
smart grid data and has been highly reliable.

I say all of this to say: After having dug through HDFS's checksumming
code I am pretty confident that its Good Stuff, although I dont
proclaim to be a filesystem expert by any means. It may be just some
simple error or oversight in your process, possibly?


Assuming it came down over HTTP, it's perfectly conceivable that 
something went wrong on the way, especially if a proxy server gets 
involved. All HTTP checks is that the (optional) content length is 
consistent with what arrived -it relies on TCP checksums, which verify 
the network links work, but not the other bits of the system in the way 
(like any proxy server)
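
If you want an end-to-end check rather than trusting the transport, the 
checksum Josh describes is exposed through the FileSystem API; a rough 
sketch -paths are placeholders, and the values are only comparable when 
the block and checksum-chunk sizes match on both sides:

  import java.util.Arrays;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileChecksum;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ChecksumCompare {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path a = new Path(args[0]);   // e.g. hdfs://nn1:8020/data/file
      Path b = new Path(args[1]);   // e.g. hdfs://nn2:8020/data/file
      FileChecksum ca = FileSystem.get(a.toUri(), conf).getFileChecksum(a);
      FileChecksum cb = FileSystem.get(b.toUri(), conf).getFileChecksum(b);
      // getFileChecksum() returns null on filesystems that don't support it
      boolean same = ca != null && cb != null
          && ca.getAlgorithmName().equals(cb.getAlgorithmName())
          && Arrays.equals(ca.getBytes(), cb.getBytes());
      System.out.println(same ? "checksums match" : "checksums differ or unavailable");
    }
  }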


Re: Hadoop Pipes Error

2011-03-31 Thread Steve Loughran

On 31/03/11 07:53, Adarsh Sharma wrote:

Thanks Amareshwari,

here is the posting :
The nopipe example needs more documentation. It assumes that it is run
with the InputFormat from src/test/org/apache/hadoop/mapred/pipes/
WordCountInputFormat.java, which has a very specific input split
format. By running with a TextInputFormat, it will send binary bytes as
the input split and won't work right. The nopipe example should
probably be recoded to use libhdfs too, but that is more complicated
to get running as a unit test. Also note that since the C++ example is
using local file reads, it will only work on a cluster if you have nfs
or something working across the cluster.

Please need if I'm wrong.

I need to run it with TextInputFormat.

If posiible Please explain the above post more clearly.



Here goes.

1.
 The nopipe example needs more documentation. It assumes that it is run
 with the InputFormat from src/test/org/apache/hadoop/mapred/pipes/
 WordCountInputFormat.java, which has a very specific input split
 format. By running with a TextInputFormat, it will send binary bytes as
 the input split and won't work right.

The input for the pipe is the content generated by
src/test/org/apache/hadoop/mapred/pipes/WordCountInputFormat.java

This is covered here.
http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Example%3A+WordCount+v1.0

I would recommend following the tutorial here, or either of the books 
"Hadoop: The Definitive Guide" or "Hadoop in Action". Both authors earn 
their money by explaining how to use Hadoop, which is why both books are 
good explanations of it.


2.
The nopipe example should
 probably be recoded to use libhdfs too, but that is more complicated
 to get running as a unit test.

Ignore that -it's irrelevant for your problem as Owen is discussing 
automated testing.


3.

 Also note that since the C++ example is
 using local file reads, it will only work on a cluster if you have nfs
 or something working across the cluster.

unless your cluster has a shared filesystem at the OS level it won't 
work. Either have a shared filesystem like NFS, or run it on a single 
machine.


-Steve






Re: does counters go the performance down seriously?

2011-03-29 Thread Steve Loughran

On 28/03/11 23:34, JunYoung Kim wrote:

hi,

this linke is about hadoop usage for the good practices.

http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/
 by Arun C Murthy

if I want to use about 50,000 counters for a job, does it cause serious 
performance down?





Yes, you will use up lots of JT memory and so put limits on the overall 
size of your cluster.


If you have a small cluster and can crank up the memory settings on the 
JT to 48 GB this isn't going to be an issue, but as Y! are topping out 
at these numbers anyway, lots of counters just overload them.
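
To make that concrete -an illustrative old-API mapper, nothing more- a 
fixed enum of counters is cheap, while the pattern that gets people to 
50,000 is deriving counter names from the data itself:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class CountingMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    enum Records { GOOD, MALFORMED }   // a handful of fixed counters: fine

    private final IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      if (value.toString().trim().length() == 0) {
        reporter.incrCounter(Records.MALFORMED, 1);
        return;
      }
      reporter.incrCounter(Records.GOOD, 1);
      // reporter.incrCounter("by-key", value.toString(), 1);
      // ^ one counter per distinct key is how a job ends up with 50,000 of them
      out.collect(value, one);
    }
  }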





Re: ant version problem

2011-03-28 Thread Steve Loughran

On 27/03/11 21:02, Daniel McEnnis wrote:

Steve,

Here it is:

user@ubuntu:~/src/trunk$ ant -diagnostics
--- Ant diagnostics report ---
Apache Ant version 1.8.0 compiled on May 9 2010

---
  Implementation Version
---
core tasks : 1.8.0 in file:/usr/share/ant/lib/ant.jar
optional tasks : 1.8.0 in file:/usr/share/ant/lib/ant-nodeps.jar



OK, that's a linux distro install without any extra jars. There's enough 
in a JVM these days that the basic operations will all work.










compile-rcc-compiler:
[javac] /home/user/src/trunk/build.xml:333: warning:
'includeantruntime' was not set, defaulting to
build.sysclasspath=last; set to false for repeatable builds

BUILD FAILED
/home/user/src/trunk/build.xml:338: taskdef A class needed by class
org.apache.hadoop.record.compiler.ant.RccTask cannot be found: Task
  using the classloader


Looking at that class,
http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20/src/core/org/apache/hadoop/record/compiler/ant/RccTask.java?view=markup

I don't see any obvious dependencies on non-standard things,

import java.io.File;
import java.util.ArrayList;
import org.apache.hadoop.record.compiler.generated.Rcc;
import org.apache.tools.ant.BuildException;
import org.apache.tools.ant.DirectoryScanner;
import org.apache.tools.ant.Project;
import org.apache.tools.ant.Task;
import org.apache.tools.ant.types.FileSet;

The only odd one is the generated Rcc task.

That taskdef message from ant is saying something imported is missing.

Try running ant in debug mode (ant -debug) to see if it gives a clue, 
but there's very little that anyone else can do here.


Re: observe the effect of changes to Hadoop

2011-03-27 Thread Steve Loughran

On 25/03/2011 14:10, bikash sharma wrote:

Hi,
For my research project, I need to add a couple of functions in
JobTracker.java source file to include additional information about
TaskTrackers resource usage through heartbeat messages. I made those changes
to JobTracker.java file. However, I am not  very clear how to see these
effects. I mean what are the next steps in terms of building the entire
Hadoop code base, using the built distribution and installing it again in
the cluster, etc?


If you are working with the Job Tracker you only need to rebuild the 
mapreduce JARs and push the new JAR out to the Job Tracker server, 
restart that process.


For more safety, put the same JAR on all the task trackers and shut down 
HDFS before the updates, but that's potentially overkill



Any elaborate updates on these will be very useful since I do not have much
experience in doing modifications to Hadoop like huge code base and
observing the effects of these changes.


I'd recommend getting everything working on a local machine in a single VM 
(the MiniMRCluster class helps), then moving to multiple VMs and finally, 
if the code looks good, a real cluster with data you don't value.
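
A rough sketch of the MiniMRCluster route -these classes live in the 
Hadoop test jars, so this runs on the test classpath; the sizes and 
names are arbitrary:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hdfs.MiniDFSCluster;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MiniMRCluster;

  public class MiniClusterSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      MiniDFSCluster dfs = new MiniDFSCluster(conf, 2, true, null);  // 2 datanodes
      MiniMRCluster mr =
          new MiniMRCluster(2, dfs.getFileSystem().getUri().toString(), 1);
      try {
        JobConf jobConf = mr.createJobConf();  // points at the in-process JobTracker
        // submit a small job with jobConf here and watch what your patched
        // JobTracker does with the extra heartbeat information
      } finally {
        mr.shutdown();
        dfs.shutdown();
      }
    }
  }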


-steve


Re: ant version problem

2011-03-27 Thread Steve Loughran

On 27/03/2011 02:01, Daniel McEnnis wrote:

Dear Hadoop,

Which version of ant do I need to keep the hadoop build from failing.
Netbeans ant works as well as eclipse ant works. However, ant 1.8.2
does not, nor does the
default ant from Ubuntu 10.10.  Snippet from failure to follow:



1.8.2 will work, but when you install via a linux distro the classpath 
is trickier to set up, as you need to get all dependent jars on the 
classpath.


what does ant -diagnostics say?



compile-rcc-compiler:
[javac] /home/user/src/trunk/build.xml:333: warning:
'includeantruntime' was not set, defaulting to
build.sysclasspath=last; set to false for repeatable builds

BUILD FAILED
/home/user/src/trunk/build.xml:338: taskdef A class needed by class
org.apache.hadoop.record.compiler.ant.RccTask cannot be found: Task
  using the classloader
AntClassLoader[/home/user/src/trunk/build/classes:/home/user/src/trunk/conf:/home/user/.ivy2/cache/commons-logging/commons-logging/jars/commons-logging-1.1.1.jar:/home/user/.ivy2/cache/log4j/log4j/jars/log4j-1.2.15.jar:/home/user/.ivy2/cache/commons-httpclient/commons-httpclient/jars/commons-httpclient-3.1.jar:/home/user/.ivy2/cache/commons-codec/commons-codec/jars/commons-codec-1.4.jar:/home/user/.ivy2/cache/commons-cli/commons-cli/jars/commons-cli-1.2.jar:/home/user/.ivy2/cache/xmlenc/xmlenc/jars/xmlenc-0.52.jar:/home/user/.ivy2/cache/net.java.dev.jets3t/jets3t/jars/jets3t-0.7.1.jar:/home/user/.ivy2/cache/commons-net/commons-net/jars/commons-net-1.4.1.jar:/home/user/.ivy2/cache/org.mortbay.jetty/servlet-api-2.5/jars/servlet-api-2.5-6.1.14.jar:/home/user/.ivy2/cache/net.sf.kosmosfs/kfs/jars/kfs-0.3.jar:/home/user/.ivy2/cache/org.mortbay.jetty/jetty/jars/jetty-6.1.14.jar:/home/user/.ivy2/cache/org.mortbay.jetty/jetty-util/jars/jetty-util-6.1.14.jar:/home/user/.ivy2/cache/t

omcat/jasper-runtime/jars/jasper-runtime-5.5.12.jar:/home/user/.ivy2/cache/tomcat/jasper-compiler/jars/jasper-compiler-5.5.12.jar:/home/user/.ivy2/cache/org.mortbay.jetty/jsp-api-2.1/jars/jsp-api-2.1-6.1.14.jar:/home/user/.ivy2/cache/org.mortbay.jetty/jsp-2.1/jars/jsp-2.1-6.1.14.jar:/home/user/.ivy2/cache/commons-el/commons-el/jars/commons-el-1.0.jar:/home/user/.ivy2/cache/oro/oro/jars/oro-2.0.8.jar:/home/user/.ivy2/cache/jdiff/jdiff/jars/jdiff-1.0.9.jar:/home/user/.ivy2/cache/junit/junit/jars/junit-4.8.1.jar:/home/user/.ivy2/cache/hsqldb/hsqldb/jars/hsqldb-1.8.0.10.jar:/home/user/.ivy2/cache/commons-logging/commons-logging-api/jars/commons-logging-api-1.1.jar:/home/user/.ivy2/cache/org.slf4j/slf4j-api/jars/slf4j-api-1.5.11.jar:/home/user/.ivy2/cache/org.eclipse.jdt/core/jars/core-3.1.1.jar:/home/user/.ivy2/cache/org.slf4j/slf4j-log4j12/jars/slf4j-log4j12-1.5.11.jar:/home/user/.ivy2/cache/org.apache.hadoop/avro/jars/avro-1.3.2.jar:/home/user/.ivy2/cache/org.codehaus.jackson/j
ackson-mapper-asl/jars/jackson-mapper-asl-1.4.2.jar:/home/user/.ivy2/cache/org.codehaus.jackson/jackson-core-asl/jars/jackson-core-asl-1.4.2.jar:/home/user/.ivy2/cache/com.thoughtworks.paranamer/paranamer/jars/paranamer-2.2.jar:/home/user/.ivy2/cache/com.thoughtworks.paranamer/paranamer-ant/jars/paranamer-ant-2.2.jar:/home/user/.ivy2/cache/com.thoughtworks.paranamer/paranamer-generator/jars/paranamer-generator-2.2.jar:/home/user/.ivy2/cache/com.thoughtworks.qdox/qdox/jars/qdox-1.10.1.jar:/home/user/.ivy2/cache/asm/asm/jars/asm-3.2.jar:/home/user/.ivy2/cache/commons-lang/commons-lang/jars/commons-lang-2.5.jar:/home/user/.ivy2/cache/org.aspectj/aspectjrt/jars/aspectjrt-1.6.5.jar:/home/user/.ivy2/cache/org.aspectj/aspectjtools/jars/aspectjtools-1.6.5.jar:/home/user/.ivy2/cache/org.mockito/mockito-all/jars/mockito-all-1.8.2.jar:/home/user/.ivy2/cache/com.jcraft/jsch/jars/jsch-0.1.42.jar]


Total time: 6 seconds

Sincerely,

Daniel McEnnis.




Re: CDH and Hadoop

2011-03-24 Thread Steve Loughran

On 23/03/11 15:32, Michael Segel wrote:


Rita,

It sounds like you're only using Hadoop and have no intentions to really get 
into the internals.

I'm like most admins/developers/IT guys and I'm pretty lazy.
I find it easier to set up the yum repository and then issue the yum install 
hadoop command.

The thing about Cloudera is that they do back port patches so that while their 
release is 'heavily patched'.
But they are usually in some sort of sync with the Apache release. Since you're 
only working with HDFS and its pretty stable, I'd say go with the Cloudera 
release.


to be fair, the Y! version of 0.20.x has all the backports to do with 
scale; on a large cluster I'd pick up that one, with the understanding 
that if you have support problems, you can't pay Cloudera to hold your hand.


If you have any plans to get involved in the Hadoop & friends code, to 
move from a user to a contributor, you should get with the official 
releases. Similarly, if you have some problem and want to file a bug, 
you should get the latest official release and test with that, as

 -the first question on the bug report will be "is it still there?"
 -you'll need to help debug it.

Going forward, there are plans to do RPM and ideally deb artifacts of 
0.22 and later versions of Hadoop, making them easier to install. This 
still leaves the question of who supports it, the answers being you, or 
anyone you pay to, that being the way open source works


-steve



Re: Creating bundled jar files for running under hadoop

2011-03-23 Thread Steve Loughran

On 22/03/11 13:34, Andy Doddington wrote:

I am trying to create a bundled jar file for running using the hadoop ‘jar’ 
command. However, when I try to do this it fails to
find the jar files and other resources that I have placed into the jar (pointed 
at by the Class-Path property in the MANIFEST.MF).

I have unpacked the jar file into a temporary directory and run it manually 
using the java -jar command (after manually listing
the various hadoop jar files that are required in the classpath) and it works 
fine in this mode.

I have read on some places that the workaround is to put my jar files and other 
resources into the hadoop ‘lib’ directory, but
I really don’t want to do this, since it feels horribly kludgy.


Putting them in lib/ is bad as you need to restart the cluster to get 
changes out, and you can't have jobs with different versions in them


use the -libjars option instead
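
Note that -libjars is handled by the GenericOptionsParser, so it is only 
honoured if the driver goes through ToolRunner (or parses the generic 
options itself). A minimal skeleton, with made-up names:

  //  hadoop jar myjob.jar com.example.MyDriver -libjars lib-a.jar,lib-b.jar in out
  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class MyDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
      // the jars listed after -libjars get shipped with the job and put on the
      // task classpath; set up and submit the job as usual from getConf()
      return 0;
    }
    public static void main(String[] args) throws Exception {
      System.exit(ToolRunner.run(new MyDriver(), args));
    }
  }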


I don’t mind creating a ‘lib’ directory in my jar file, since this will be
hidden, but I would appreciate more documentation as to why/how I need to do 
this.


because the class-path manifest thing is designed for classloaders in 
the local filesystem where relative paths are used to find JARs and 
everything runs locally, and is only handled by the normal Java main 
entry point and the applet loader.




Can anybody advise as to how I can get 'hadoop jar’ to honour the Class-Path 
entry in the MANIFEST.MF?


You'd probably have to write the code to
 -parse the entry at job submission
 -find the JARs or fail
 -somehow include all these JARs in the list of dependencies for the 
job submission list via a transform to the -libjars command.

 -add the tests for this

I'm not sure it's worth the effort.



Re: decommissioning node woes

2011-03-21 Thread Steve Loughran

On 19/03/11 16:00, Ted Dunning wrote:

Unfortunately this doesn't help much because it is hard to get the ports to
balance the load.

On Fri, Mar 18, 2011 at 8:30 PM, Michael Segelmichael_se...@hotmail.comwrote:


With a 1GBe port, you could go 100Mbs for the bandwidth limit.
If you bond your ports, you could go higher.





Port bonding is possible, it's just harder to
 -set up all the cabling
 -be sure both ports are fully utilised

It's less expensive than 10G ether because those switches cost a lot 
more, and with 2x1 you can have separate ToR switches for more redundancy.


For decommissioning, why not boost the rebalance bandwidth before you 
trigger the decommission, then drop it afterwards.


-steve


Re: Installing Hadoop on Debian Squeeze

2011-03-21 Thread Steve Loughran

On 21/03/11 09:00, Dieter Plaetinck wrote:

On Thu, 17 Mar 2011 19:33:02 +0100
Thomas Kochtho...@koch.ro  wrote:


Currently my advise is to use the Debian packages from cloudera.


That's the problem, it appears there are none.
Like I said in my earlier mail, Debian is not in Cloudera's list of
supported distros, and they do not have a repository for Debian
packages. (I tried the ubuntu repository but that didn't work)

I now have installed it by just downloading and extracting the
tarball, it seems that's basically all that is needed.


Dieter


There's an open JIRA on having Apache release its own Hadoop RPMs; 
pushing out Debian packages would go alongside this, but that requires 
someone else to volunteer the work...


Public Talks from Yahoo! and LinkedIn in Bristol, England, Friday Mar 25

2011-03-18 Thread Steve Loughran


This isn't relevant for people who don't live in or near South England 
or Wales, but for those that do, I'm pleased to announce that Owen 
O'Malley and Sanjay Radia of Yahoo! and Jakob Homan of LinkedIn will all 
be giving public talks on Hadoop on Friday March 25 at HP Laboratories, 
in Bristol.


http://hphadoop.eventbrite.com/

If you can come along, great! If not, they are giving the same talks in 
London on the Wednesday -at a talk that is already booked up-, where 
Yahoo! will be videoing the talks for them to go up online afterwards. 
You'll all be able to catch up from the luxury of your own laptop.


Looking forward to meeting other Hadoop developers and users next week,

Steve


Re: Hadoop code base splits

2011-03-17 Thread Steve Loughran

On 17/03/11 07:05, Matthew John wrote:

Hi,

Can someone provide me some pointers on the following details of
Hadoop code base:

1) breakdown of HDFS code base (approximate lines of code) into
following modules:
  - HDFS at the Datanodes
  - Namenode
  - Zookeeper
  - MapReduce based
  - Any other relevant split

2) breakdown of Hbase code into following modules:
  - HMaster
  - RegionServers
  - MapReduce
  - Any other relevant split



You are free to check out the source code and do whatever analysis you 
want. You can also look at the entire SVN history and do some really 
interesting analysis, especially if you have any data mining tooling to 
hand, like a small hadoop cluster.




Re: Load testing in hadoop

2011-03-15 Thread Steve Loughran

On 15/03/11 04:59, Kannu wrote:


Please tell me how to use synthetic load generator in hadoop or suggest me
any other way of load testing in hadoop cluster.

thanks,
kannu


terasort is the one most people use, as it generates its own datasets. 
Otherwise you need a few TB of data and some custom code to run on it.


Gridmix/gridmix2 can also simulate load.

All this stuff is in the hadoop source tree, with documentation


Re: Speculative execution

2011-03-03 Thread Steve Loughran

On 02/03/11 21:01, Keith Wiley wrote:

I realize that the intended purpose of speculative execution is to overcome individual 
slow tasks...and I have read that it explicitly is *not* intended to start copies of a 
task simultaneously and to then race them, but rather to start copies of tasks that 
seem slow after running for a while.

...but aside from merely being slow, sometimes tasks arbitrarily fail, and not 
in data-driven or otherwise deterministic ways.  A task may fail and then 
succeed on a subsequent attempt...but the total job time is extended by the 
time wasted during the initial failed task attempt.


yes, but the problem is determining which one will fail. Ideally you 
should find the root cause, which is often some race condition or 
hardware fault. If it's the same server every time, turn it off.




It would super-swell to run copies of a task simultaneously from the starting line and simply kill 
the copies after the winner finishes.  While is is wasteful in some sense (that is the 
argument offered for not running speculative execution this way to begin with), it would more 
precise to say that different users may have different priorities under various use-case scenarios. 
 The wasting of duplicate tasks on extra cores may be an acceptable cost toward the 
higher priority of minimizing job times for a given application.



Is there any notion of this in Hadoop?


You can play with the specex parameters, maybe change when they get 
kicked off. The assumption in the code is that the slowness is caused by 
H/W problems (especially HDD issues) and it tries to avoid duplicate 
work. If every Map was duplicated, you'd be doubling the effective cost 
of each query, and annoying everyone else in the cluster. Plus increased 
disk and network IO might slow things down.


Look at the options, have a play and see. If it doesn't have the 
feature, you can always try coding it in -if the scheduler API lets it 
do it, you won't be breaking anyone else's code.
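
For reference, the per-job switches as they stand in 0.20, so one job 
can experiment without touching anyone else's:

  import org.apache.hadoop.mapred.JobConf;

  public class SpeculationSettings {
    public static void apply(JobConf conf) {
      // mapred.map.tasks.speculative.execution
      conf.setMapSpeculativeExecution(true);
      // mapred.reduce.tasks.speculative.execution
      conf.setReduceSpeculativeExecution(false);
    }
  }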


-steve


Re: recommendation on HDDs

2011-02-14 Thread Steve Loughran

On 10/02/11 22:25, Michael Segel wrote:


Shrinivas,

Assuming you're in the US, I'd recommend the following:

Go with 2TB 7200 SATA hard drives.
(Not sure what type of hardware you have)

What  we've found is that in the data nodes, there's an optimal configuration 
that balances price versus performance.

While your chasis may hold 8 drives, how many open SATA ports are on the 
motherboard? Since you're using JBOD, you don't want the additional expense of 
having to purchase a separate controller card for the additional drives.



I'm not going to disagree about cost, but I will note that a single 
controller can become a bottleneck once you add a lot of disks to it; it 
generates lots of interrupts that go to the same core, which then ends 
up at 100% CPU and overloaded. With two controllers the work can get 
spread over two CPUs, moving the bottlenecks back into the IO channels.


For that reason I'd limit the #of disks for a single controller at 
around 4-6.


Remember as well as storage capacity, you need disk space for logs, 
spill space, temp dirs, etc. This is why 2TB HDDs are looking appealing 
these days


Speed? 10K RPM has a faster seek time and possibly bandwidth but you pay 
in capital and power. If the HDFS blocks are laid out well, seek time 
isn't so important, so consider saving the money and putting it elsewhere.


The other big question with Hadoop is RAM and CPU, and the answer there 
is it depends. RAM depends on the algorithm, as can the CPU:spindle 
ratio ... I recommend 1 core to 1 spindle as a good starting point. In a 
large cluster the extra capital costs of a second CPU compared to the 
amount of extra servers and storage that you could get for the same 
money speaks in favour of more servers, but in smaller clusters the 
spreadsheets say different things.


-Steve

(disclaimer, I work for a server vendor :)


Re: recommendation on HDDs

2011-02-14 Thread Steve Loughran

On 12/02/11 16:26, Michael Segel wrote:


All,

I'd like to clarify somethings...

First the concept is to build out a cluster of commodity hardware.
So when you do your shopping you want to get the most bang for your buck. That 
is the 'sweet spot' that I'm talking about.
When you look at your E5500 or E5600 chip sets, you will want to go with 4 
cores per CPU, dual CPU and a clock speed around 2.53GHz or so.
(Faster chips are more expensive and the performance edge falls off so you end 
up paying a premium.)


Interesting choice; the 7 core in a single CPU option is something else 
to consider. Remember also this is a moving target: what anyone says is 
valid now (Feb 2011) will be seen as quaint in two years' time. Even a 
few months from now, what is the best value for a cluster will have moved on.




Looking at your disks, you start with using the on board SATA controller. Why? 
Because it means you don't have to pay for a controller card.
If you are building a cluster for general purpose computing... Assuming 1U boxes you 
have room for 4 3.5 SATA which still give you the best performance for your 
buck.
Can you go with 2.5? Yes, but you are going to be paying a premium.

Price wise, a 2TB SATA II 7200 RPM drive is going to be your best deal. You 
could go with SATA III drives if your motherboard supports the SATA III ports, 
but you're still paying a slight premium.

The OP felt that all he would need was 1TB of disk and was considering 4 250GB 
drives. (More spindles...yada yada yada...)

My suggestion is to forget that nonsense and go with one 2 TB drive because its 
a better deal and if you want to add more disk to the node, you can. (Its 
easier to add disk than it is to replace it.)

Now do you need to create a spare OS drive? No. Some people who have an 
internal 3.5 space sometimes do. That's ok, and you can put your hadoop logging 
there. (Just make sure you have a lot of disk space...)


One advantage of a specific drive for OS and log (in a separate 
partition) is you can re-image it without losing data you care about, 
and swap in a replacement fast. If you have a small cluster set up for 
hotswap, that reduces the time a node is down -just have a spare OS HDD 
ready to put in. OS disks are the ones you care about when they fail; 
the others are more a case of being mildly concerned about the failure 
rate than something to page you over.




The truth is that there really isn't any single *right* answer. There are a lot 
of options and budget constraints as well as physical constraints like power, 
space, and location of the hardware.


+1. don't forget weight either.



Also you may be building out a cluster who's main purpose is to be a backup 
location for your cluster. So your production cluster has lots of nodes. Your 
backup cluster has lots of disks per node because your main focus is as much 
storage per node.

So here you may end up buying a 4U rack box, load it up with 3.5 drives and a 
couple of SATA controller cards. You care less about performance but more about 
storage space. Here you may say 3TB SATA drives w 12 or more per box. (I don't know 
how many you can fit in to a 4U chassis these days.  So you have 10 DN backing up a 
100+ DN cluster in your main data center. But that's another story.


You can get 12 HDDs in a 1U if you ask nicely, but in a small cluster 
there's a cost, that server can be a big chunk of your filesystem, and 
if it goes down there's up to 24TB worth of replication going to take 
place over the rest of the network, so you'll need at least 24TB of 
spare capacity on the other machines, ignoring bandwidth issues.




I think the main takeaway you should have is that if you look at the price 
point... your best price per GB is on a 2TB drive until the prices drop on 3TB 
drives.
Since the OP believes that their requirement is 1TB per node... a single 2TB 
would be the best choice. It allows for additional space and you really 
shouldn't be too worried about disk i/o being your bottleneck.



One less thing to worry about is good.


Re: CUDA on Hadoop

2011-02-10 Thread Steve Loughran

On 09/02/11 17:31, He Chen wrote:

Hi sharma

I shared our slides about CUDA performance on Hadoop clusters. Feel free to
modify them, but please mention the copyright!


This is nice. If you stick it up online you should link to it from the 
Hadoop wiki pages -maybe start a hadoop+cuda page and refer to it




Re: hadoop infrastructure questions (production environment)

2011-02-09 Thread Steve Loughran

On 08/02/11 15:45, Oleg Ruchovets wrote:

Hi , we are going to production and have some questions to ask:

We are using 0.20_append  version (as I understand it is  hbase 0.90
requirement).


1) Currently we have to process 50GB of text files per day; it can grow to
150GB.
   -- What is the best hadoop file size for our load, and is there a
suggested disk block size for that size?


That depends on the number of machines and their performance. The smaller the 
blocks, the more maps there are that can each be assigned a block, but it puts 
more load on the namenode and job tracker.
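
If you want to experiment rather than change the cluster-wide dfs.block.size, 
you can set the block size per file at write time. A rough, untested sketch 
against the 0.20 API (the path and sizes here are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Per-file override of the block size; the cluster default comes
    // from dfs.block.size in hdfs-site.xml.
    long blockSize = 256L * 1024 * 1024;   // 256MB
    short replication = 3;
    int bufferSize = 4096;
    FSDataOutputStream out = fs.create(
        new Path("/data/2011-02-08/events.log"),   // invented path
        true, bufferSize, replication, blockSize);
    try {
      out.writeBytes("one line of data\n");
    } finally {
      out.close();
    }
  }
}

Bigger blocks mean fewer maps per file and less namenode metadata; smaller 
blocks mean more maps but more NN/JT load, so tune it against your file sizes.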



   -- We worked using gz and I saw that every file was assigned 1 map task.
   What is the best practice: to work with gz files and save
disk space, or to work without archiving?


Hadoop sequence files can be compressed on a per-block basis. It's not 
as efficient as gz, but reduces your storage and network load.
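
Roughly like this, in case it helps -an untested sketch against the 0.20 io API, 
with an invented path, and using DefaultCodec so it works without the native 
compression libraries:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class BlockCompressedWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/data/logs/2011-02-08.seq");   // invented path
    // BLOCK compression batches many records together before compressing,
    // which gets much better ratios than compressing record by record.
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out,
        LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK, new DefaultCodec());
    try {
      writer.append(new LongWritable(1L), new Text("first log line"));
      // ... append the rest of the day's records ...
    } finally {
      writer.close();
    }
  }
}

Unlike a plain .gz file, a block-compressed sequence file is still splittable, 
so you keep one map per HDFS block instead of one map per file.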




   Let's say we want to get performance benefits and disk
space is less critical.

2)  Currently, when adding an additional machine to the grid, we need to manually
maintain all files and configurations.
  Is it possible to auto-deploy hadoop servers without the need to
manually define each one on all nodes?


That's the only way people do it in production clusters: you use 
Configuration Management (CM) tools. Which one you use is your choice, 
but do use one.





3) Can we change masters without reinstalling the entire grid


-if you can push out a new configuration and restart the workers, you 
can move the master nodes to any machine in the cluster after a failure.


-if you want to leave the nn and JT hostnames the same but change IP 
addresses, you need to restart all the workers, and make sure the DNS 
entries of the master nodes are set to expire rapidly so the OS doesn't 
cache them for long.


-if you have machines set up with the same hostname and IP addresses, 
then you can bring them up as the masters, just have the namenode 
recover the edit log.


Re: CUDA on Hadoop

2011-02-09 Thread Steve Loughran

On 09/02/11 13:58, Harsh J wrote:

You can check-out this project which did some work for Hama+CUDA:
http://code.google.com/p/mrcl/


Amazon let you bring up a Hadoop cluster on machines with GPUs you can 
code against, but I haven't heard of anyone using it. The big issue is 
bandwidth; it just doesn't make sense for a classic scan through the 
logs kind of problem as the disk:GPU bandwidth ratio is even worse than 
disk:CPU.


That said, if you were doing something that involved a lot of compute on 
a block of data (e.g. rendering tiles in a map), this could work.


Re: How to speed up of Map/Reduce job?

2011-02-01 Thread Steve Loughran

On 01/02/11 08:19, Igor Bubkin wrote:

Hello everybody

I have a problem. I installed Hadoop on a 2-node cluster and ran the Wordcount
example. It takes about 20 sec to process a 1.5MB text file. We want to
use Map/Reduce in real time (interactively: by user's requests). A user can't
wait 20 sec for his request. This is too long. Is it possible to reduce the time
of a Map/Reduce job? Or maybe I misunderstand something?


1. I'd expect a minimum 30s query time due to the way work gets queued 
and dispatched, JVM startup costs etc. There is no way to eliminate this 
in Hadoop's current architecture.


2. 1.5M is a very small file size; I'm currently recommending a block 
size of 512M in new clusters for various reasons. This size of data is 
just too small to bother with distribution. Load it up into memory; 
analyse it locally. Things like Apache CouchDB also support MapReduce.
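
To put that in perspective, counting words in a 1.5MB file is a few milliseconds 
of plain Java on a single box -no cluster needed. Something like this untested 
sketch:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class LocalWordCount {
  public static void main(String[] args) throws Exception {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        for (String word : line.split("\\s+")) {
          if (word.length() == 0) continue;     // skip empty tokens
          Integer c = counts.get(word);
          counts.put(word, c == null ? 1 : c + 1);
        }
      }
    } finally {
      in.close();
    }
    System.out.println(counts);
  }
}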


Hadoop is not designed for clusters of less than about 10 machines (not 
enough redundancy of storage), or for small datasets. If your problems 
aren't big enough, use different tools, because Hadoop contains design 
decisions and overheads that only make sense once your data is measured 
in GB and your filesystem in tens to thousands of Terabytes.


Re: Hadoop is for whom? Data architect or Java Architect or All

2011-01-27 Thread Steve Loughran

On 27/01/11 07:28, Manuel Meßner wrote:

Hi,

you may want to take a look at the streaming api, which allows users
to write their map-reduce jobs in any language that is capable of
writing to stdout and reading from stdin.

http://hadoop.apache.org/mapreduce/docs/current/streaming.html

furthermore pig and hive are hadoop related projects and may be of
interest for non java people:

http://pig.apache.org/
http://hive.apache.org/

So finally my answer: no it isn't ;)


Helps if your ops team have some experience in running java app servers 
or similar, as well as large linux clusters


Re: Best way to limit the number of concurrent tasks per job on hadoop 0.20.2

2011-01-27 Thread Steve Loughran

On 27/01/11 10:51, Renaud Delbru wrote:

Hi Koji,

thanks for sharing the information,
Is the 0.20-security branch planned to become an official release at some
point?

Cheers


If you can play with the beta, you can see whether it works for you and, if 
not, get bugs fixed during the beta cycle.


http://people.apache.org/~acmurthy/hadoop-0.20.100-rc0/


Re: Why Hadoop is slow in Cloud

2011-01-21 Thread Steve Loughran

On 20/01/11 23:24, Marc Farnum Rendino wrote:

On Wed, Jan 19, 2011 at 2:50 PM, Edward Caprioloedlinuxg...@gmail.com  wrote:

As for virtualization, paravirtualization, emulation... (whatever-ulization)


Wow; that's a really big category.


There are always a lot of variables, but the net result is always
less. It may be 2%, 10% or 15%, but it is always less.


If it's less of something I don't care about, it's not a factor (for me).

On the other hand, if I'm paying less and getting more of what I DO
care about, I'd rather go with that.

It's about the cost/benefit *ratio*.


There's also perf vs storage. On a big cluster, you could add a second 
Nehalem CPU and maybe get 10-15% boost on throughput, or for the same 
capex and opex add 10% new servers, which at scale means many more TB of 
storage and the compute to go with it. The decision rests with the team 
and their problems.


Re: Why Hadoop is slow in Cloud

2011-01-21 Thread Steve Loughran

On 21/01/11 09:20, Evert Lammerts wrote:

Even with the performance hit, there are still benefits running Hadoop this way:

   -as you only consume/pay for CPU time you use, if you are only running
batch jobs, it's lower cost than having a hadoop cluster that is under-used.

   -if your data is stored in the cloud infrastructure, then you need to
data mine it in VMs, unless you want to take the time and money hit of
moving it out, and have somewhere to store it.

   -if the infrastructure lets you, you can lock down the cluster so it is
secure.

Where a physical cluster is good is that it is a very low cost way of
storing data, provided you can analyse it with Hadoop, and provided you
can keep that cluster busy most of the time, either with Hadoop work or
other scheduled work. If your cluster is idle for computation, you are
still paying the capital and (reduced) electricity costs, so the cost of
storage and what compute you do effectively increases.


Agreed, but this has little to do with Hadoop as a middleware and more to do
with the benefits of virtualized vs physical infrastructure. I agree that it
is convenient to use HDFS as a DFS to keep your data local to your VMs, but
you could choose other DFS's as well.


We don't use HDFS, we bring up VMs close to where the data persists.

http://www.slideshare.net/steve_l/high-availability-hadoop



The major benefit of Hadoop is its data-locality principle, and this is what
you give up when you move to the cloud. Regardless of whether you store your
data within your VM or on a NAS, it *will* have to travel over a line. As
soon as that happens you lose the benefit of data-locality and are left with
MapReduce as a way for parallel computing. And in that case you could use
less restrictive software, like maybe PBS. You could even install HOD on
your virtual cluster, if you'd like the possibility of MapReduce.


We don't suffer locality hits so much, but you do pay for the extra 
infrastructure costs of a more agile datacentre, and if you go to 
redundancy in hardware over replication, you have fewer places to run 
your code.


Even on EC2 -which doesn't let you tell its VM placer what datasets you want to 
play with, so it can't use that in its placement decisions- once data is in the 
datanodes you do get locality.




Adarsh, there are probably results around of more generic benchmark tools
(Phoronix, POV-Ray, ...) for I/O and CPU performance in a VM. Those should
give you a better idea of the penalties of virtualization. (Our experience
with a number of technologies on our OpenNebula cloud is, like Steve points
out, that you mainly pay for disk I/O performance.)


-would be interesting to see anything you can publish there...



I think a decision to go with either cloud or physical infrastructure should
be based on the frequency, intensity and types of computation you expect on
the short term (that should include operations dealing with data), and the
way you think these parameters will develop on a mid-long term. And then
compare the prices of a physical cluster that meets those demands (make sure
to include power and operations) and the investment you would otherwise need
to make in Cloud.


+1



Re: namenode format error during setting up hadoop using eclipse in windows 7

2011-01-20 Thread Steve Loughran

On 20/01/11 10:26, arunk786 wrote:



Arun K@sairam ~/hadoop-0.19.2
$ bin/hadoop namenode -format
cygwin warning:
   MS-DOS style path detected:
C:\cygwin\home\ARUNK~1\HADOOP~1.2\/build/native
   Preferred POSIX equivalent is: /home/ARUNK~1/HADOOP~1.2/build/native
   CYGWIN environment variable option nodosfilewarning turns off this
warning.
   Consult the user's guide for more details about POSIX paths:
 http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
bin/hadoop: line 243: /cygdrive/c/Program: No such file or directory
11/01/19 03:53:13 INFO namenode.NameNode: STARTUP_MSG:
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = sairam/10.0.1.105
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.19.2
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/common/branches/b
ranch-0.19 -r 789657; compiled by 'root' on Tue Jun 30 12:40:50 EDT 2009
/
11/01/19 03:53:13 ERROR namenode.NameNode: java.io.IOException:
javax.security.a
uth.login.LoginException: Login failed: Expect one token as the result of
whoami
: sairam\arun k

*/




Looks like Hadoop doesn't expect a space in the user name. You could 
file a bug, though I doubt anyone is going to sit down and fix it other 
than you, that being the way of open source.


Re: How to replace Jetty-6.1.14 with Jetty 7 in Hadoop?

2011-01-19 Thread Steve Loughran

On 18/01/11 19:58, Koji Noguchi wrote:

Try moving up to v 6.1.25, which should be more straightforward.


FYI, when we tried 6.1.25, we got hit by a deadlock.
http://jira.codehaus.org/browse/JETTY-1264

Koji


Interesting. Given that there is now 6.1.26 out, that would be the one 
to play with.


Thanks for the heads up, I will move my code up to the .26 release,

-steve


Re: Why Hadoop is slow in Cloud

2011-01-17 Thread Steve Loughran

On 17/01/11 04:11, Adarsh Sharma wrote:

Dear all,

Yesterday I performed a kind of testing between *Hadoop on Standalone
Servers* and *Hadoop in the Cloud*.

I established a Hadoop cluster of 4 nodes (standalone machines) in
which one node acts as Master (Namenode, Jobtracker) and the remaining
nodes act as slaves (Datanodes, Tasktrackers).
On the other hand, for testing Hadoop in the *Cloud* (Eucalyptus), I made
one standalone machine the *Hadoop Master*, and the slaves are configured
on VMs in the Cloud.

I am confused about the stats obtained after the testing. What I
concluded is that the VMs are giving half the performance compared with
standalone servers.


Interesting stats, nothing that massively surprises me, especially as 
your benchmarks are very much streaming through datasets. If you were 
doing something more CPU intensive (graph work, for example), things 
wouldn't look so bad


I've done stuff in this area.
http://www.slideshare.net/steve_l/farming-hadoop-inthecloud





I expected some slowdown, but never at this level. Is this genuine, or
might there be some configuration problem?

I am using 1Gb (10-1000Mb/s) LAN for the VM machines and 100Mb/s for the
standalone servers.

Please have a look on the results and if interested comment on it.




The big killer here is file IO: with today's HDD controllers and virtual 
filesystems, disk IO in a VM is way underpowered compared to physical disk IO. 
Networking is reduced (but improving), and CPU can be pretty good, but 
disk is bad.



Why?

1.  Every access to a block in the VM is turned into virtual disk 
controller operations which are then interpreted by the VDC and turned 
into reads/writes in the virtual disk drive


2. which is turned into seeks, reads and writes in the physical hardware.

Some workarounds

-allocate physical disks for the HDFS filesystem, for the duration of 
the VMs.


-have the local hosts serve up a bit of their filesystem on a fast 
protocol (like NFS), and have every VM mount the local physical NFS 
filestore as their hadoop data dirs.




Re: How to replace Jetty-6.1.14 with Jetty 7 in Hadoop?

2011-01-17 Thread Steve Loughran

On 16/01/11 09:41, xiufeng liu wrote:

Hi,

In my cluster, Hadoop somehow cannot work, and I found that it was due to
Jetty 6.1.14, which is not able to start up. However, Jetty 7 can work in
my cluster. Does anybody know how to replace Jetty 6.1.14 with Jetty 7?

Thanks
afancy



The switch to Jetty 7 will not be easy, and I wouldn't encourage you to 
do it unless you want to get into editing the Hadoop source and retesting 
everything.



Try moving up to v 6.1.25, which should be more straightforward. Replace 
the JAR, QA the cluster with some terasorting.


Re: TeraSort question.

2011-01-13 Thread Steve Loughran

On 11/01/11 16:40, Raj V wrote:

Ted


Thanks. I have all the graphs I need; they include the map reduce timeline and system 
activity for all the nodes while the sort was running. I will publish them once 
I have them in some presentable format.

For legal reasons, I really don't want to send the complete job history files.

My question is still this: when running terasort, would the CPU, disk and 
network utilization of all the nodes be more or less similar, or completely 
different?


They can be different. The JT pushes out work to machines when they 
report in, some may get more work than others, so generate more local 
data. This will have follow-on consequences. In a live system things are 
different as the work tends to follow the data, so machines with (or 
near) the data you need get the work.


It's a really hard thing to say whether the cluster is working right; when 
bringing it up, everyone is really guessing about expected performance.


-Steve


Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?

2011-01-13 Thread Steve Loughran

On 13/01/11 08:34, li ping wrote:

That is also my concern. Is it efficient for data transmission?


It's long lived TCP connections, reasonably efficient for bulk data 
xfer, has all the throttling of TCP built in, and comes with some 
excellently debugged client and server code in the form of jetty and 
httpclient. In maintenance costs alone, those libraries justify HTTP 
unless you have a vastly superior option *and are willing to maintain it 
forever*


FTP's limits are well known (security), NFS's limits are well known (security, 
the UDP version doesn't throttle), and self-developed protocols will have 
whatever problems you want.


There are better protocols for long-haul data transfer over fat pipes, 
such as GridFTP and PhEDEx ( 
http://www.gridpp.ac.uk/papers/ah05_phedex.pdf ), which use multiple TCP 
channels in parallel to reduce the impact of a single lost packet, but 
within a datacentre you shouldn't have to worry about this. If you do 
find lots of packets getting lost, raise the issue with the networking team.


-Steve



On Thu, Jan 13, 2011 at 4:27 PM, Nan Zhuzhunans...@gmail.com  wrote:


Hi, all

I have a question about the file transmission between the Map and Reduce stages.
In the current implementation, the Reducers get the results generated by Mappers
through HTTP GET. I don't understand why HTTP was selected -why not FTP, or
a self-developed protocol?

Is it just because HTTP is simple?

thanks

Nan









Re: Hadoop Certification Progamme

2010-12-15 Thread Steve Loughran

On 09/12/10 03:40, Matthew John wrote:

Hi all,

Is there any valid Hadoop Certification available? Something which adds
credibility to your Hadoop expertise.



Well, there's always providing enough patches to the code to get commit 
rights :)


Re: Hadoop/Elastic MR on AWS

2010-12-15 Thread Steve Loughran

On 10/12/10 06:14, Amandeep Khurana wrote:

Mark,

Using EMR makes it very easy to start a cluster and add/reduce capacity as
and when required. There are certain optimizations that make EMR an
attractive choice as compared to building your own cluster out. Using EMR
also ensures you are using a production quality, stable system backed by the
EMR engineers. You can always use bootstrap actions to put your own tweaked
version of Hadoop in there if you want to do that.

Also, you don't have to tear down your cluster after every job. You can set
the alive option when you start your cluster and it will stay there even
after your Hadoop job completes.

If you face any issues with EMR, send me a mail offline and I'll be happy to
help.



How different is your distro from the apache version?


Re: Question from a Desperate Java Newbie

2010-12-15 Thread Steve Loughran
On 10/12/10 09:08, Edward Choi wrote:
 I was wrong. It wasn't because of the read-once free policy. I tried 
 again with Java first, and this time it didn't work.
 I looked up Google and found the HttpClient you mentioned. It is the one 
 provided by Apache, right? I guess I will have to try that one now. Thanks!
 

httpclient is good, HtmlUnit has a very good client that can simulate
things like a full web browser with cookies, but that may be overkill.

NYT's read-once policy uses cookies to verify that you are there for the
first day without being logged in; on later days you get 302'd unless you delete
the cookie, so stateful clients are bad.

What you may have been hit by is whatever robot trap they have -if you
generate too much load and don't follow the robots.txt rules they may
detect this and push back
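
For what it's worth, a minimal fetch with commons-httpclient 3.x that 
deliberately throws cookies away looks something like this (an untested sketch; 
the 4.x API is different):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.cookie.CookiePolicy;
import org.apache.commons.httpclient.methods.GetMethod;

public class StatelessFetch {
  public static void main(String[] args) throws Exception {
    HttpClient client = new HttpClient();
    GetMethod get = new GetMethod(args[0]);
    // Stay stateless: ignore any cookies the site hands back.
    get.getParams().setCookiePolicy(CookiePolicy.IGNORE_COOKIES);
    try {
      int status = client.executeMethod(get);
      System.out.println("HTTP " + status);
      if (status == 200) {
        System.out.println(get.getResponseBodyAsString());
      }
    } finally {
      get.releaseConnection();
    }
  }
}

And whatever client you use, rate-limit your fetches and honour robots.txt, or 
you'll trip exactly the kind of robot trap described above.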



Re: Hadoop/Elastic MR on AWS

2010-12-15 Thread Steve Loughran

On 09/12/10 18:57, Aaron Eng wrote:

Pros:
- Easier to build out and tear down clusters vs. using physical machines in
a lab
- Easier to scale up and scale down a cluster as needed

Cons:
- Reliability.  In my experience I've had machines die, had machines fail to
start up, had network outages between Amazon instances, etc.  These problems
have occurred at a far more significant rate than in any physical lab I have
ever administered.
- Money. You get charged for problems with their system.  Need to add
storage space to a node?  That means renting space from EBS which you then
need to actually spend time formatting to ext3 so you can use it with
Hadoop.  So every time you want to use storage, you're paying Amazon to
format it because you can't tell EBS that you want an ext3 volume.
- Visibility.  Amazon loves to report that all their services are working
properly on their website; meanwhile, the reality is that they only report
issues if they are extremely major.  Just yesterday they reported increased
latency on their us-east-1 region.  In reality, increased latency meant
50% of my Amazon API calls were timing out, I could not create new
instances, and for about 2 hours I could not destroy the instances I had
already spun up.  How's that for ya?  Paying them for machines that they
won't let me terminate...



That's the harsh reality of all VMs: you need to monitor and stamp on 
things that misbehave. The nice thing is it's easy to do this -just poll the 
HTTP status pages and kill any VM that misbehaves.
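
The crudest possible version of that probe, just to show the idea -an untested 
sketch using the 0.20 default status ports; the actual kill step depends on 
whatever API your infrastructure offers:

import java.net.HttpURLConnection;
import java.net.URL;

public class NodeProbe {
  // True if the worker's status page answers 200 within a few seconds.
  static boolean isHealthy(String statusUrl) {
    try {
      HttpURLConnection conn =
          (HttpURLConnection) new URL(statusUrl).openConnection();
      conn.setConnectTimeout(5000);
      conn.setReadTimeout(5000);
      return conn.getResponseCode() == 200;
    } catch (Exception e) {
      return false;
    }
  }

  public static void main(String[] args) {
    for (String host : args) {
      // 50060 = tasktracker status page, 50075 = datanode; 0.20 defaults.
      String url = "http://" + host + ":50060/";
      System.out.println(host + (isHealthy(url)
          ? " ok" : " unresponsive - candidate for termination"));
    }
  }
}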


This is not a fault of EC2: any VM infra has this feature. You can't 
control where your VMs come up, you are penalised by other CPU-heavy 
machines on the same server, and Amazon throttles the smaller machines a bit.


But you
 -don't pay for cluster time you don't need
 -don't pay for ingress/egress for data you generate in the vendor's 
infrastructure (just storage)

 -can be very agile with cluster size.

I have a talk on this topic for the curious, discussing a UI that is a 
bit more agile, but even there we deploy agents to every node to keep an 
eye on the state of the cluster.


http://www.slideshare.net/steve_l/farming-hadoop-inthecloud
http://blip.tv/file/3809976

Hadoop is designed to work well in a large-scale static cluster: fixed 
machines, with the reaction of clients to server failure -spin- and that of 
servers to misbehaving clients -blacklist them- being the right ones to leave 
ops in control. In a virtual world you want the clients to see (somehow) 
if the master nodes have moved, and you want the servers to kill the 
misbehaving VMs to save money and then create new ones.


-Steve

