RE: risks of using Hadoop

2011-09-21 Thread Bill Habermaas
Amen to that. I haven't heard a good rant in a long time; I am definitely 
amused and entertained. 

As a veteran of 3 years with Hadoop I will say that the SPOF issue is whatever 
you want to make it. But it has not, nor will it ever, deter me from using this 
great system. Every system has its risks, and they can be minimized by careful 
architectural crafting and intelligent usage. 

Bill

-Original Message-
From: Michael Segel [mailto:michael_se...@hotmail.com] 
Sent: Wednesday, September 21, 2011 1:48 PM
To: common-user@hadoop.apache.org
Subject: RE: risks of using Hadoop


Kobina

The points 1 and 2 are definitely real risks. SPOF is not.

As I pointed out in my mini-rant to Tom, the end users / developers who use the 
cluster can do more harm to it than a SPOF machine failure.

I don't know what one would consider a 'long learning curve'. With the adoption 
of any new technology, you're talking at least 3-6 months based on the 
individual and the overall complexity of the environment. 

Take anyone who is a strong developer, put them through Cloudera's training, 
plus some play time, and you've shortened the learning curve.
The better the Java developer, the easier it is for them to pick up Hadoop.

I would also suggest taking the approach of hiring a senior person who can 
cross train and mentor your staff. This too will shorten the runway.

HTH

-Mike


> Date: Wed, 21 Sep 2011 17:02:45 +0100
> Subject: Re: risks of using Hadoop
> From: kobina.kwa...@gmail.com
> To: common-user@hadoop.apache.org
> 
> Jignesh,
> 
> Will your point 2 still be valid if we hire very experienced Java
> programmers?
> 
> Kobina.
> 
> On 20 September 2011 21:07, Jignesh Patel  wrote:
> 
> >
> > @Kobina
> > 1. Lack of skill set
> > 2. Longer learning curve
> > 3. Single point of failure
> >
> >
> > @Uma
> > I am curious about 0.20.2 - is it stable? Is it the same as the one you
> > mention in your email (the Federation changes)? If I need a scalable NameNode
> > and append support, which version should I choose?
> >
> > Regarding the single point of failure, I believe Hortonworks (a.k.a. Yahoo) is
> > updating the Hadoop API. When will that be integrated with Hadoop?
> >
> > If I need
> >
> >
> > -Jignesh
> >
> > On Sep 17, 2011, at 12:08 AM, Uma Maheswara Rao G 72686 wrote:
> >
> > > Hi Kobina,
> > >
> > > Some experiences which may helpful for you with respective to DFS.
> > >
> > > 1. Selecting the correct version.
> > >    I recommend using the 0.20.x version. It is a pretty stable version
> > and most other organizations prefer it. It is well tested as well.
> > > Don't go for the 0.21 version. It is not a stable release, so it is a risk.
> > >
> > > 2. You should perform thorough test with your customer operations.
> > >  (of-course you will do this :-))
> > >
> > > 3. 0.20x version has the problem of SPOF.
> > >   If the NameNode goes down you will lose data. One way of recovering is
> > by using the SecondaryNameNode; you can recover the data up to the last
> > checkpoint, but manual intervention is required.
> > > In the latest trunk, the SPOF will be addressed by HDFS-1623.
> > >
> > > 4. 0.20x NameNodes cannot scale. Federation changes are included in later
> > versions (I think in 0.22). This may not be a problem for your cluster,
> > but please consider this aspect as well.
> > >
> > > 5. Please select the hadoop version depending on your security
> > requirements. There are 0.20.x versions available with security support as well.
> > >
> > > 6. If you plan to use HBase, it requires append support. The 0.20-append
> > branch has support for append. The 0.20.205 release will also have append
> > support but is not yet released. Choose the correct version to avoid sudden surprises.
> > >
> > >
> > >
> > > Regards,
> > > Uma
> > > - Original Message -
> > > From: Kobina Kwarko 
> > > Date: Saturday, September 17, 2011 3:42 am
> > > Subject: Re: risks of using Hadoop
> > > To: common-user@hadoop.apache.org
> > >
> > >> We are planning to use Hadoop in my organisation for quality of
> > >> service analysis of CDR records from mobile operators. We are
> > >> thinking of having a small cluster of maybe 10-15 nodes and I'm
> > >> preparing the proposal. My office requires that I provide some
> > >> risk analysis in the proposal.
> > >>
> > >> thank you.
> > >>
> > >> On 16 September 2011 20:34, Uma Maheswara Rao G 72686
> > >> wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> First of all where you are planning to use Hadoop?
> > >>>
> > >>> Regards,
> > >>> Uma
> > >>> - Original Message -
> > >>> From: Kobina Kwarko 
> > >>> Date: Saturday, September 17, 2011 0:41 am
> > >>> Subject: risks of using Hadoop
> > >>> To: common-user 
> > >>>
> >  Hello,
> > 
> >  Please can someone point some of the risks we may incur if we
> >  decide to
> >  implement Hadoop?
> > 
> >  BR,
> > 
> >  Isaac.
> > 
> > >>>
> > >>
> >
> >



Re: Fundamental question

2010-05-09 Thread Bill Habermaas
These questions are usually answered once you start using the system but 
I'll provide some quick answers.


1. Hadoop uses the local file system at each node to store blocks. The only 
part of the system that needs to be formatted is the namenode, which is where 
Hadoop keeps track of the logical HDFS filesystem image that contains the 
directory structure, the files, and the datanodes where they reside. A file in 
HDFS is a sequence of blocks. When a file has a replication factor of 3 (the 
usual default), each block has 3 exact copies that reside on different 
datanodes. This is important to remember for your second question.
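
A small sketch that makes this picture concrete (assuming the 0.20-era 
FileSystem API and a hypothetical file path): it prints each block of a file 
and the datanodes holding its replicas.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up fs.default.name
    FileSystem fs = FileSystem.get(conf);
    FileStatus stat = fs.getFileStatus(new Path("/data/bigfile"));  // hypothetical file
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (BlockLocation b : blocks) {
      // One line per block: its offset/length and the datanodes holding replicas.
      System.out.println("offset " + b.getOffset() + " len " + b.getLength()
          + " hosts " + java.util.Arrays.toString(b.getHosts()));
    }
  }
}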


2. The notion of processing locally is simply that map/reduce will process a 
file at different nodes by reading the blocks that are located at that 
location.  So if you have 3 copies of the same block at different nodes, 
then the system can pick nodes where it can process those blocks locally. In 
order to process the entire file, map/reduce runs parallel tasks that 
process the blocks locally at each node.  Once you have data in the HDFS 
cluster it is not necessary to move things around.  The framework does that 
transparently. An example might help: say a file has blocks 1,2,3,4 which are 
replicated across 3 datanodes (A,B,C).  Due to replication there is a copy 
of each block residing at each node. When the map/reduce job is started by 
the jobtracker, it begins a task at each node: A will process blocks 1 & 2, 
B will process block 3, and C will process block 4.  All these tasks 
run in parallel, so if you are handling a terabyte+ file there is a big 
reduction in processing time.  Each task writes its map/reduce output to a 
specific output directory (in this case 3 files) which can be used as input to 
the next map/reduce job.
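
A minimal old-API (0.20 JobConf) driver as a sketch of that last point; the 
input and output paths are hypothetical, and with no mapper or reducer set the 
default identity classes are used. You simply point the job at data that is 
already in HDFS, and the jobtracker places each map task on (or near) a node 
holding a replica of that task's split; nothing is copied around by hand.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LocalityDemo {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LocalityDemo.class);
    conf.setJobName("locality-demo");
    // Input already lives in HDFS: splits are computed from the block
    // locations and tasks are scheduled near them.
    FileInputFormat.setInputPaths(conf, new Path("/data/bigfile"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/bigfile-out"));
    JobClient.runJob(conf);
  }
}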


I hope this brief answer is helpful and provides some insight.

Bill



- Original Message - 
From: "Vijay Rao" 

To: 
Sent: Sunday, May 09, 2010 2:49 AM
Subject: Fundamental question



Hello,

I am just reading and understanding Hadoop and all the other components.
However I have a fundamental question for which I am not getting answers in 
any of the online material that is out there.

1) If Hadoop is used, do all the slaves and other machines in the cluster
need to be formatted with the HDFS file system? If so, what happens to the
terabytes of data that need to be crunched? Or is the data on a different
machine?

2) Everywhere it is mentioned that the main advantage of map/reduce and
Hadoop is that it runs on data that is available locally. So does this mean
that once the file system is formatted I have to move my terabytes of data
and split them across the cluster?

Thanks
VJ





RE: How can I synchronize writing to an hdfs file

2010-05-07 Thread Bill Habermaas
I had a similar requirement. HDFS has no locking that I am aware of; at
least I have never run across it in reading the source. My solution was to
build a distributed locking mechanism using ZooKeeper.  You might want to
visit http://hadoop.apache.org/zookeeper/docs/current/recipes.html
for some ideas.  The code you find there is a start, but buggy. 
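
A heavily simplified sketch of the kind of ZooKeeper lock recipe that page
describes. The lock root path is hypothetical and must already exist as a
persistent znode, and the real recipe handles session expiry, retries, and
the herd effect far more carefully.

import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleZkLock {
  private static final String LOCK_ROOT = "/locks/hdfs-writer";  // hypothetical, pre-created
  private final ZooKeeper zk;
  private String myNode;

  public SimpleZkLock(ZooKeeper zk) { this.zk = zk; }

  public void lock() throws KeeperException, InterruptedException {
    // Ephemeral-sequential node: it disappears automatically if this client dies.
    myNode = zk.create(LOCK_ROOT + "/lock-", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    while (true) {
      List<String> children = zk.getChildren(LOCK_ROOT, false);
      Collections.sort(children);
      if ((LOCK_ROOT + "/" + children.get(0)).equals(myNode)) {
        return;  // lowest sequence number: we hold the lock
      }
      // Watch the lock root and re-check when something changes (or after a
      // timeout). This wakes every waiter, which is fine for a sketch.
      final CountDownLatch changed = new CountDownLatch(1);
      zk.getChildren(LOCK_ROOT, new Watcher() {
        public void process(WatchedEvent event) { changed.countDown(); }
      });
      changed.await(5, TimeUnit.SECONDS);
    }
  }

  public void unlock() throws KeeperException, InterruptedException {
    zk.delete(myNode, -1);  // -1 means "any version"
  }
}

A caller would wrap the HDFS write between lock() and unlock() in a
try/finally block.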

Bill 

-Original Message-
From: Raymond Jennings III [mailto:raymondj...@yahoo.com] 
Sent: Friday, May 07, 2010 10:32 AM
To: common-user@hadoop.apache.org
Subject: How can I synchronize writing to an hdfs file

I want to write to a common HDFS file from within my map method.  Given that
each task runs in a separate JVM (on separate machines), making a method
synchronized will not work, I assume.  Are there any file locking or other
methods to guarantee mutual exclusion on HDFS?

(I want to append to this file and I have the append option turned on.)
Thanks.


  




RE: why does 'jps' lose track of hadoop processes ?

2010-03-29 Thread Bill Habermaas
Sounds like your pid files are getting cleaned out of whatever directory
they are being written to (maybe a periodic cleanup of a temp directory?). 

Look at (taken from hadoop-env.sh):
# The directory where pid files are stored. /tmp by default.
# export HADOOP_PID_DIR=/var/hadoop/pids

The hadoop shell scripts look in the directory that is defined.

Bill

-Original Message-
From: Raymond Jennings III [mailto:raymondj...@yahoo.com] 
Sent: Monday, March 29, 2010 11:37 AM
To: common-user@hadoop.apache.org
Subject: why does 'jps' lose track of hadoop processes ?

After running hadoop for some period of time, the command 'jps' fails to
report any hadoop process on any node in the cluster.  The processes are
still running as can be seen with 'ps -ef|grep java'

In addition, scripts like stop-dfs.sh and stop-mapred.sh no longer find the
processes to stop.


  




RE: Why must I wait for NameNode?

2010-03-19 Thread Bill Habermaas
At startup, the namenode goes into 'safe' mode to wait for all data nodes to
send block reports on data they are holding.  This is normal for hadoop and
necessary to make sure all replicated data is accounted for across the
cluster.  It is the nature of the beast to work this way for good reasons. 
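
If you want a client to block until the namenode leaves safe mode, a minimal
sketch along these lines works with the 0.20-era API (the class and enum
names reflect that version and may differ in later releases):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.FSConstants;

public class WaitForSafeModeExit {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // SAFEMODE_GET only queries the current state; it does not change it.
      while (dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET)) {
        System.out.println("Namenode still in safe mode, waiting...");
        Thread.sleep(5000);
      }
    }
    System.out.println("Namenode has left safe mode.");
  }
}

From the shell, "hadoop dfsadmin -safemode wait" accomplishes the same thing.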

Bill

-Original Message-
From: Nick Klosterman [mailto:nklos...@ecn.purdue.edu] 
Sent: Friday, March 19, 2010 1:21 PM
To: common-user@hadoop.apache.org
Subject: Why must I wait for NameNode?

What is the namenode doing upon startup? I have to wait about 1 minute 
and watch for the namenode dfs usage to drop from 100%, otherwise the install 
is unusable. Is this typical? Is something wrong with my install?

I've been attempting the pseudo-distributed tutorial example for a 
while trying to get it to work.  I finally discovered that the namenode 
upon startup is 100% in use and I need to wait about 1 minute before I 
can use it. Is this typical of hadoop installations?

This isn't entirely clear in the tutorial.  I believe a note should 
be added if this is typical.  This error caused me to get "WARN 
org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: SOMEFILE could 
only be replicated to 0 nodes, instead of 1"

I had written a script to do all of the steps right in a row.  Now with a 
1 minute wait things work. Is my install atypical, or am I doing something 
wrong that is causing this needed wait time?

Thanks,
Nick




RE: Wrong FS

2010-02-22 Thread Bill Habermaas
This problem has been around for a long time. Hadoop picks up the local host
name for the namenode and it will be used in all URI checks.  You cannot mix
IP and host addresses. This is especially a problem on Solaris and AIX
systems, where I ran into it.  You don't need to set up DNS; just use the
hostname in your URIs. I did some patches for this for 0.18 but have not
redone them for 0.20. 
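
A minimal sketch of that advice: derive paths from the configured
fs.default.name (via FileSystem.get) instead of hard-coding the IP, so the
URI authority always matches what the namenode expects. The job path below
is the one from the error message above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class QualifyPath {
  public static void main(String[] args) throws Exception {
    // Reads fs.default.name from the config, e.g. hdfs://master01:9000
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/system/job_201002221311_0001");
    // Prints the path with the namenode's own scheme and authority,
    // e.g. hdfs://master01:9000/system/job_201002221311_0001
    System.out.println(fs.makeQualified(p));
  }
}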

Bill

-Original Message-
From: Edson Ramiro [mailto:erlfi...@gmail.com] 
Sent: Monday, February 22, 2010 8:18 AM
To: common-user@hadoop.apache.org
Subject: Wrong FS

Hi all,

I'm getting this error

[had...@master01 hadoop-0.20.1 ]$ ./bin/hadoop jar
hadoop-0.20.1-examples.jar pi 1 1
Number of Maps  = 1
Samples per Map = 1
Wrote input for Map #0
Starting Job
java.lang.IllegalArgumentException: Wrong FS: hdfs://
10.0.0.101:9000/system/job_201002221311_0001, expected: hdfs://master01:9000

[...]

Do I need to set up a DNS ?

All my nodes are ok and the NameNode isn't in safe mode.

Any Idea?

Thanks in Advance.

Edson Ramiro




RE: Hadoop on a Virtualized O/S vs. the Real O/S

2010-02-08 Thread Bill Habermaas
In my shop we also did certification on different operating platforms. This
was done on virtualized machines for all the Linux variants.  We ran the
Apache hadoop unit tests in each environment and then checked the results.
Overall hadoop runs well but some of the more bizarre lunatic unit tests
will react strangely. 

You will likely see the same issues as we did...

1. Some networking APIs behave slightly differently between Linux and
Solaris/AIX environments. 
2. Windows will encounter many failed tests under cygwin, and not in a
consistent manner.  Sometimes a test will work and other times it won't.  I
suspect this is because cygwin is not a perfect simulation and race conditions
cause different reactions - depending on the phase of the moon. Oh well,
Windows is not for production anyway. 

Bill

-Original Message-
From: Stephen Watt [mailto:sw...@us.ibm.com] 
Sent: Monday, February 08, 2010 2:58 PM
To: common-user@hadoop.apache.org
Subject: Hadoop on a Virtualized O/S vs. the Real O/S

Hi Folks

I need to be able to certify that Hadoop works on various operating 
systems. I do this by running it through a series of tests. As 
I'm sure you can empathize, obtaining all the machines for each test run 
can sometimes be tricky. It would be easier for me if I could spin up 
several instances of a virtual image of the desired O/S, but to do this, I 
need to know if there are any risks I'm running with that approach.

Is there any reason why Hadoop might work differently on a virtual O/S as 
opposed to running on an actual O/S ? Since just about everything is done 
through the JVM and SSH I don't foresee any issues and I don't believe 
we're doing anything weird with device drivers or have any kernel module 
dependencies.

Kind regards
Steve Watt




RE: setup cluster with cloudera repo

2010-02-03 Thread Bill Habermaas
So you have hadoop installed and not configured/running.
I suggest you visit the hadoop website and review the QuickStart guide. 
You need to understand how to configure the system and then extrapolate to
your situation. 

Bill

-Original Message-
From: Jim Kusznir [mailto:jkusz...@gmail.com] 
Sent: Wednesday, February 03, 2010 2:08 PM
To: common-user
Subject: setup cluster with cloudera repo

Hi all:

I need to set up a hadoop cluster.  The cluster is based on CentOS
5.4, and I already have all the base OSes installed.

I saw that Cloudera had a repo for hadoop on CentOS, so I set up that
repo and installed hadoop via yum.  Unfortunately, I'm now at the
"now what?" question.  Cloudera's website has many links to "configure
your cluster" or "continue", but those take you to a page saying
"we're redoing it, come back later".  This leaves me with no
documentation to follow to actually make this cluster work.

How do I proceed?

Thanks!
--Jim




RE: Google has obtained the patent over mapreduce

2010-01-20 Thread Bill Habermaas
It is likely that Google filed the patent as a matter of record for their own 
protection - to make sure someone else could not do the same and put them at 
risk for a patent violation suit. 

Bill

-Original Message-
From: 松柳 [mailto:lamfeeli...@gmail.com] 
Sent: Wednesday, January 20, 2010 3:04 PM
To: common-user@hadoop.apache.org
Subject: Re: Google has obtained the patent over mapreduce

Just want to ask, how about AWS? Many services/programs running on AWS are
based on the M/R mechanism.
Does this mean the owners of these software packages may be targeted legally?
How about Amazon itself?

Song

2010/1/20 Ravi 

> Do you mean to say companies like Yahoo and Facebook are taking a risk?
>
> > On Wed, Jan 20, 2010 at 11:06 PM, Edward Capriolo wrote:
>
> > On Wed, Jan 20, 2010 at 12:23 PM, Raymond Jennings III
> >  wrote:
> > > I am not a patent attorney either but for what it's worth - many times
> > > a patent is sought solely to protect a company from being sued by
> > > another.  So even though Hadoop is out there it could be the case that
> > > Google has no intent of suing anyone who uses it - they just wanted to
> > > protect themselves from someone else claiming it as their own and then
> > > suing Google.  But yes, the patent system clearly has problems as you
> > > stated.
> > >
> > > --- On Wed, 1/20/10, Edward Capriolo  wrote:
> > >
> > >> From: Edward Capriolo 
> > >> Subject: Re: Google has obtained the patent over mapreduce
> > >> To: common-user@hadoop.apache.org
> > >> Date: Wednesday, January 20, 2010, 12:09 PM
> > >> Interesting situation.
> > >>
> > >> I try to compare mapreduce to a camera. Let's say Google
> > >> is Kodak,
> > >> Apache is Polaroid, and MapReduce is a camera. Imagine
> > >> Kodak invented
> > >> the camera privately, never sold it to anyone, but produced
> > >> some
> > >> document describing what a camera did.
> > >>
> > >> Polaroid followed the document and produced a camera and
> > >> sold it
> > >> publicly. Kodak later patents a camera, even though no one
> > >> outside of
> > >> Kodak can confirm Kodak ever made a camera before
> > >> Polaroid.
> > >>
> > >> Not saying that is what happened here, but google releasing
> > >> the GFS
> > >> pdf was a large factor in causing hadoop to happen.
> > >> Personally, it
> > >> seems like they gave away too much information before they
> > >> had the
> > >> patent.
> > >>
> > >> The patent system faces many problems including this 'back
> > >> to the
> > >> future' issue, where it takes so long to get a patent that no
> > >> one can wait;
> > >> by the time a patent is issued there are already multiple
> > >> viable
> > >> implementations of it.
> > >>
> > >> I am no patent lawyer or anything, but I notice the phrase
> > >> "master
> > >> process" all over the claims. Maybe if a piece of software
> > >> (hadoop)
> > >> had a "distributed process" that would be sufficient to say
> > >> hadoop
> > >> technology does not infringe on this patent.
> > >>
> > >> I think it would be interesting to look deeply at each
> > >> claim and
> > >> determine if hadoop could be designed to not infringe on
> > >> these
> > >> patents, to deal with what if scenarios.
> > >>
> > >>
> > >>
> > >> On Wed, Jan 20, 2010 at 11:29 AM, Ravi <
> ravindra.babu.rav...@gmail.com>
> > >> wrote:
> > >> > Hi,
> > >> >  I too read about that news. I don't think that it
> > >> will be any problem.
> > >> > However Google didn't invent the model.
> > >> >
> > >> > Thanks.
> > >> >
> > >> > On Wed, Jan 20, 2010 at 9:47 PM, Udaya Lakshmi 
> > >> wrote:
> > >> >
> > >> >> Hi,
> > >> >>   As an user of hadoop, Is there anything to
> > >> worry about Google obtaining
> > >> >> the patent over mapreduce?
> > >> >>
> > >> >> Thanks.
> > >> >>
> > >> >
> > >>
> > >
> > >
> > >
> > >
> >
> > @Raymond
> >
> > Yes. I agree with you.
> >
> > As we have learned from SCO vs. Linux, corporate users can become the
> > target of legal action, not the technology vendor. This could scare a
> > large corporation away from using hadoop. They take a risk knowing
> > that they could be targeted just for using the software.
> >
>




RE: Why DrWho

2009-12-17 Thread Bill Habermaas
Amen. 

Running shell commands within Hadoop by invoking bash is not what I would
consider a good thing. I had to do a patch sometime back because the DF
command produced different output on AIX, which caused Hadoop to think it
didn't have any disk space. I heartily second the notion of an operating
system abstraction layer. 

Bill 

-Original Message-
From: Allen Wittenauer [mailto:awittena...@linkedin.com] 
Sent: Thursday, December 17, 2009 5:48 PM
To: common-user@hadoop.apache.org
Subject: Re: Why DrWho




On 12/17/09 1:36 PM, "Edward Capriolo"  wrote:
> In a nutshell, this is the same problem you face with shell scripting,
> assuming external binary files exist. assuming they take a set of
> arguments, assuming they produce a result code, assuming the output is
> formatted in a specific way.

Yup.  There was a JIRA posted the other day about a shell command break on
Mac OS X (the stat command).   I suspect the same break happens on other BSD
environments.  Ironically, Solaris has GNU stat, so that particular shell
out worked just fine.

Every time we issue a fork(), we risk breaking an OS.  I really wish we'd
give more weight to building some sort of compatibility layer.





SequenceFileAsBinaryOutputFormat for M/R

2009-09-21 Thread Bill Habermaas
Referring to Hadoop 0.20.1 API. 

 

SequenceFileAsBinaryOutputFormat requires JobConf but JobConf is deprecated.


 

Is there another OutputFormat I should be using ?

 

Bill 

 

 



RE: Hadoop on Windows

2009-09-17 Thread Bill Habermaas
It's interesting that Hadoop, being written entirely in Java, has such a
spotty reputation running on different platforms. I had to patch it to run
on AIX and need cygwin (gack!) so it will run on Windows. I'm surprised
nobody has thought about removing its use of bash to run system commands
(which is NOT especially portable). Now that Hadoop comes only in a
Java 1.6 flavor, why can't it figure out disk space using the native Java
runtime instead of executing the df command under bash? Of course it runs
other system commands as well, which in my opinion isn't too cool. 
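
The native alternative alluded to here has existed since Java 6: java.io.File
can report disk space directly, without shelling out to df. A minimal sketch:

import java.io.File;

public class DiskSpace {
  public static void main(String[] args) {
    File dir = new File(args.length > 0 ? args[0] : ".");
    // All three methods were added in Java 6 and return bytes.
    System.out.println("total  : " + dir.getTotalSpace());
    System.out.println("free   : " + dir.getFreeSpace());
    System.out.println("usable : " + dir.getUsableSpace());  // respects quotas/permissions
  }
}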

Bill

-Original Message-
From: Steve Loughran [mailto:ste...@apache.org] 
Sent: Thursday, September 17, 2009 12:53 PM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop on Windows

brien colwell wrote:
> Our cygwin/windows nodes are picky about the machines they work on. On
> some they are unreliable. On some they work perfectly.
> 
> We've had two main issues with cygwin nodes.
> 
> Hadoop resolves paths in strange ways, so for example /dir is
> interpreted as c:/dir not %cygwin_home%/dir. For SSH to a cygwin node,
> /dir is interpreted as %cygwin_home%/dir. So our maintenance scripts
> have to make a distinction between cygwin and linux to adjust for
> Hadoop's path behavior.
> 

That's exactly the same as any Java File instance would work on windows, 
new File("/dir") would map to c:/dir.

As the Ant team say in their docs
"We get lots of support calls from Cygwin users. Either it is incredibly 
popular, or it is trouble. If you do use it, remember that Java is a 
Windows application, so Ant is running in a Windows process, not a 
Cygwin one. This will save us having to mark your bug reports as invalid. "




RE: IP address or host name

2009-08-24 Thread Bill Habermaas
The problem resolved by HADOOP-5191 involves client connection to the name
node and has nothing to do with connections between the master and slave. 

In my configuration I use IP addresses exclusively.  Mixing hostnames and IP
addresses leads to all kinds of problems. 

Bill 

-Original Message-
From: Nelson, William [mailto:wne...@email.uky.edu] 
Sent: Monday, August 24, 2009 12:26 PM
To: common-user@hadoop.apache.org
Subject: IP address or host name 

I'm new to hadoop.
I'm running 0.19.2 on a Centos 5.2  cluster.
I have been having problems with the nodes connecting to the master (even
when the firewall is off) using the hostname in the hadoop-site.xml, but they
will connect using the IP address.
 This is also true trying to connect to port 9000 with telnet. If I start
hadoop with hostnames in the hadoop-site.xml, I get  Connection refused.
When I use IP addresses in the hadoop-site.xml I can connect with telnet
using either the IP address or hostname.
The datanode running on the master node can connect with either IP address
or hostname in the hadoop-site.xml.
I have found this problem posted a couple of times but have not found the
answer yet.


Datanodes on slaves can't connect but the datanode on master can connect.

<property>
  <name>fs.default.name</name>
  <value>hdfs://master.com:9000</value>
</property>

Everybody can connect.

<property>
  <name>fs.default.name</name>
  <value>hdfs://192.68.42.221:9000</value>
</property>

Unfortunately  using IP addresses creates another problem when I try to run
the job: Wrong FS exception


Previous posts refer to https://issues.apache.org/jira/browse/HADOOP-5191
but it appears the workaround is to switch back to host names, which I
can't get to work.



Thanks in advance for any help.



Bill