RE: risks of using Hadoop
Amen to that. I haven't heard a good rant in a long time; I am definitely amused and entertained. As a veteran of 3 years with Hadoop I will say that the SPOF issue is whatever you want to make it. But it has not, nor will it ever, deter me from using this great system. Every system has its risks, and they can be minimized by careful architectural crafting and intelligent usage. Bill -Original Message- From: Michael Segel [mailto:michael_se...@hotmail.com] Sent: Wednesday, September 21, 2011 1:48 PM To: common-user@hadoop.apache.org Subject: RE: risks of using Hadoop Kobina, Points 1 and 2 are definitely real risks. SPOF is not. As I pointed out in my mini-rant to Tom, your end users / developers who use the cluster can do more harm to your cluster than a SPOF machine failure. I don't know what one would consider a 'long learning curve'. With the adoption of any new technology, you're talking at least 3-6 months, depending on the individual and the overall complexity of the environment. Take anyone who is a strong developer, put them through Cloudera's training, plus some play time, and you've shortened the learning curve. The better the Java developer, the easier it is for them to pick up Hadoop. I would also suggest taking the approach of hiring a senior person who can cross-train and mentor your staff. This too will shorten the runway. HTH -Mike > Date: Wed, 21 Sep 2011 17:02:45 +0100 > Subject: Re: risks of using Hadoop > From: kobina.kwa...@gmail.com > To: common-user@hadoop.apache.org > > Jignesh, > > Will your point 2 still be valid if we hire very experienced Java > programmers? > > Kobina. > > On 20 September 2011 21:07, Jignesh Patel wrote: > > > > > @Kobina > > 1. Lack of skill set > > 2. Longer learning curve > > 3. Single point of failure > > > > > > @Uma > > I am curious to know about .20.2: is that stable? 
Is it the same as the one you > > mention in your email (Federation changes)? If I need a scaled NameNode and > > append support, which version should I choose? > > > > Regarding the single point of failure, I believe Hortonworks (a.k.a. Yahoo) is > > updating the Hadoop API. When will that be integrated with Hadoop? > > > > If I need > > > > > > -Jignesh > > > > On Sep 17, 2011, at 12:08 AM, Uma Maheswara Rao G 72686 wrote: > > > > > Hi Kobina, > > > > > > Some experiences which may be helpful for you with respect to DFS. > > > > > > 1. Selecting the correct version. > > > I would recommend using the 0.20X version. This is a pretty stable version > > and other organizations prefer it. Well tested as well. > > > Don't go for the 21 version. That version is not stable; it is a risk. > > > > > > 2. You should perform thorough tests with your customer operations. > > > (of course you will do this :-)) > > > > > > 3. The 0.20x version has the problem of SPOF. > > > If the NameNode goes down you will lose the data. One way of recovering is > > by using the SecondaryNameNode. You can recover the data up to the last > > checkpoint, but manual intervention is required. > > > In the latest trunk, SPOF will be addressed by HDFS-1623. > > > > > > 4. 0.20x NameNodes cannot scale. Federation changes are included in later > > versions (I think in 22). This may not be a problem for your cluster, > > but please consider this aspect as well. > > > > > > 5. Please select the Hadoop version depending on your security > > requirements. There are versions available with security as well in 0.20X. > > > > > > 6. If you plan to use HBase, it requires append support. 20Append has the > > support for append. The 0.20.205 release will also have append support but is not > > yet released. Choose your version carefully to avoid sudden surprises. 
> > > > > > > > > > > > Regards, > > > Uma > > > - Original Message - > > > From: Kobina Kwarko > > > Date: Saturday, September 17, 2011 3:42 am > > > Subject: Re: risks of using Hadoop > > > To: common-user@hadoop.apache.org > > > > > >> We are planning to use Hadoop in my organisation for quality of > > >> services analysis out of CDR records from mobile operators. We are > > >> thinking of having > > >> a small cluster of maybe 10-15 nodes and I'm preparing the > > >> proposal. My > > >> office requires that I provide some risk analysis in the proposal. > > >> > > >> thank you. > > >> > > >> On 16 September 2011 20:34, Uma Maheswara Rao G 72686 > > >> wrote: > > >> > > >>> Hello, > > >>> > > >>> First of all, where are you planning to use Hadoop? > > >>> > > >>> Regards, > > >>> Uma > > >>> - Original Message - > > >>> From: Kobina Kwarko > > >>> Date: Saturday, September 17, 2011 0:41 am > > >>> Subject: risks of using Hadoop > > >>> To: common-user > > >>> > > Hello, > > > > Please can someone point out some of the risks we may incur if we > > decide to > > implement Hadoop? > > > > BR, > > > > Isaac. > > > > >>> > > >> > > > >
Re: Fundamental question
These questions are usually answered once you start using the system, but I'll provide some quick answers. 1. Hadoop uses the local file system at each node to store blocks. The only part of the system that needs to be formatted is the namenode, which is where Hadoop keeps track of the logical HDFS filesystem image that contains the directory structure, the files, and the datanodes where they reside. A file in HDFS is a sequence of blocks. When the file has a replication factor (usually 3), each block has 3 exact copies that reside at different datanodes. This is important to remember for your second question. 2. The notion of processing locally is simply that map/reduce will process a file at different nodes by reading the blocks that are located at each node. So if you have 3 copies of the same block at different nodes, the system can pick nodes where it can process those blocks locally. In order to process the entire file, map/reduce runs parallel tasks that process the blocks locally at each node. Once you have data in the HDFS cluster it is not necessary to move things around; the framework does that transparently. An example might help: say a file has blocks 1,2,3,4 which are replicated across 3 datanodes (A,B,C). Due to replication there is a copy of each block residing at each node. When the map/reduce job is started by the jobtracker, it begins a task at each node: A will process blocks 1 & 2, B will process block 3, and C will process block 4. All these tasks run in parallel, so if you are handling a terabyte+ file there is a big reduction in processing time. Each task writes its map/reduce output to a specific output directory (in this case 3 files), which can be input to the next map/reduce job. I hope this brief answer is helpful and provides some insight. 
Bill - Original Message - From: "Vijay Rao" To: Sent: Sunday, May 09, 2010 2:49 AM Subject: Fundamental question Hello, I am just reading about and trying to understand Hadoop and all the other components. However I have a fundamental question for which I am not getting answers in any of the online material that is out there. 1) If Hadoop is used, then all the slaves and other machines in the cluster need to be formatted with the HDFS file system. If so, what happens to the terabytes of data that need to be crunched? Or is the data on a different machine? 2) Everywhere it is mentioned that the main advantage of map/reduce and Hadoop is that it runs on data that is available locally. So does this mean that once the file system is formatted I have to move my terabytes of data and split them across the cluster? Thanks VJ
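The locality story in the answer above can be sketched as a toy scheduler. This is a hedged illustration only: the block map, node names, slot counts, and the `assign_tasks` helper are all invented for the example; the real assignment logic lives in the jobtracker and is far more involved.

```python
# Toy model of locality-aware task assignment: every block is replicated on
# several datanodes, so the scheduler can usually pick a node that already
# holds a local copy of the block it wants processed.
replicas = {                      # block id -> datanodes holding a copy
    1: {"A", "B", "C"},
    2: {"A", "B", "C"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C"},
}

def assign_tasks(replicas, free_slots):
    """Greedily hand each block to a node that holds a replica and has a free slot."""
    assignment = {}
    for block, nodes in sorted(replicas.items()):
        local = [n for n in sorted(nodes) if free_slots.get(n, 0) > 0]
        if local:                         # data-local: the task reads from local disk
            node = local[0]
            free_slots[node] -= 1
            assignment[block] = node
    return assignment

# Mirrors the email's example: A takes blocks 1 & 2, B takes 3, C takes 4,
# and every read is local because each node holds a replica of each block.
tasks = assign_tasks(replicas, {"A": 2, "B": 1, "C": 1})
```

With full replication, every block finds a node with a local copy, which is the whole point of moving computation to the data rather than the data to the computation.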
RE: How can I synchronize writing to an hdfs file
I had a similar requirement. HDFS has no locking that I am aware of; at least I have never run across it in reading the source. My solution was to build a distributed locking mechanism using ZooKeeper. You might want to visit http://hadoop.apache.org/zookeeper/docs/current/recipes.html for some ideas. The code you find there is a start, but buggy. Bill -Original Message- From: Raymond Jennings III [mailto:raymondj...@yahoo.com] Sent: Friday, May 07, 2010 10:32 AM To: common-user@hadoop.apache.org Subject: How can I synchronize writing to an hdfs file I want to write to a common HDFS file from within my map method. Given that each task runs in a separate JVM (on separate machines), making a method synchronized will not work, I assume. Are there any file locking or other methods to guarantee mutual exclusion on HDFS? (I want to append to this file and I have the append option turned on.) Thanks.
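The ZooKeeper lock recipe referenced above boils down to an ordering rule that can be shown without a ZooKeeper cluster. A rough sketch (the znode naming and the `lock_order` helper are invented for illustration; in the real recipe each client creates an ephemeral *sequential* znode under a lock parent and watches the node immediately before its own):

```python
# Pure-Python sketch of the ZooKeeper lock recipe's core rule: the client
# holding the lowest sequence number owns the lock; every other client
# watches only its predecessor, so a release wakes exactly one waiter.
def lock_order(znodes):
    """znodes: names like 'lock-0000000007'. Returns (holder, watch map)."""
    ordered = sorted(znodes, key=lambda z: int(z.rsplit("-", 1)[1]))
    holder = ordered[0]
    # each waiter watches its predecessor -> no thundering herd on release
    watches = {ordered[i]: ordered[i - 1] for i in range(1, len(ordered))}
    return holder, watches

holder, watches = lock_order(
    ["lock-0000000010", "lock-0000000007", "lock-0000000012"]
)
```

Because the znodes are ephemeral, a client that crashes releases its place automatically, which is what makes the recipe safe for mutual exclusion across JVMs and machines.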
RE: why does 'jps' lose track of hadoop processes?
Sounds like your pid files are getting cleaned out of whatever directory they are being written to (maybe periodic cleanup of a temp directory?). Look at (taken from hadoop-env.sh): # The directory where pid files are stored. /tmp by default. # export HADOOP_PID_DIR=/var/hadoop/pids The hadoop shell scripts look in the directory that is defined. Bill -Original Message- From: Raymond Jennings III [mailto:raymondj...@yahoo.com] Sent: Monday, March 29, 2010 11:37 AM To: common-user@hadoop.apache.org Subject: why does 'jps' lose track of hadoop processes? After running hadoop for some period of time, the command 'jps' fails to report any hadoop process on any node in the cluster. The processes are still running, as can be seen with 'ps -ef | grep java'. In addition, scripts like stop-dfs.sh and stop-mapred.sh no longer find the processes to stop.
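A small simulation of what is probably happening (the paths and the pid file name, roughly hadoop-<user>-<daemon>.pid, are illustrative; the real scripts read files under $HADOOP_PID_DIR): once something removes the pid file, the daemon keeps running but stop-dfs.sh has nothing to look up.

```python
# Illustrative: why the stop scripts lose a daemon when a tmp cleaner
# removes its pid file. The JVM is untouched; only the bookkeeping is gone.
import os
import tempfile

pid_dir = tempfile.mkdtemp()          # stands in for HADOOP_PID_DIR (default /tmp)
pid_file = os.path.join(pid_dir, "hadoop-demo-namenode.pid")
with open(pid_file, "w") as f:
    f.write("12345\n")

def pid_known_to_scripts(path):
    """The stop scripts only see daemons whose pid file still exists."""
    return os.path.exists(path)

before = pid_known_to_scripts(pid_file)   # True: scripts can find and stop the daemon
os.remove(pid_file)                       # e.g. tmpwatch cleaning /tmp
after = pid_known_to_scripts(pid_file)    # False: JVM still runs, scripts are blind
```

Moving HADOOP_PID_DIR to a directory outside /tmp, as the hadoop-env.sh comment suggests, avoids the problem entirely.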
RE: Why must I wait for NameNode?
At startup, the namenode goes into 'safe' mode to wait for all data nodes to send block reports on the data they are holding. This is normal for Hadoop and necessary to make sure all replicated data is accounted for across the cluster. It is the nature of the beast to work this way, for good reasons. Bill -Original Message- From: Nick Klosterman [mailto:nklos...@ecn.purdue.edu] Sent: Friday, March 19, 2010 1:21 PM To: common-user@hadoop.apache.org Subject: Why must I wait for NameNode? What is the namenode doing upon startup? I have to wait about 1 minute and watch for the namenode dfs usage to drop from 100%; otherwise the install is unusable. Is this typical? Is something wrong with my install? I've been attempting the pseudo-distributed tutorial example for a while, trying to get it to work. I finally discovered that the namenode upon startup is 100% in use and I need to wait about 1 minute before I can use it. Is this typical of Hadoop installations? This isn't entirely clear in the tutorial; I believe a note should be added if this is typical. This error caused me to get "WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: SOMEFILE could only be replicated to 0 nodes, instead of 1" I had written a script to do all of the steps right in a row. Now, with a 1 minute wait, things work. Is my install atypical, or am I doing something wrong that is causing this needed wait time? Thanks, Nick
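The safe-mode behavior described above can be modeled in a few lines. This is a hedged sketch: `in_safe_mode` is an invented helper, and the 0.999 default is meant to mirror the dfs.safemode.threshold.pct setting of that era; the real namenode counts blocks arriving in datanode block reports.

```python
# Rough model of safe-mode exit: the namenode stays read-only until the
# fraction of blocks reported by datanodes reaches a configured threshold.
def in_safe_mode(reported_blocks, total_blocks, threshold=0.999):
    """True while not enough block reports have arrived for safe-mode exit."""
    if total_blocks == 0:
        return False                     # empty namespace: nothing to wait for
    return reported_blocks / float(total_blocks) < threshold

early = in_safe_mode(500, 1000)          # just after startup: still waiting
ready = in_safe_mode(999, 1000)          # threshold reached: safe mode exits
```

This is also why writes during that first minute fail with "could only be replicated to 0 nodes": until enough reports arrive, the namenode will not hand out datanodes for new blocks.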
RE: Wrong FS
This problem has been around for a long time. Hadoop picks up the local host name for the namenode, and that name will be used in all URI checks. You cannot mix IP addresses and host names. This is especially a problem on Solaris and AIX systems, where I ran into it. You don't need to set up DNS; just use the hostname in your URIs. I did some patches for this for 0.18 but have not redone them for 0.20. Bill -Original Message- From: Edson Ramiro [mailto:erlfi...@gmail.com] Sent: Monday, February 22, 2010 8:18 AM To: common-user@hadoop.apache.org Subject: Wrong FS Hi all, I'm getting this error [had...@master01 hadoop-0.20.1 ]$ ./bin/hadoop jar hadoop-0.20.1-examples.jar pi 1 1 Number of Maps = 1 Samples per Map = 1 Wrote input for Map #0 Starting Job java.lang.IllegalArgumentException: Wrong FS: hdfs://10.0.0.101:9000/system/job_201002221311_0001, expected: hdfs://master01:9000 [...] Do I need to set up a DNS? All my nodes are OK and the NameNode isn't in safe mode. Any idea? Thanks in advance. Edson Ramiro
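The failure above comes down to a literal comparison of URI authorities. A simplified model (the `same_fs` helper is invented for illustration; Hadoop's real check also handles default ports and other cases, but the essence is the same):

```python
# Why "Wrong FS": the filesystem check compares scheme and authority as
# strings, so an IP and a hostname that resolve to the same machine still
# fail the check.
from urllib.parse import urlsplit

def same_fs(path_uri, default_fs):
    """Simplified check: scheme and authority must match exactly."""
    a, b = urlsplit(path_uri), urlsplit(default_fs)
    return (a.scheme, a.netloc) == (b.scheme, b.netloc)

mismatch = same_fs("hdfs://10.0.0.101:9000/system/job_201002221311_0001",
                   "hdfs://master01:9000")      # the error in the email
match = same_fs("hdfs://master01:9000/system/job_201002221311_0001",
                "hdfs://master01:9000")          # consistent authority: passes
```

Which is exactly why the advice above is to use one form, hostname or IP, everywhere and never mix them.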
RE: Hadoop on a Virtualized O/S vs. the Real O/S
In my shop we also did certification on different operating platforms. This was done on virtualized machines for all the Linux variants. We ran the Apache Hadoop unit tests in each environment and then checked the results. Overall Hadoop runs well, but some of the more bizarre unit tests will react strangely. You will likely see the same issues as we did... 1. Some networking APIs behave slightly differently between Linux and Solaris/AIX environments. 2. Windows will encounter many failed tests under cygwin, and not in a consistent manner. Sometimes a test will work and other times it won't. I suspect this is because cygwin is not a perfect simulation and race conditions cause different reactions, depending on the phase of the moon. Oh well, Windows is not for production anyway. Bill -Original Message- From: Stephen Watt [mailto:sw...@us.ibm.com] Sent: Monday, February 08, 2010 2:58 PM To: common-user@hadoop.apache.org Subject: Hadoop on a Virtualized O/S vs. the Real O/S Hi Folks, I need to be able to certify that Hadoop works on various operating systems. I do this by running it through a series of tests. As I'm sure you can empathize, obtaining all the machines for each test run can sometimes be tricky. It would be easier for me if I could spin up several instances of a virtual image of the desired O/S, but to do this, I need to know if there are any risks I'm running with that approach. Is there any reason why Hadoop might work differently on a virtual O/S as opposed to running on an actual O/S? Since just about everything is done through the JVM and SSH, I don't foresee any issues, and I don't believe we're doing anything weird with device drivers or have any kernel module dependencies. Kind regards, Steve Watt
RE: setup cluster with cloudera repo
So you have hadoop installed but not configured/running. I suggest you visit the Hadoop website and review the QuickStart guide. You need to understand how to configure the system and then extrapolate to your situation. Bill -Original Message- From: Jim Kusznir [mailto:jkusz...@gmail.com] Sent: Wednesday, February 03, 2010 2:08 PM To: common-user Subject: setup cluster with cloudera repo Hi all: I need to set up a hadoop cluster. The cluster is based on CentOS 5.4, and I already have all the base OSes installed. I saw that Cloudera had a hadoop repo for CentOS, so I set up that repo and installed hadoop via yum. Unfortunately, I'm now at the "now what?" question. Cloudera's website has many links to "configure your cluster" or "continue", but those take one to a page saying "we're redoing it, come back later". This leaves me with no documentation to follow to actually make this cluster work. How do I proceed? Thanks! --Jim
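For orientation while the Cloudera pages are down: a 0.20-era pseudo-distributed setup needs only a few properties. This is a hedged sketch based on the Apache QuickStart; the hostnames and ports are placeholders, and Cloudera's packages may split these across core-site.xml, hdfs-site.xml, and mapred-site.xml rather than a single hadoop-site.xml.

```xml
<!-- minimal single-node sketch; adjust hostnames/ports/replication for a real cluster -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

After that, the QuickStart's sequence is format the namenode, start the daemons, and run one of the bundled examples to confirm the cluster works.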
RE: Google has obtained the patent over mapreduce
It is likely that Google filed the patent as a matter of record for their own protection, to make sure someone else could not do the same and put them at risk of a patent violation suit. Bill -Original Message- From: 松柳 [mailto:lamfeeli...@gmail.com] Sent: Wednesday, January 20, 2010 3:04 PM To: common-user@hadoop.apache.org Subject: Re: Google has obtained the patent over mapreduce Just want to ask: how about AWS? Many services/programs running on AWS are based on the M/R mechanism. Does this mean the owners of this software may be targeted legally? How about Amazon itself? Song 2010/1/20 Ravi > Do you mean to say companies like Yahoo and Facebook are taking a risk? > > On Wed, Jan 20, 2010 at 11:06 PM, Edward Capriolo >wrote: > > On Wed, Jan 20, 2010 at 12:23 PM, Raymond Jennings III > > wrote: > > > I am not a patent attorney either, but for what it's worth - many times > a > > patent is sought solely to protect a company from being sued by > another. > > So even though Hadoop is out there, it could be the case that Google has > no > > intent of suing anyone who uses it - they just wanted to protect > themselves > > from someone else claiming it as their own and then suing Google. But > yes, > > the patent system clearly has problems, as you stated. > > > > > > --- On Wed, 1/20/10, Edward Capriolo wrote: > > > > > >> From: Edward Capriolo > > >> Subject: Re: Google has obtained the patent over mapreduce > > >> To: common-user@hadoop.apache.org > > >> Date: Wednesday, January 20, 2010, 12:09 PM > > >> Interesting situation. > > >> > > >> I try to compare mapreduce to the camera. Let's argue Google > > >> is Kodak, > > >> Apache is Polaroid, and MapReduce is a camera. Imagine > > >> Kodak invented > > >> the camera privately, never sold it to anyone, but produced > > >> some > > >> document describing what a camera did. > > >> > > >> Polaroid followed the document and produced a camera and > > >> sold it > > >> publicly. 
Kodak later patents a camera, even though no one > > >> outside of > > >> Kodak can confirm Kodak ever made a camera before > > >> Polaroid. > > >> > > >> Not saying that is what happened here, but Google releasing > > >> the GFS > > >> PDF was a large factor in causing hadoop to happen. > > >> Personally, it > > >> seems like they gave away too much information before they > > >> had the > > >> patent. > > >> > > >> The patent system faces many problems, including this 'back > > >> to the > > >> future' issue: it takes so long to get a patent that no > > >> one can wait, > > >> and by the time a patent is issued there are already multiple > > >> viable > > >> implementations of it. > > >> > > >> I am no patent lawyer or anything, but I notice the phrase > > >> "master > > >> process" all over the claims. Maybe if a piece of software > > >> (hadoop) > > >> had a "distributed process" that would be sufficient to say > > >> hadoop > > >> technology does not infringe on this patent. > > >> > > >> I think it would be interesting to look deeply at each > > >> claim and > > >> determine if hadoop could be designed to not infringe on > > >> these > > >> patents, to deal with what-if scenarios. > > >> > > >> > > >> > > >> On Wed, Jan 20, 2010 at 11:29 AM, Ravi < ravindra.babu.rav...@gmail.com> > > >> wrote: > > >> > Hi, > > >> > I too read about that news. I don't think that it > > >> will be any problem. > > >> > However, Google didn't invent the model. > > >> > > > >> > Thanks. > > >> > > > >> > On Wed, Jan 20, 2010 at 9:47 PM, Udaya Lakshmi > > >> wrote: > > >> > > > >> >> Hi, > > >> >> As a user of hadoop, is there anything to > > >> worry about Google obtaining > > >> >> the patent over mapreduce? > > >> >> > > >> >> Thanks. > > >> >> > > >> > > > >> > > > > > > > > > > > > > > > > @Raymond > > > > Yes, I agree with you. > > > > As we have learned from SCO->Linux, corporate users can become the > > target of legal action, not the technology vendor. 
This could scare a > > large corporation away from using Hadoop: they take a risk, knowing > > that they could be targeted just for using the software. > > >
RE: Why DrWho
Amen. Running shell commands from within Hadoop by invoking bash is not what I would consider a good thing. I had to do a patch some time back because the DF command produced different output on AIX, which caused Hadoop to think it didn't have any disk space. I heartily second the notion of an operating system abstraction layer. Bill -Original Message- From: Allen Wittenauer [mailto:awittena...@linkedin.com] Sent: Thursday, December 17, 2009 5:48 PM To: common-user@hadoop.apache.org Subject: Re: Why DrWho On 12/17/09 1:36 PM, "Edward Capriolo" wrote: > In a nutshell, this is the same problem you face with shell scripting: > assuming external binary files exist, assuming they take a set of > arguments, assuming they produce a result code, assuming the output is > formatted in a specific way. Yup. There was a JIRA posted the other day about a shell command breaking on Mac OS X (the stat command). I suspect the same break happens on other BSD environments. Ironically, Solaris has GNU stat, so that particular shell-out worked just fine. Every time we issue a fork(), we risk breaking an OS. I really wish we'd give more weight to building some sort of compatibility layer.
SequenceFileAsBinaryOutputFormat for M/R
Referring to the Hadoop 0.20.1 API: SequenceFileAsBinaryOutputFormat requires JobConf, but JobConf is deprecated. Is there another OutputFormat I should be using? Bill
RE: Hadoop on Windows
It's interesting that Hadoop, being written entirely in Java, has such a spotty reputation running on different platforms. I had to patch it to run on AIX, and it needs cygwin (gack!) to run on Windows. I'm surprised nobody has thought about removing its use of bash to run system commands (which is NOT especially portable). Now that Hadoop comes only in a Java 1.6 flavor, why can't it figure out disk space using the native Java runtime instead of executing the DF command under bash? Of course it runs other system commands as well, which in my opinion isn't too cool. Bill -Original Message- From: Steve Loughran [mailto:ste...@apache.org] Sent: Thursday, September 17, 2009 12:53 PM To: common-user@hadoop.apache.org Subject: Re: Hadoop on Windows brien colwell wrote: > Our cygwin/windows nodes are picky about the machines they work on. On > some they are unreliable. On some they work perfectly. > > We've had two main issues with cygwin nodes. > > Hadoop resolves paths in strange ways, so for example /dir is > interpreted as c:/dir, not %cygwin_home%/dir. For SSH to a cygwin node, > /dir is interpreted as %cygwin_home%/dir. So our maintenance scripts > have to make a distinction between cygwin and linux to adjust for > Hadoop's path behavior. > That's exactly the same as any Java File instance would work on windows: new File("/dir") would map to c:/dir. As the Ant team say in their docs: "We get lots of support calls from Cygwin users. Either it is incredibly popular, or it is trouble. If you do use it, remember that Java is a Windows application, so Ant is running in a Windows process, not a Cygwin one. This will save us having to mark your bug reports as invalid."
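On the df point above: runtimes of that vintage and later can indeed ask the OS directly. Java 6 added File.getUsableSpace()/getTotalSpace(); as a stand-in for the same idea, here is the Python 3 equivalent (shutil.disk_usage, available since 3.3), queried without forking a shell.

```python
# Disk space straight from the OS: no fork(), no bash, no parsing of
# platform-specific `df` output (the AIX problem mentioned above).
import shutil

usage = shutil.disk_usage(".")        # named tuple: (total, used, free), in bytes
free_fraction = usage.free / usage.total
```

Either approach sidesteps an entire class of portability bugs, which is the argument for an OS abstraction layer in the first place.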
RE: IP address or host name
The problem resolved by HADOOP-5191 involves client connections to the name node and has nothing to do with connections between the master and slaves. In my configuration I use IP addresses exclusively; mixing hostnames and IP addresses leads to all kinds of problems. Bill -Original Message- From: Nelson, William [mailto:wne...@email.uky.edu] Sent: Monday, August 24, 2009 12:26 PM To: common-user@hadoop.apache.org Subject: IP address or host name I'm new to hadoop. I'm running 0.19.2 on a CentOS 5.2 cluster. I have been having problems with the nodes connecting to the master (even when the firewall is off) using the hostname in hadoop-site.xml, but they will connect using the IP address. This is also true trying to connect to port 9000 with telnet. If I start hadoop with hostnames in hadoop-site.xml, I get Connection refused. When I use IP addresses in hadoop-site.xml I can connect with telnet using either the IP address or the hostname. The datanode running on the master node can connect with either the IP address or the hostname in hadoop-site.xml. I have found this problem posted a couple of times but have not found the answer yet. Datanodes on slaves can't connect, but the datanode on master can connect: fs.default.name hdfs://master.com:9000 Everybody can connect: fs.default.name hdfs://192.68.42.221:9000 Unfortunately, using IP addresses creates another problem when I try to run the job: Wrong FS exception. Previous posts refer to https://issues.apache.org/jira/browse/HADOOP-5191 but it appears the workaround is to switch back to host names, which I can't get to work. Thanks in advance for any help. Bill