Re: rack awareness unexpected behaviour
Marc, The rack awareness script is an artificial concept, meaning you can say which machine is in which rack, and that may or may not reflect where the machine is actually located. The idea is to balance the number of nodes in the racks, at least on paper. So you can have 14 machines in rack 1 and 16 machines in rack 2, even though there may physically be 20 machines in rack 1 and 10 machines in rack 2. HTH -Mike

On Oct 3, 2013, at 2:52 AM, Marc Sturlese marc.sturl...@gmail.com wrote: I've checked it out and it works like that. The problem is, if the two racks don't have the same capacity, one will have its disk space filled up much faster than the other (that's what I'm seeing). If one rack (rack A) has 2 servers of 8 cores with 4 reduce slots each and the other rack (rack B) has 2 servers of 16 cores with 8 reduce slots each, rack A will get filled up faster as rack B is writing more (because it has more reduce slots). Could a solution be to modify the bash script used to decide to which rack a block replica is written? It would use probability and give rack B double the chance to receive the write.

-- View this message in context: http://lucene.472066.n3.nabble.com/rack-awareness-unexpected-behaviour-tp4086029p4093270.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. Use at your own risk. Michael Segel michael_segel (AT) hotmail.com
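[Editor's note: the "paper" rack mapping Mike describes does not have to be a shell script. A minimal sketch of the same idea in Java is below; the class name and host names are hypothetical, and on 0.20.x-era releases the class is wired in via the topology.node.switch.mapping.impl property (net.topology.node.switch.mapping.impl on newer releases, which also require a reloadCachedMappings() method on the interface).]

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.hadoop.net.DNSToSwitchMapping;

    // Hypothetical example: assign hosts to "paper" racks regardless of physical location,
    // so the logical racks stay balanced even if the physical racks are not.
    public class BalancedRackMapping implements DNSToSwitchMapping {

        private static final Map<String, String> RACK_BY_HOST = new HashMap<String, String>();
        static {
            RACK_BY_HOST.put("node01.example.com", "/rack1");
            RACK_BY_HOST.put("node02.example.com", "/rack2");
            RACK_BY_HOST.put("node03.example.com", "/rack1");
            RACK_BY_HOST.put("node04.example.com", "/rack2");
        }

        public List<String> resolve(List<String> names) {
            List<String> racks = new ArrayList<String>(names.size());
            for (String name : names) {
                String rack = RACK_BY_HOST.get(name);
                racks.add(rack != null ? rack : "/default-rack"); // unknown hosts fall back
            }
            return racks;
        }
    }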
Re: how to get the time of a hadoop cluster, v0.20.2
Then you have a problem where the solution is more a matter of people management than a technical one. All of your servers should be using NTP. At a minimum, you have one server that gets the time from a national (government) time server, and then have all of the machines in that data center use that machine as their NTP server, or you can have all machines by default use the government server for NTP. You can also buy your own clock server that syncs to either GPS or national time servers via a radio signal. But you have a problem of staff that is either unwilling or unable to do their job. You can either take a carrot or a stick approach. I suggest maybe bribing them with a bottle of scotch. (That seems to be the current liquid lubricator that works universally these days, unless of course they don't drink...) HTH -Mike

On May 17, 2013, at 9:13 AM, Jane Wayne jane.wayne2...@gmail.com wrote: and please remember, i stated that although the hadoop cluster uses NTP, the server (the machine that is not a part of the hadoop cluster) cannot be assumed to be using NTP (and in fact, doesn't).

On Fri, May 17, 2013 at 10:10 AM, Jane Wayne jane.wayne2...@gmail.comwrote: "if NTP is correctly used" - that's the key statement. in several of our clusters, NTP setup is kludgy. note that the professionals administering the cluster are different from us the engineers. so, there's a lot of red tape to go through to get something, trivial or not, fixed. we have noticed that NTP is not set up correctly (using the default GMT timezone, for example). without explaining all the tedious details, this mismatch of date/time (of the hadoop cluster to the server machine) is causing some pains. i'm not sure i agree that the local OS time from your server machine will be the best estimation. that doesn't make sense. but what i want to achieve is very simple. as stated before, i just want to ask the namenode or jobtracker, hey, what date/time do you have? unfortunately for me, as niels pointed out, this query is not possible via the hadoop api. thanks for helping, though. :)

On Fri, May 17, 2013 at 10:02 AM, Bertrand Dechoux decho...@gmail.comwrote: For hadoop, 'cluster time' is the local OS time. You might want to get the time of the namenode machine but indeed if NTP is correctly used, the local OS time from your server machine will be the best estimation. If you request the time from the namenode machine, you will be penalized by the delay of your request. Regards Bertrand

On Fri, May 17, 2013 at 3:17 PM, Niels Basjes ni...@basjes.nl wrote: Hi, i have another computer (which i have referred to as a server, since it is running tomcat), and this computer is NOT a part of the hadoop cluster (it doesn't run any of the hadoop daemons), but does submit jobs to the hadoop cluster via a JEE webapp interface. i need to check that the time on this computer is in sync with the time on the hadoop cluster. when i say check that the time is in sync, there is a defined tolerance/threshold difference in date/time that i am willing to accept (e.g. the date/time should be the same down to the minute). If you ensure (using NTP) that all your servers have the same time then you can simply query your local server for the time and you have the correct answer to your question. You are searching for a solution in the Hadoop API (where this does not exist) when the solution is present at a different level. -- Best regards / Met vriendelijke groeten, Niels Basjes
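[Editor's note: if the machine outside the cluster really cannot be put under NTP, one workaround is to query the namenode host's NTP service directly and compare the offset against the allowed tolerance. This is only a sketch under two assumptions that may not hold: Apache commons-net is on the classpath, and the namenode host actually runs a reachable NTP daemon.]

    import java.net.InetAddress;
    import org.apache.commons.net.ntp.NTPUDPClient;
    import org.apache.commons.net.ntp.TimeInfo;

    public class ClockSkewCheck {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "namenode.example.com"; // hypothetical host
            NTPUDPClient client = new NTPUDPClient();
            client.setDefaultTimeout(5000);      // don't hang forever on an unreachable host
            TimeInfo info = client.getTime(InetAddress.getByName(host));
            info.computeDetails();               // fills in offset/delay fields
            Long offsetMs = info.getOffset();    // remote clock minus local clock, in ms
            long toleranceMs = 60 * 1000L;       // "same down to the minute"
            if (offsetMs != null && Math.abs(offsetMs) > toleranceMs) {
                System.err.println("Clock skew of " + offsetMs + " ms exceeds tolerance");
            } else {
                System.out.println("Clocks are within tolerance (offset=" + offsetMs + " ms)");
            }
            client.close();
        }
    }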
Re: How to handle sensitive data
Simple, have your app encrypt the field prior to writing to HDFS. Also consider HBase. On Feb 14, 2013, at 10:35 AM, abhishek abhishek.dod...@gmail.com wrote: Hi all, we have some sensitive data in some particular fields (columns). Can I know how to handle sensitive data in Hadoop? How do different people handle sensitive data in Hadoop? Thanks Abhi Michael Segel | (m) 312.755.9623 Segel and Associates
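[Editor's note: a rough illustration of the field-level encryption suggestion. This is only a sketch, not a hardened design: key management, algorithm choice, and the commons-codec Base64 dependency are all assumptions.]

    import java.security.SecureRandom;
    import javax.crypto.Cipher;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;
    import org.apache.commons.codec.binary.Base64;

    public class FieldEncryptor {
        private final SecretKeySpec key;
        private final SecureRandom random = new SecureRandom();

        public FieldEncryptor(byte[] rawKey) {          // e.g. a 16-byte AES key from a key store
            this.key = new SecretKeySpec(rawKey, "AES");
        }

        // Returns Base64(iv + ciphertext) so the value stays a printable string in the file.
        public String encryptField(String plaintext) throws Exception {
            byte[] iv = new byte[16];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] ct = cipher.doFinal(plaintext.getBytes("UTF-8"));
            byte[] out = new byte[iv.length + ct.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(ct, 0, out, iv.length, ct.length);
            return Base64.encodeBase64String(out);
        }
    }

The non-sensitive columns would be written as-is; only the sensitive ones are passed through encryptField() before the record lands in HDFS.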
Re: NN Memory Jumps every 1 1/2 hours
Hey, silly question... How long have you had 27 million files? I mean, can you correlate the number of files to the spate of OOMs? Even without problems... I'd say it would be a good idea to upgrade due to the probability of a lot of code fixes... If you're running anything pre 1.x, going to 1.7 Java wouldn't be a good idea. Having said that... outside of MapR, have any of the distros certified themselves on 1.7 yet?

On Dec 22, 2012, at 6:54 AM, Edward Capriolo edlinuxg...@gmail.com wrote: I will give this a go. I have actually gone into JMX and manually triggered GC; no memory is returned. So I assumed something was leaking.

On Fri, Dec 21, 2012 at 11:59 PM, Adam Faris afa...@linkedin.com wrote: I know this will sound odd, but try reducing your heap size. We had an issue like this where GC kept falling behind and we either ran out of heap or would be in full GC. By reducing heap, we were forcing concurrent mark sweep to occur and avoided both full GC and running out of heap space, as the JVM would collect objects more frequently.

On Dec 21, 2012, at 8:24 PM, Edward Capriolo edlinuxg...@gmail.com wrote: I have an old Hadoop 0.20.2 cluster. Have not had any issues for a while (which is why I never bothered with an upgrade). Suddenly it OOMed last week. Now the OOMs happen periodically. We have a fairly large NameNode heap, Xmx 17GB. It is a fairly large FS, about 27,000,000 files. So the strangest thing is that every 1 and 1/2 hours the NN memory usage increases until the heap is full. http://imagebin.org/240287 We tried failing over the NN to another machine. We changed the Java version from 1.6_23 to 1.7.0. I have set the NameNode logs to debug and ALL, and I have done the same with the data nodes. Secondary NN is running and shipping edits and making new images. I am thinking something has corrupted the NN metadata and after enough time it becomes a time bomb, but this is just a total shot in the dark. Does anyone have any interesting troubleshooting ideas?
Re: Disks RAID best practice
Oleg, that's for an overall raid preference. Specifically for the 'control nodes' aka (NN, SN, JT, HM, ZK...) I tend to just use simple mirroring because these processes are not really I/O bound. (RAID-1). I guess you could go RAID-10 (Stripe and Mirrored) but that may be a little overkill and my preference comes from working in the RDBMS world. If we are using commodity servers, JBOD tends to be the preferred way of handling things. However, I've seen cases where people will use RAIDed Drives on a node for a couple of reasons. The nice thing about doing mirrored DN drives is that if you have a disk failure you just pop the drive and replace it. Much simpler. If we're looking at using NetApp's E Series in conjunction with a compute cluster, then you are using their raided configuration and can reduce the cluster's replication factor to 2 from 3. While its easy to recommend RAID on the control nodes, data nodes is a bit trickier. I mean you can run with straight JBOD and based on a cost issue, its the cheapest in terms of hardware. If you go with RAID on the DN, you reduce your storage density per node because you have redundancy in hardware. And this has an impact on your overall machine density and TCO. This is offset by easier and faster recovery time from some hardware failure events. Lets face it, the number one thing to fail is going to be your hard drives. So we are going to have to balance the costs against the benefits. Now I have to state the obvious caveats... 1) YMMV, 2) The factors which go in to the cluster design decision are going to be unique to the company setting up the cluster. These are IMHO, and you know what they say about opinions... ;-) HTH -Mike On Nov 1, 2012, at 7:52 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Do you mean RAID 10 for Master Node? What about DataNode? Thanks Oleg. On Thu, Nov 1, 2012 at 2:43 PM, Michael Segel michael_se...@hotmail.comwrote: I prefer RAID 10, but some say RAID 6. I thought NetApp used RAID 6 ? Its definitely an interesting discussion point though. -Mike On Nov 1, 2012, at 7:37 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi , What is the best practice for DISKS RAID (Master and Data Nodes). Thanks in advance Oleg.
Re: measuring iops
You have two issues. 1) You need to know the throughput in terms of data transfer between disks and controller cards on the node. 2) The actual network throughput of having all of the nodes talking to one another as fast as they can. This will let you see your real limitations in the ToR Switch's fabric. Not sure why you really want to do this except to test the disk, disk controller, and then networking infrastructure of your ToR and then your backplane to connect multiple racks HTH -Mike On Oct 23, 2012, at 7:47 AM, Ravi Prakash ravi...@ymail.com wrote: Do you mean in a cluster being used by users, or as a benchmark to measure the maximum? The JMX page nn:port/jmx provides some interesting stats, but I'm not sure they have what you want. And I'm unaware of other tools which could. From: Rita rmorgan...@gmail.com To: common-user@hadoop.apache.org; Ravi Prakash ravi...@ymail.com Sent: Monday, October 22, 2012 6:46 PM Subject: Re: measuring iops Is it possible to know how many reads and writes are occurring thru the entire cluster in a consolidated manner -- this does not include replication factors. On Mon, Oct 22, 2012 at 10:28 AM, Ravi Prakash ravi...@ymail.com wrote: Hi Rita, SliveTest can help you measure the number of reads / writes / deletes / ls / appends per second your NameNode can handle. DFSIO can be used to help you measure the amount of throughput. Both these tests are actually very flexible and have a plethora of options to help you test different facets of performance. In my experience, you actually have to be very careful and understand what the tests are doing for the results to be sensible. HTH Ravi From: Rita rmorgan...@gmail.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Monday, October 22, 2012 7:23 AM Subject: Re: measuring iops Anyone? On Sun, Oct 21, 2012 at 8:30 AM, Rita rmorgan...@gmail.com wrote: Hi, Was curious if there was a method to measure the total number of IOPS (I/O operations per second) on a HDFS cluster. -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.--
Re: reg hadoop on AWS
Actually the best way to do it is to use EMR since everything is then built for you. On Oct 5, 2012, at 7:21 AM, Nitin Pawar nitinpawar...@gmail.com wrote: Hi Sudha, best way to use hadoop on aws is via whirr Thanks, nitin On Fri, Oct 5, 2012 at 4:45 PM, sudha sadhasivam sudhasadhasi...@yahoo.com wrote: Sir We tried to setup hadoop on AWS. The procedure is given. We face problem with the parameters needed for input and output files. Can somebody provide us with a sample exercise with steps for working on hadoop in AWS? thanking you Dr G Sudha -- Nitin Pawar
Re: Which hardware to choose
Well... If you're not running HBase, you're less harmed by minimal swapping so you could push the number of slots and over subscribe. The only thing I would have to suggest is that you monitor your system closely as you adjust the number of slots. You have to admit though, its fun to tune the cluster. :-) On Oct 3, 2012, at 12:09 PM, J. Rottinghuis jrottingh...@gmail.com wrote: Of course it all depends... But something like this could work: Leave 1-2 GB for the kernel, pagecache, tools, overhead etc. Plan 3-4 GB for Datanode and Tasktracker each Plan 2.5-3 GB per slot. Depending on the kinds of jobs, you may need more or less memory per slot. Have 2-3 times as many mappers as reducers (depending on the kinds of jobs you run). As Micheal pointed out the ratio of cores (hyperthreads) per disk matters. With those initial rules of thumb you'd arrive somewhere between 10 mappers + 5 reducers and 9 mappers + 4 reducers Try, test, measure, adjust, rinse, repeat. Cheers, Joep On Tue, Oct 2, 2012 at 8:42 PM, Alexander Pivovarov apivova...@gmail.comwrote: All configs are per node. No HBase, only Hive and Pig installed On Tue, Oct 2, 2012 at 9:40 PM, Michael Segel michael_se...@hotmail.com wrote: I think he's saying that its 24 maps 8 reducers per node and at 48GB that could be too many mappers. Especially if they want to run HBase. On Oct 2, 2012, at 8:14 PM, hadoopman hadoop...@gmail.com wrote: Only 24 map and 8 reduce tasks for 38 data nodes? are you sure that's right? Sounds VERY low for a cluster that size. We have only 10 c2100's and are running I believe 140 map and 70 reduce slots so far with pretty decent performance. On 10/02/2012 12:55 PM, Alexander Pivovarov wrote: 38 data nodes + 2 Name Nodes Data Node: Dell PowerEdge C2100 series 2 x XEON x5670 48 GB RAM ECC (12x4GB 1333MHz) 12 x 2 TB 7200 RPM SATA HDD (with hot swap) JBOD Intel Gigabit ET Dual port PCIe x4 Redundant Power Supply Hadoop CDH3 max map tasks 24 max reduce tasks 8
Re: Which hardware to choose
I think he's saying that its 24 maps 8 reducers per node and at 48GB that could be too many mappers. Especially if they want to run HBase. On Oct 2, 2012, at 8:14 PM, hadoopman hadoop...@gmail.com wrote: Only 24 map and 8 reduce tasks for 38 data nodes? are you sure that's right? Sounds VERY low for a cluster that size. We have only 10 c2100's and are running I believe 140 map and 70 reduce slots so far with pretty decent performance. On 10/02/2012 12:55 PM, Alexander Pivovarov wrote: 38 data nodes + 2 Name Nodes Data Node: Dell PowerEdge C2100 series 2 x XEON x5670 48 GB RAM ECC (12x4GB 1333MHz) 12 x 2 TB 7200 RPM SATA HDD (with hot swap) JBOD Intel Gigabit ET Dual port PCIe x4 Redundant Power Supply Hadoop CDH3 max map tasks 24 max reduce tasks 8
Re: Learning about different distributions of Hadoop
Now that's a loaded question. I'm going to plead the 5th because no matter how I answer it, I will probably piss someone off. ;-P They all have their own respective strengths and weaknesses. (Like that's stopped me before. ;-) -Mike On Aug 8, 2012, at 10:53 AM, Harit Himanshu harit.subscripti...@gmail.com wrote: Hello I have a very basic question - There are various flavors of hadoop by Apache, Cloudera, MapR, HortonWorks(may be more I am not aware of). I would like to learn what are the differences between these distributions and how do I know which distribution is best for me? I am current using Apache Hadoop Thank you + Harit
Re: Learning about different distributions of Hadoop
Well Scott kinda side stepped MapR in his response. ;-) On Aug 8, 2012, at 11:20 AM, Serge Blazhiyevskyy serge.blazhiyevs...@nice.com wrote: I agree with Scoot! The first question to answer is Do you have a problem with your current distribution? Thanks Serge On 8/8/12 9:17 AM, Scott Fines scottfi...@gmail.com wrote: That's a bit like asking people what the best Linux Distro is..they all serve (mostly) the same function, and you're likely to start a religious war by stating their differences. The main point running through all the different flavors of Hadoop is that they are all Hadoop. The differences only come from the chosen patch sets, which are all open-sourced anyway. At least in theory, you could rebuild Cloudera/Hortonworks/whatever just by applying the right sequences of patch sets to core Hadoop. The real question is: Are you happy with what you are currently using? If so, why worry about it? If not, why are you unhappy? Answering that question is likely to give you the guidance you would like in terms of what flavor you wish to pick. Scott On Aug 8, 2012, at 11:10 AM, Michael Segel wrote: Now that's a loaded question. I'm going to plead the 5th because no matter how I answer it, I will probably piss someone off. ;-P They all have their own respective strengths and weaknesses. (Like that's stopped me before. ;-) -Mike On Aug 8, 2012, at 10:53 AM, Harit Himanshu harit.subscripti...@gmail.com wrote: Hello I have a very basic question - There are various flavors of hadoop by Apache, Cloudera, MapR, HortonWorks(may be more I am not aware of). I would like to learn what are the differences between these distributions and how do I know which distribution is best for me? I am current using Apache Hadoop Thank you + Harit
Re: Learning about different distributions of Hadoop
Uhm... I'd take that report with a grain of salt. ;-) On Aug 8, 2012, at 4:16 PM, Alberto Ortiz aort...@gmail.com wrote: I recently came across this Forrester report; it may give you more information to consider for a decision: http://www.forrester.com/dl/The+Forrester+Wave+Enterprise+Hadoop+Solutions+Q1+2012/-/E-RES60755/PDF?oid=1-KCJIQAaction=PDFobjectid=RES60755 Regards On Wed, Aug 8, 2012 at 2:32 PM, Harit Himanshu harit.subscripti...@gmail.com wrote: nah! I am happy with Apache Hadoop as I am learning, but I saw so many distributions that I wanted to know about them. On Aug 8, 2012, at 9:10 AM, Michael Segel wrote: They all have their own respective strengths and weaknesses.
Re: migrate cluster to different datacenter
The OP hasn't provided enough information to even start trying to make a real recommendation on how to solve this problem. On Aug 4, 2012, at 7:32 AM, Nitin Kesarwani bumble@gmail.com wrote: Given the size of data, there can be several approaches here: 1. Moving the boxes Not possible, as I suppose the data must be needed for 24x7 analytics. 2. Mirroring the data. This is a good solution. However, if you have data being written/removed continuously (if a part of live system), there are chances of losing some of the data during mirroring happens, unless a) You block writes/updates during that time (if you do so, that would be as good as unplugging and moving the machine around), or, b) Keep a track of what was modified since you started the mirroring process. I would recommend you to go with 2b) because it minimizes downtime. Here is how I think you can do it, by using some of the tools provided by Hadoop itself. a) You can use some fast distributed copying tool to copy large chunks of data. Before you kick-off with this, you can create a utility that tracks the modification of data made to your live system while copying is going on in the background. The utility will log the modifications into an audit trail. b) Once you're done copying the files, allow the new data store replication to catch up by reading the real-time modifications that were made, from your utility's log file. Once sync'ed up you can begin with the minimal downtime by switching off the JobTracker in live cluster so that new files are not created. c) As soon as you reach the last chunk of copying, change the DNS entries so that the hostnames referenced by the Hadoop jobs points to the new location. d) Turn on the JobTracker for the new cluster. e) Enjoy a drink with the money you saved by not using other paid third party solutions and pat your back! ;) The key of the above solution is to make data copying of step a) as fast as possible. Lesser the time, lesser the contents in audit trail, lesser the overall downtime. You can develop some in house solution for this, or use DistCp, provided by Hadoop that uses copies over the data using Map/Reduce. On Sat, Aug 4, 2012 at 3:27 AM, Michael Segel michael_se...@hotmail.comwrote: Sorry at 1PB of disk... compression isn't going to really help a whole heck of a lot. Your networking bandwidth will be your bottleneck. So lets look at the problem. How much down time can you afford? What does your hardware look like? How much space do you have in your current data center? You have 1PB of data. OK, what does the access pattern look like? There are a couple of ways to slice and dice this. How many trucks do you have? On Aug 3, 2012, at 4:24 PM, Harit Himanshu harit.subscripti...@gmail.com wrote: Moving 1 PB of data would take loads of time, - check if this new data center provides something similar to http://aws.amazon.com/importexport/ - Consider multi part uploading of data - consider compressing the data On Aug 3, 2012, at 2:19 PM, Patai Sangbutsarakum wrote: thanks for response. Physical move is not a choice in this case. Purely looking for copying data and how to catch up with the update of a file while it is being migrated. On Fri, Aug 3, 2012 at 12:40 PM, Chen He airb...@gmail.com wrote: sometimes, physically moving hard drives helps. :) On Aug 3, 2012 1:50 PM, Patai Sangbutsarakum silvianhad...@gmail.com wrote: Hi Hadoopers, We have a plan to migrate Hadoop cluster to a different datacenter where we can triple the size of the cluster. 
Currently, our 0.20.2 cluster have around 1PB of data. We use only Java/Pig. I would like to get some input how we gonna handle with transferring 1PB of data to a new site, and also keep up with new files that thrown into cluster all the time. Happy friday !! P
Re: Decommissioning runs forever
Did you change the background bandwidth from 10MB/s to something higher? Worst case is that you can kill the DN and wait 10 mins for the cluster to realize it's down and then rebalance. (It's ugly, but it works.)

On Aug 6, 2012, at 7:59 PM, Chandra Mohan, Ananda Vel Murugan ananda.muru...@honeywell.com wrote: Hi, I tried decommissioning a node in my Hadoop cluster. I am running Apache Hadoop 1.0.2 and ours is a four node cluster. I also have HBase installed in my cluster. I have shut down the region server on this node. For decommissioning, I did the following steps:

* Added the following XML in hdfs-site.xml:

  <property>
    <name>dfs.hosts.exclude</name>
    <value>/full/path/of/host/exclude/file</value>
  </property>

* Ran $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes

But node decommissioning has been running for the last 6 hrs. I don't know when it will get over. I am in need of this node for other activities. From the HDFS health status JSP:

Cluster Summary
338 files and directories, 200 blocks = 538 total. Heap Size is 16.62 MB / 888.94 MB (1%)
Configured Capacity : 1.35 TB
DFS Used : 759.57 MB
Non DFS Used : 179.36 GB
DFS Remaining : 1.17 TB
DFS Used% : 0.05 %
DFS Remaining% : 86.92 %
Live Nodes : 4
Dead Nodes : 0
Decommissioning Nodes : 1
Number of Under-Replicated Blocks : 129

Please share if you have any idea. Thanks a lot. Regards, Anand.C
Re: Decommissioning runs forever
Yup. By default it looks like 10MB/sec. With 1GbE, you could probably push this up to 100MB/sec or even higher, depending on your cluster usage. 10GbE... obviously higher.

On Aug 6, 2012, at 8:15 PM, Chandra Mohan, Ananda Vel Murugan ananda.muru...@honeywell.com wrote: Are you referring to this setting: dfs.balance.bandwidthPerSec?

-Original Message- From: Michael Segel [mailto:michael_se...@hotmail.com] Sent: Tuesday, August 07, 2012 6:36 AM To: common-user@hadoop.apache.org Subject: Re: Decommissioning runs forever

Did you change the background bandwidth from 10MB/s to something higher? Worst case is that you can kill the DN and wait 10 mins for the cluster to realize it's down and then rebalance. (It's ugly, but it works.)

On Aug 6, 2012, at 7:59 PM, Chandra Mohan, Ananda Vel Murugan ananda.muru...@honeywell.com wrote: Hi, I tried decommissioning a node in my Hadoop cluster. I am running Apache Hadoop 1.0.2 and ours is a four node cluster. I also have HBase installed in my cluster. I have shut down the region server on this node. For decommissioning, I did the following steps:

* Added the following XML in hdfs-site.xml:

  <property>
    <name>dfs.hosts.exclude</name>
    <value>/full/path/of/host/exclude/file</value>
  </property>

* Ran $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes

But node decommissioning has been running for the last 6 hrs. I don't know when it will get over. I am in need of this node for other activities. From the HDFS health status JSP:

Cluster Summary
338 files and directories, 200 blocks = 538 total. Heap Size is 16.62 MB / 888.94 MB (1%)
Configured Capacity : 1.35 TB
DFS Used : 759.57 MB
Non DFS Used : 179.36 GB
DFS Remaining : 1.17 TB
DFS Used% : 0.05 %
DFS Remaining% : 86.92 %
Live Nodes : 4
Dead Nodes : 0
Decommissioning Nodes : 1
Number of Under-Replicated Blocks : 129

Please share if you have any idea. Thanks a lot. Regards, Anand.C
Re: migrate cluster to different datacenter
Sorry at 1PB of disk... compression isn't going to really help a whole heck of a lot. Your networking bandwidth will be your bottleneck. So lets look at the problem. How much down time can you afford? What does your hardware look like? How much space do you have in your current data center? You have 1PB of data. OK, what does the access pattern look like? There are a couple of ways to slice and dice this. How many trucks do you have? On Aug 3, 2012, at 4:24 PM, Harit Himanshu harit.subscripti...@gmail.com wrote: Moving 1 PB of data would take loads of time, - check if this new data center provides something similar to http://aws.amazon.com/importexport/ - Consider multi part uploading of data - consider compressing the data On Aug 3, 2012, at 2:19 PM, Patai Sangbutsarakum wrote: thanks for response. Physical move is not a choice in this case. Purely looking for copying data and how to catch up with the update of a file while it is being migrated. On Fri, Aug 3, 2012 at 12:40 PM, Chen He airb...@gmail.com wrote: sometimes, physically moving hard drives helps. :) On Aug 3, 2012 1:50 PM, Patai Sangbutsarakum silvianhad...@gmail.com wrote: Hi Hadoopers, We have a plan to migrate Hadoop cluster to a different datacenter where we can triple the size of the cluster. Currently, our 0.20.2 cluster have around 1PB of data. We use only Java/Pig. I would like to get some input how we gonna handle with transferring 1PB of data to a new site, and also keep up with new files that thrown into cluster all the time. Happy friday !! P
Re: Merge Reducers Output
You really don't want to run a single reducer unless you know that you don't have a lot of mappers. As long as the output data types and structure are the same as the input, you can run your code as the combiner, and then run it again as the reducer. Problem solved with one or two lines of code. If your input and output don't match, then you can use the existing code as a combiner, and then write a new reducer. It could as easily be an identity reducer too. (Don't know the exact problem.) So here's a silly question. Why wouldn't you want to run a combiner? On Jul 31, 2012, at 12:08 AM, Jay Vyas jayunit...@gmail.com wrote: Its not clear to me that you need custom input formats 1) Getmerge might work or 2) Simply run a SINGLE reducer job (have mappers output static final int key=1, or specify numReducers=1). In this case, only one reducer will be called, and it will read through all the values. On Tue, Jul 31, 2012 at 12:30 AM, Bejoy KS bejoy.had...@gmail.com wrote: Hi Why not use 'hadoop fs -getMerge outputFolderInHdfs targetFileNameInLfs' while copying files out of hdfs for the end users to consume. This will merge all the files in 'outputFolderInHdfs' into one file and put it in lfs. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Michael Segel michael_se...@hotmail.com Date: Mon, 30 Jul 2012 21:08:22 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: Merge Reducers Output Why not use a combiner? On Jul 30, 2012, at 7:59 PM, Mike S wrote: Liked asked several times, I need to merge my reducers output files. Imagine I have many reducers which will generate 200 files. Now to merge them together, I have written another map reduce job where each mapper read a complete file in full in memory, and output that and then only one reducer has to merge them together. To do so, I had to write a custom fileinputreader that reads the complete file into memory and then another custom fileoutputfileformat to append the each reducer item bytes together. this how my mapper and reducers looks like public static class MapClass extends MapperNullWritable, BytesWritable, IntWritable, BytesWritable { @Override public void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException { context.write(key, value); } } public static class Reduce extends ReducerNullWritable, BytesWritable, NullWritable, BytesWritable { @Override public void reduce(NullWritable key, IterableBytesWritable values, Context context) throws IOException, InterruptedException { for (BytesWritable value : values) { context.write(NullWritable.get(), value); } } } I still have to have one reducers and that is a bottle neck. Please note that I must do this merging as the users of my MR job are outside my hadoop environment and the result as one file. Is there better way to merge reducers output files? -- Jay Vyas MMSB/UCHC
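[Editor's note: a rough sketch of the combiner-plus-reducer wiring described above. MyMapper and MyReducer are hypothetical classes, and the combiner's input and output types must match the map output types for this to be legal.]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MergeJob {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "merge-output");
            job.setJarByClass(MergeJob.class);

            job.setMapperClass(MyMapper.class);        // hypothetical mapper
            job.setCombinerClass(MyReducer.class);     // same code runs map-side first...
            job.setReducerClass(MyReducer.class);      // ...and again as the reducer,
            // or use the framework's identity reducer instead:
            // job.setReducerClass(Reducer.class);
            job.setNumReduceTasks(1);                  // single output file

            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(BytesWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }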
Re: Merge Reducers Output
Sorry, but the OP was saying he had map/reduce job where the job had multiple reducers where he wanted to then combine the output to a single file. While you could merge the output files, you could also use a combiner then an identity reducer all within the same M/R job. On Jul 31, 2012, at 10:10 AM, Raj Vishwanathan rajv...@yahoo.com wrote: Is there a requirement for the final reduce file to be sorted? If not, wouldn't a map only job ( + a combiner, ) and a merge only job provide the answer? Raj From: Michael Segel michael_se...@hotmail.com To: common-user@hadoop.apache.org Sent: Tuesday, July 31, 2012 5:24 AM Subject: Re: Merge Reducers Output You really don't want to run a single reducer unless you know that you don't have a lot of mappers. As long as the output data types and structure are the same as the input, you can run your code as the combiner, and then run it again as the reducer. Problem solved with one or two lines of code. If your input and output don't match, then you can use the existing code as a combiner, and then write a new reducer. It could as easily be an identity reducer too. (Don't know the exact problem.) So here's a silly question. Why wouldn't you want to run a combiner? On Jul 31, 2012, at 12:08 AM, Jay Vyas jayunit...@gmail.com wrote: Its not clear to me that you need custom input formats 1) Getmerge might work or 2) Simply run a SINGLE reducer job (have mappers output static final int key=1, or specify numReducers=1). In this case, only one reducer will be called, and it will read through all the values. On Tue, Jul 31, 2012 at 12:30 AM, Bejoy KS bejoy.had...@gmail.com wrote: Hi Why not use 'hadoop fs -getMerge outputFolderInHdfs targetFileNameInLfs' while copying files out of hdfs for the end users to consume. This will merge all the files in 'outputFolderInHdfs' into one file and put it in lfs. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Michael Segel michael_se...@hotmail.com Date: Mon, 30 Jul 2012 21:08:22 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: Merge Reducers Output Why not use a combiner? On Jul 30, 2012, at 7:59 PM, Mike S wrote: Liked asked several times, I need to merge my reducers output files. Imagine I have many reducers which will generate 200 files. Now to merge them together, I have written another map reduce job where each mapper read a complete file in full in memory, and output that and then only one reducer has to merge them together. To do so, I had to write a custom fileinputreader that reads the complete file into memory and then another custom fileoutputfileformat to append the each reducer item bytes together. this how my mapper and reducers looks like public static class MapClass extends MapperNullWritable, BytesWritable, IntWritable, BytesWritable { @Override public void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException { context.write(key, value); } } public static class Reduce extends ReducerNullWritable, BytesWritable, NullWritable, BytesWritable { @Override public void reduce(NullWritable key, IterableBytesWritable values, Context context) throws IOException, InterruptedException { for (BytesWritable value : values) { context.write(NullWritable.get(), value); } } } I still have to have one reducers and that is a bottle neck. Please note that I must do this merging as the users of my MR job are outside my hadoop environment and the result as one file. 
Is there better way to merge reducers output files? -- Jay Vyas MMSB/UCHC
Re: task jvm bootstrapping via distributed cache
Hi Stan, If I understood your question... you want to ship a jar to the nodes where the task will run prior to the start of the task? Not sure what it is you're trying to do... Your example isn't really clear. See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html When you pull stuff out of the cache you get the path to the jar. Or you should be able to get it. I'm assuming you're doing this in your setup() method? Can you give a better example, there may be a different way to handle this... On Jul 31, 2012, at 3:50 PM, Stan Rosenberg stan.rosenb...@gmail.com wrote: Forwarding to common-user to hopefully get more exposure... -- Forwarded message -- From: Stan Rosenberg stan.rosenb...@gmail.com Date: Tue, Jul 31, 2012 at 11:55 AM Subject: Re: task jvm bootstrapping via distributed cache To: mapreduce-u...@hadoop.apache.org I am guessing this is either a well-known problem or an edge case. In any case, would it be a bad idea to designate predetermined output paths? E.g., DistributedCache.addCacheFileInto(uri, conf, outputPath) would attempt to copy the cached file into the specified path resolving to a task's local filesystem. Thanks, stan On Mon, Jul 30, 2012 at 6:23 PM, Stan Rosenberg stan.rosenb...@gmail.com wrote: Hi, I am seeking a way to leverage hadoop's distributed cache in order to ship jars that are required to bootstrap a task's jvm, i.e., before a map/reduce task is launched. As a concrete example, let's say that I need to launch with '-javaagent:/path/profiler.jar'. In theory, the task tracker is responsible for downloading cached files onto its local filesystem. However, the absolute path to a given cached file is not known a priori; however, we need the path in order to configure '-javaagent'. Is this currently possible with the distributed cache? If not, is the use case appealing enough to open a jira ticket? Thanks, stan
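[Editor's note: one approach worth trying for the -javaagent case is to symlink the cached jar under a fixed name and reference it with a relative path in the child JVM options. This is only a sketch: it assumes the distributed-cache symlink exists in the task working directory before the child JVM launches, which should be verified on the version in use, and the jar path is hypothetical.]

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapreduce.Job;

    public class ProfiledJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Create a symlink named "profiler.jar" in each task's working directory.
            DistributedCache.createSymlink(conf);
            DistributedCache.addCacheFile(
                    new URI("hdfs:///apps/lib/profiler.jar#profiler.jar"), conf);

            // Relative path, because the child JVM starts in the task work directory.
            conf.set("mapred.child.java.opts", "-Xmx1g -javaagent:./profiler.jar");

            Job job = new Job(conf, "profiled-job");
            // ... set mapper/reducer/input/output as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }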
Re: Merge Reducers Output
Why not use a combiner?

On Jul 30, 2012, at 7:59 PM, Mike S wrote: As asked several times, I need to merge my reducers' output files. Imagine I have many reducers which will generate 200 files. Now to merge them together, I have written another map reduce job where each mapper reads a complete file in full into memory and outputs that, and then only one reducer has to merge them together. To do so, I had to write a custom file input reader that reads the complete file into memory and then another custom file output format to append each reducer item's bytes together. This is how my mapper and reducer look:

public static class MapClass extends Mapper<NullWritable, BytesWritable, IntWritable, BytesWritable> {
    @Override
    public void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }
}

public static class Reduce extends Reducer<NullWritable, BytesWritable, NullWritable, BytesWritable> {
    @Override
    public void reduce(NullWritable key, Iterable<BytesWritable> values, Context context)
            throws IOException, InterruptedException {
        for (BytesWritable value : values) {
            context.write(NullWritable.get(), value);
        }
    }
}

I still have to have one reducer and that is a bottleneck. Please note that I must do this merging, as the users of my MR job are outside my hadoop environment and need the result as one file. Is there a better way to merge reducers' output files?
Re: Counting records
Look at using a dynamic counter. You don't need to set up or declare an enum. The only caveat is that counters are passed back to the JT by each task and are stored in memory. On Jul 23, 2012, at 9:32 AM, Kai Voigt wrote: http://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/
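[Editor's note: a minimal sketch of the dynamic-counter approach; the group name, counter name, and matching criterion here are just examples.]

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CountingMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().contains("ERROR")) {        // whatever the criteria are
                // Dynamic counter: no enum needed, just group and counter names.
                context.getCounter("MyRecords", "Matched").increment(1);
            }
            // Nothing is emitted, so the job can run with zero reducers.
        }
    }

    // Driver side, after job.waitForCompletion(true):
    //   long matched = job.getCounters().findCounter("MyRecords", "Matched").getValue();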
Re: Counting records
If the task fails the counter for that task is not used. So if you have speculative execution turned on and the JT kills a task, it won't affect your end results. Again the only major caveat is that the counters are in memory so if you have a lot of counters... On Jul 23, 2012, at 4:52 PM, Peter Marron wrote: Yeah, I thought about using counters but I was worried about what happens if a Mapper task fails. Does the counter get adjusted to remove any contributions that the failed Mapper made before another replacement Mapper is started? Otherwise in the case of any Mapper failure I'm going to get an overcount am I not? Or is there some way to make sure that counters have the correct semantics in the face of failures? Peter Marron -Original Message- From: Dave Shine [mailto:Dave.Shine@channelintelligence. com] Sent: 23 July 2012 15:35 To: common-user@hadoop.apache.org Subject: RE: Counting records You could just use a counter and never emit anything from the Map(). Use the getCounter(MyRecords, RecordTypeToCount).increment(1) whenever you find the type of record you are looking for. Never call output.collect(). Call the job with reduceTasks(0). When the job finishes, you can programmatically get the values of all counters including the one you create in the Map() method. Dave Shine Sr. Software Engineer 321.939.5093 direct | 407.314.0122 mobile CI Boost(tm) Clients Outperform Online(tm) www.ciboost.com -Original Message- From: Peter Marron [mailto:Peter.Marron@trilliumsoftware. com] Sent: Monday, July 23, 2012 10:25 AM To: common-user@hadoop.apache.org Subject: Counting records Hi, I am a complete noob with Hadoop and MapReduce and I have a question that is probably silly, but I still don't know the answer. For the purposes of discussion I'll assume that I'm using a standard TextInputFormat. (I don't think that this changes things too much.) To simplify (a fair bit) I want to count all the records that meet specific criteria. I would like to use MapReduce because I anticipate large sources and I want to get the performance and reliability that MapReduce offers. So the obvious and simple approach is to have my Mapper check whether each record meets the criteria and emit a 0 or a 1. Then I could use a combiner which accumulates (like a LongSumReducer) and use this as a reducer as well, and I am sure that that would work fine. However it seems massive overkill to have all those 1s and 0s emitted and stored on disc. It seems tempting to have the Mapper accumulate the count for all of the records that it sees and then just emit once at the end the total value. This seems simple enough, except that the Mapper doesn't seem to have any easy way to know when it is presented with the last record. Now I could just make the Mapper take a copy of the OutputCollector for each record called and then in the close method it could do a single emit. However, although, this looks like it would work with the current implementation, there seem to be no guarantees that the collector is valid at the time that the close is called. This just seems ugly. Or I could get the Mapper to record the first offset that it sees and read the split length using report.getInputSplit().getLength() and then it could monitor how far it is through the split and it should be able to detect the last record. It looks like the MapRunner class creates a Mapper object and uses it to process a split, and so it looks like it's safe to store state in the mapper class between invocations of the map method. 
(But is this just an implementation artefact? Is the mapper class supposed to be completely stateless?) Maybe I should have a custom InputFormat class and have it flag the last record by placing some extra information in the key? (Assuming that the InputFormant has enough information from the split to be able to detect the last record, which seems reasonable enough.) Is there some blessed way to do this? Or am I barking up the wrong tree because I should really just generate all those 1s and 0s and accept the overhead? Regards, Peter Marron Trillium Software UK Limited The information contained in this email message is considered confidential and proprietary to the sender and is intended solely for review and use by the named recipient. Any unauthorized review, use or distribution is strictly prohibited. If you have received this message in error, please advise the sender by reply email and delete the message.
Re: Concurrency control
It goes beyond the atomic writes... There isn't the concept of transactions in HBase. He could also be talking about Hive, which would be appropriate for this mailing list... -Mike On Jul 18, 2012, at 11:49 AM, Harsh J wrote: Hi, Do note that there are many users who haven't used Teradata out there and they may not directly pick up what you meant to say here. Since you're speaking of Tables, I am going to assume you mean HBase. If what you're looking for is atomicity, HBase does offer it already. If you want to order requests differently, depending on a condition, the HBase coprocessors (new from Apache HBase 0.92 onwards) provide you an ability to do that too. If your question is indeed specific to HBase, please ask it in a more clarified form on the u...@hbase.apache.org lists. If not HBase, do you mean read/write concurrency over HDFS files? Cause HDFS files do not allow concurrent writers (one active lease per file), AFAICT. On Wed, Jul 18, 2012 at 9:09 PM, saubhagya dey saubhagya@gmail.com wrote: how do i manage concurrency in hadoop like we do in teradata. We need to have a read and write lock when simultaneous the same table is being hit with a read query and write query -- Harsh J
Re: Hive/Hdfs Connector
Have you tried using Hive's thrift server? On Jul 5, 2012, at 10:20 AM, Sandeep Reddy P wrote: We use hive Jdbc drivers to connect to RDMS. But we need our application which generates HQL to connect directly to Hive. On Thu, Jul 5, 2012 at 11:12 AM, Bejoy KS bejoy.had...@gmail.com wrote: Hi Sandeep You can connect to hdfs from a remote machine if that machine is reachable from the cluster, and you have the hadoop jars and right hadoop configuration files. Similarly you can issue HQL programatically from your application using hive jdbc driver. --Original Message-- From: Sandeep Reddy P To: common-user@hadoop.apache.org To: cdh-u...@cloudera.org Cc: t...@cloudwick.com ReplyTo: common-user@hadoop.apache.org Subject: Hive/Hdfs Connector Sent: Jul 5, 2012 20:32 Hi, We have some application which generates SQL queries and connects to RDBMS using connectors like JDBC for mysql. Now if we generate HQL using our application is there any way to connect to Hive/Hdfs using connectors?? I need help on what connectors i have to use? We dont want to pull data from Hive/Hdfs to RDBMS instead we need our application to connect to Hive/Hdfs. -- Thanks, sandeep Regards Bejoy KS Sent from handheld, please excuse typos. -- Thanks, sandeep
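[Editor's note: if the application already speaks JDBC, pointing it at the Hive Thrift server is mostly a connection-string change. A minimal sketch follows; the host, port, and table are assumptions, and this uses the HiveServer1 JDBC driver that shipped with Hive at the time.]

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            // Assumes the Hive Thrift server is running, e.g.: hive --service hiveserver -p 10000
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://hiveserver.example.com:10000/default", "", "");
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT f1, f2, f3 FROM test3");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2) + "\t" + rs.getString(3));
            }
            rs.close();
            stmt.close();
            con.close();
        }
    }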
Re: Multiple cores vs multiple nodes
Hi, First, you have to explain what you mean by 'equivalent' . The short answer is that it depends. The longer answer is that you have to consider cost in your design. The whole issue of design is to maintain the correct ratio of cores to memory and cores to spindles while optimizing the box within the cost, space and hardware (box configurations) limitations. Note that you can sacrifice some of the ratio, however, you will leave some of the performance on the table. On Jul 1, 2012, at 6:13 AM, Safdar Kureishy wrote: Hi, I have a reasonably simple question that I thought I'd post to this list because I don't have enough experience with hardware to figure this out myself. Let's assume that I have 2 separate cluster setups for slave nodes. The master node is a separate machine *outside* these clusters: *Setup A*: 28 nodes, each with a 2-core CPU, 8 GB RAM and 1 SATA drives (1 TB each) *Setup B*: 7 nodes, each with a 8-core CPU, 32 GB Ram and 4 SATA drives (1 TB each) Note that I have maintained the same *core:memory:spindle* ratio above. In essence, setup B has the same overall processing + memory + spindle capacity, but achieved with 4 times fewer nodes. Ignoring the* cost* of each node above, and assuming a 10Gb Ethernet connectivity and the same speed-per-core across nodes in both the scenarios above, are Setup A and Setup B equivalent to each other in the context of setting up a Hadoop cluster? Or will the relative performance be different? Excluding the network connectivity between the nodes, what would be some other criteria that might give one setup an edge over the other, for regular Hadoop jobs? Also, assuming the same type of Hadoop jobs on both clusters, how different would the load experienced by the master node be for each setup above? Thanks in advance, Safdar
Re: Can I remove a folder on HDFS when decommissioning a data node?
Well, there you go... care to open a Jira for it? It would be very helpful... The OP's question presents a very good use case. Thx -Mike On Jun 26, 2012, at 10:00 AM, Harsh J wrote: Hi, On Tue, Jun 26, 2012 at 8:21 PM, Michael Segel michael_se...@hotmail.com wrote: Hi, Yes you can remove a file while there is a node or node(s) being decommissioned. I wonder if there's a way to manually clear out the .trash which may also give you more space. There's no way to do that across all users presently (for each user, -expunge works though), other than running a manual rm (-skipTrash) command for the .Trash files under each /user/* directory. I suppose we can add in an admin command to clear out all trash manually (forcing, other than relying on the periodic auto-emptier thread). -- Harsh J
Re: Hive error when loading csv data.
Alternatively you could write a simple script to convert the CSV to a pipe-delimited file, so that the quoted field "abc,def" comes through as a single field, abc,def.

On Jun 26, 2012, at 2:51 PM, Harsh J wrote: Hive's delimited-fields-format record reader does not handle quoted text that carries the same delimiter within it. Excel supports such records, so it reads them fine. You will need to create your table with a custom InputFormat class that can handle this (try using OpenCSV readers, they support this), instead of relying on Hive to do this for you. If you're successful in your approach, please also consider contributing something back to Hive/Pig to help others.

On Wed, Jun 27, 2012 at 12:37 AM, Sandeep Reddy P sandeepreddy.3...@gmail.com wrote: Hi all, I have a csv file with 46 columns but I'm getting an error when I do some analysis on that data. For simplification I have taken 3 columns and now my csv is like:

c,zxy,xyz
d,"abc,def",abcd

I have created a table for this data using:

hive> create table test3 (f1 string, f2 string, f3 string) row format delimited fields terminated by ',';
OK
Time taken: 0.143 seconds
hive> load data local inpath '/home/training/a.csv' into table test3;
Copying data from file:/home/training/a.csv
Copying file: file:/home/training/a.csv
Loading data to table default.test3
OK
Time taken: 0.276 seconds
hive> select * from test3;
OK
c zxy xyz
d abc def
Time taken: 0.156 seconds

When I do select f2 from test3; my results are:

OK
zxy
abc

but this should be abc,def. When I open the same csv file with Microsoft Excel I get abc,def. How should I solve this error??
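[Editor's note: a sketch of that conversion using the OpenCSV reader Harsh mentions. File names are examples, and this assumes none of the field values contain a pipe character.]

    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.PrintWriter;
    import au.com.bytecode.opencsv.CSVReader;

    public class CsvToPipe {
        public static void main(String[] args) throws Exception {
            CSVReader reader = new CSVReader(new FileReader("/home/training/a.csv"));
            PrintWriter out = new PrintWriter(new FileWriter("/home/training/a.psv"));
            String[] fields;
            while ((fields = reader.readNext()) != null) {   // handles quoted, comma-bearing fields
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < fields.length; i++) {
                    if (i > 0) sb.append('|');
                    sb.append(fields[i]);
                }
                out.println(sb.toString());
            }
            out.close();
            reader.close();
        }
    }

The Hive table would then be created with row format delimited fields terminated by '|'.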
Re: oozie workflow file for teragen and terasort
There's a series of articles on Oozie by Boris Lublinsky over on InfoQ. http://www.infoq.com/articles/introductionOozie On Jun 23, 2012, at 7:42 PM, Hadoop James wrote: I want to be able to submit my teragen and terasort jobs via oozie. I have tried different things in workflow.xml to no avail. Has anybody had any success doing so? Can you share your workflow.xml file ? Many thanks -James
Re: Split brain - is it possible in hadoop?
In your example, you only have one active Name Node. So how would you encounter a 'split brain' scenario? Maybe it would be better if you defined what you mean by a split brain? -Mike On Jun 18, 2012, at 8:30 PM, hdev ml wrote: All hadoop contributors/experts, I am trying to simulate split brain in our installation. There are a few things we want to know 1. Does data corruption happen? 2. If Yes in #1, how to recover from it. 3. What are the corrective steps to take in this situation e.g. killing one namenode etc So to simulate this I took following steps. 1. We already have a healthy test cluster, consisting of 4 machines. One machine runs namenode and a datanode, other machine runs secondarynamenode and a datanode, 3rd runs jobtracker and a datanode, and 4th one just a datanode. 2. Copied the hadoop installation folder to a new location in the datanode. 3. Kept all configurations same in hdfs-site and core-site xmls, except renamed the fs.default.name to a different URI 4. The namenode directory - dfs.name.dir was pointing to the same shared NFS mounted directory to which the main namenode points to. I started this standby namenode using following command bin/hadoop-daemon.sh --config conf --hosts slaves start namenode It errored out saying that the directory is already locked, which is an expected behaviour. The directory has been locked by the original namenode. So I changed the dfs.name.dir to some other folder, and issued the same command. It fails with message - namenode has not been formatted, which is also expected. This makes me think - does splitbrain situation really occur in hadoop? My understanding is that split brain happens because of timeouts on the main namenode. The way it happens is, when the timeout occurs, the HA implementation - Be it Linux HA, Veritas etc., thinks that the main namenode has died and tries to start the standby namenode. The standby namenode starts up and then main namenode comes back from the timeout phase and starts functioning as if nothing happened, giving rise to 2 namenodes in the cluster - Split Brain. Considering the error messages and the above understanding, I cannot point 2 different namenodes to same directory, because the main namenode isn't responding but has locked the directory. So can I safely conclude that split brain does not occur in hadoop? Or am I missing any other situation where split brain happens and the namenode directory is not locked, thus allowing the standby namenode also to start up? Has anybody encountered this? Any help is really appreciated. Harshad
Re: Hardware specs calculation for io
You will want something in between... 8 cores means 8 spindles. 16 cores means 16 spindles. You may want to up the memory, especially if you're running or thinking about running HBase. If you go beyond 4 spindles, you will saturate your 1GBe link. If you think about Type B, you will need 10GBe. On Jun 13, 2012, at 9:36 AM, Sandeep Reddy P wrote: Hi, I need to know difference between two hardware configurations below for 24TB of data. (slave machines only for hadoop,hive and pig) TYPE A: 2 quad core, 32 GB memory, 6 x 1TB drives(6TB / machine) TYPE B: 4 quad core, 48 GB memory, 12 x 1TB drives (12TB / machine) suppose we choose 4 type A machines for 24tb of data and 2 type b machines for 24 tb data. Assuming disk io speed is constant (7200 RPM sata), cost is same for 4Type A and 2 Type B machines. I need which type of machines will give me best results in terms of performance. -- Thanks, sandeep
Re: Hadoop with Sharded MySql
OK, just tossing out some ideas... Take them with a grain of salt... With Hive you can create external tables. Write a custom Java app that creates one thread for each server. Then iterate through each table, selecting the rows you want. You can then easily write the output directly to HDFS in each thread. It's not a map reduce, but it should be fairly efficient. You can even expand on this if you want. Java and JDBC... Sent from my iPhone

On Jun 1, 2012, at 11:30 AM, Srinivas Surasani hivehadooplearn...@gmail.com wrote: All, I'm trying to get data into HDFS directly from a sharded database and expose it to the existing Hive infrastructure. (We are currently doing it this way: mysql -> staging server -> hdfs put commands -> hdfs, which is taking a lot of time.) If we have a way of running a single sqoop job across all shards for a single table, I believe it makes life easier in terms of monitoring and exception handling. Thanks, Srinivas

On Fri, Jun 1, 2012 at 1:27 AM, anil gupta anilgupt...@gmail.com wrote: Hi Sujith, Srinivas is asking how to import data into HDFS using sqoop. I believe he must have thought it out well before designing the entire architecture/solution. He has not specified whether he would like to modify the data or not. Whether to use Hive or HBase is a different question altogether and depends on his use-case. Thanks, Anil

On Thu, May 31, 2012 at 9:52 PM, Sujit Dhamale sujitdhamal...@gmail.com wrote: Hi, instead of pulling 70K tables from mysql into hdfs, take a dump of all 30 tables and put it into the HBase database. If you pulled 70K tables from mysql into hdfs, you would need to use Hive, but modification will not be possible in Hive :( *@ common-user :* please correct me if I am wrong. Kind Regards Sujit Dhamale (+91 9970086652)

On Fri, Jun 1, 2012 at 5:42 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Maybe you can do some VIEWs or unions or merge tables on the mysql side to overcome the aspect of launching so many sqoop jobs.

On Thu, May 31, 2012 at 6:02 PM, Srinivas Surasani hivehadooplearn...@gmail.com wrote: All, We are trying to implement sqoop in our environment, which has 30 mysql sharded databases; all the databases have around 30 databases with 150 tables in each of the databases, which are all sharded (horizontally sharded, meaning the data is divided across all the tables in mysql). The problem is that we have a total of around 70K tables which need to be pulled from mysql into hdfs. So, my question is: is generating 70K sqoop commands and running them in parallel feasible or not? Also, doing incremental updates is going to be like invoking another 70K sqoop jobs, which in turn kick off map-reduce jobs. The main problem is monitoring and managing this huge number of jobs. Can anyone suggest the best way of doing it, or is sqoop a good candidate for this type of scenario? Currently the same process is done by generating tsv files on the mysql server and dumping them onto a staging server, and from there we'll generate hdfs put statements.. Appreciate your suggestions !!! Thanks, Srinivas Surasani

-- Thanks Regards, Anil Gupta -- Regards, -- Srinivas srini...@cloudwick.com
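[Editor's note: a bare-bones sketch of the threaded JDBC-to-HDFS approach Mike describes. Shard URLs, credentials, table name, and output paths are placeholders; error handling and retries are omitted, and for most setups Sqoop would still be the first thing to try.]

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShardExtractor {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver");
            final FileSystem fs = FileSystem.get(new Configuration());
            final List<String> shards = Arrays.asList(
                    "jdbc:mysql://shard01/db", "jdbc:mysql://shard02/db"); // placeholders

            ExecutorService pool = Executors.newFixedThreadPool(shards.size());
            for (int i = 0; i < shards.size(); i++) {
                final String url = shards.get(i);
                final Path out = new Path("/staging/my_table/shard-" + i + ".tsv");
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            Connection con = DriverManager.getConnection(url, "user", "pass");
                            Statement st = con.createStatement();
                            ResultSet rs = st.executeQuery("SELECT * FROM my_table");
                            ResultSetMetaData md = rs.getMetaData();
                            FSDataOutputStream os = fs.create(out, true);
                            while (rs.next()) {
                                StringBuilder row = new StringBuilder();
                                for (int c = 1; c <= md.getColumnCount(); c++) {
                                    if (c > 1) row.append('\t');
                                    row.append(rs.getString(c));
                                }
                                row.append('\n');
                                os.write(row.toString().getBytes("UTF-8"));
                            }
                            os.close();
                            rs.close();
                            st.close();
                            con.close();
                        } catch (Exception e) {
                            e.printStackTrace(); // real code needs proper handling/retries
                        }
                    }
                });
            }
            pool.shutdown();
        }
    }

The /staging/my_table directory could then be exposed as a Hive external table, as suggested above.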
Re: How to Integrate LDAP in Hadoop ?
I believe that their CDH3u3 or later has this... parameter. (Possibly even earlier.) On May 29, 2012, at 6:12 AM, samir das mohapatra wrote: It is cloudera version .20 On Tue, May 29, 2012 at 4:14 PM, Michel Segel michael_se...@hotmail.comwrote: Which release? Version? I believe there are variables in the *-site.xml that allow LDAP integration ... Sent from a remote device. Please excuse any typos... Mike Segel On May 26, 2012, at 7:40 AM, samir das mohapatra samir.help...@gmail.com wrote: Hi All, Did any one work on hadoop with LDAP integration. Please help me for same. Thanks samir
Re: Pragmatic cluster backup strategies?
Hi, That's not a back up strategy. You could still have joe luser take out a key file or directory. What do you do then? On May 29, 2012, at 11:19 AM, Darrell Taylor wrote: Hi, We are about to build a 10 machine cluster with 40Tb of storage, obviously as this gets full actually trying to create an offsite backup becomes a problem unless we build another 10 machine cluster (too expensive right now). Not sure if it will help but we have planned the cabinet into an upper and lower half with separate redundant power, then we plan to put half of the cluster in the top, half in the bottom, effectively 2 racks, so in theory we could lose half the cluster and still have the copies of all the blocks with a replication factor of 3? Apart form the data centre burning down or some other disaster that would render the machines totally unrecoverable, is this approach good enough? I realise this is a very open question and everyone's circumstances are different, but I'm wondering what other peoples experiences/opinions are for backing up cluster data? Thanks Darrell.
Re: is hadoop suitable for us?
You are going to have to put HDFS on top of your SAN. The issue is that you introduce overhead and latencies by having attached storage rather than the drives physically on the bus within the case. Also I'm going to assume that your SAN is using RAID. One of the side effects of using a SAN is that you could reduce your replication factor from 3 to 2. (The SAN already protects you from disk failures if you're using RAID) On May 17, 2012, at 11:10 PM, Pierre Antoine DuBoDeNa wrote: You used HDFS too? or storing everything on SAN immediately? I don't have number of GB/TB (it might be about 2TB so not really that huge) but they are more than 100 million documents to be processed. In a single machine currently we can process about 200.000 docs/day (several parsing, indexing, metadata extraction has to be done). So in the worst case we want to use the 50 VMs to distribute the processing.. 2012/5/17 Sagar Shukla sagar_shu...@persistent.co.in Hi PA, In my environment, we had a SAN storage and I/O was pretty good. So if you have similar environment then I don't see any performance issues. Just out of curiosity - what amount of data are you looking forward to process ? Regards, Sagar -Original Message- From: Pierre Antoine Du Bois De Naurois [mailto:pad...@gmail.com] Sent: Thursday, May 17, 2012 8:29 PM To: common-user@hadoop.apache.org Subject: Re: is hadoop suitable for us? Thanks Sagar, Mathias and Michael for your replies. It seems we will have to go with hadoop even if I/O will be slow due to our configuration. I will try to update on how it worked for our case. Best, PA 2012/5/17 Michael Segel michael_se...@hotmail.com The short answer is yes. The longer answer is that you will have to account for the latencies. There is more but you get the idea.. Sent from my iPhone On May 17, 2012, at 5:33 PM, Pierre Antoine Du Bois De Naurois pad...@gmail.com wrote: We have large amount of text files that we want to process and index (plus applying other algorithms). The problem is that our configuration is share-everything while hadoop has a share-nothing configuration. We have 50 VMs and not actual servers, and these share a huge central storage. So using HDFS might not be really useful as replication will not help, distribution of files have no meaning as all files will be again located in the same HDD. I am afraid that I/O will be very slow with or without HDFS. So i am wondering if it will really help us to use hadoop/hbase/pig etc. to distribute and do several parallel tasks.. or is better to install something different (which i am not sure what). We heard myHadoop is better for such kind of configurations, have any clue about it? For example we now have a central mySQL to check if we have already processed a document and keeping there several metadata. Soon we will have to distribute it as there is not enough space in one VM, But Hadoop/HBase will be useful? we don't want to do any complex join/sort of the data, we just want to do queries to check if already processed a document, and if not to add it with several of it's metadata. We heard sungrid for example is another way to go but it's commercial. We are somewhat lost.. so any help/ideas/suggestions are appreciated. Best, PA 2012/5/17 Abhishek Pratap Singh manu.i...@gmail.com Hi, For your question if HADOOP can be used without HDFS, the answer is Yes. Hadoop can be used with any kind of distributed file system. But I m not able to understand the problem statement clearly to advice my point of view. 
Are you processing text file and saving in distributed database?? Regards, Abhishek On Thu, May 17, 2012 at 1:46 PM, Pierre Antoine Du Bois De Naurois pad...@gmail.com wrote: We want to distribute processing of text files.. processing of large machine learning tasks, have a distributed database as we have big amount of data etc. The problem is that each VM can have up to 2TB of data (limitation of VM), and we have 20TB of data. So we have to distribute the processing, the database etc. But all those data will be in a shared huge central file system. We heard about myHadoop, but we are not sure why is that any different from Hadoop. If we run hadoop/mapreduce without using HDFS? is that an option? best, PA 2012/5/17 Mathias Herberts mathias.herbe...@gmail.com Hadoop does not perform well with shared storage and vms. The question should be asked first regarding what you're trying to achieve, not about your infra. On May 17, 2012 10:39 PM, Pierre Antoine Du Bois De Naurois pad...@gmail.com wrote: Hello, We have about 50 VMs and we want to distribute processing across them. However these VMs share a huge data storage system and thus their virtual HDD are all located in the same computer. Would Hadoop be useful for such configuration? Could we use hadoop
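A minimal sketch of the replication-factor point from the first reply in this thread (dropping from 3 to 2 when the SAN's RAID already covers disk failure); the path is hypothetical, and the cluster-wide default would normally be set in hdfs-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 2);   // default for files created through this client
        FileSystem fs = FileSystem.get(conf);
        // Change the factor of an existing file (directories themselves carry no factor).
        fs.setReplication(new Path("/data/corpus/part-00000"), (short) 2);
      }
    }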
Re: is hadoop suitable for us?
The short answer is yes. The longer answer is that you will have to account for the latencies. There is more but you get the idea.. Sent from my iPhone On May 17, 2012, at 5:33 PM, Pierre Antoine Du Bois De Naurois pad...@gmail.com wrote: We have large amount of text files that we want to process and index (plus applying other algorithms). The problem is that our configuration is share-everything while hadoop has a share-nothing configuration. We have 50 VMs and not actual servers, and these share a huge central storage. So using HDFS might not be really useful as replication will not help, distribution of files have no meaning as all files will be again located in the same HDD. I am afraid that I/O will be very slow with or without HDFS. So i am wondering if it will really help us to use hadoop/hbase/pig etc. to distribute and do several parallel tasks.. or is better to install something different (which i am not sure what). We heard myHadoop is better for such kind of configurations, have any clue about it? For example we now have a central mySQL to check if we have already processed a document and keeping there several metadata. Soon we will have to distribute it as there is not enough space in one VM, But Hadoop/HBase will be useful? we don't want to do any complex join/sort of the data, we just want to do queries to check if already processed a document, and if not to add it with several of it's metadata. We heard sungrid for example is another way to go but it's commercial. We are somewhat lost.. so any help/ideas/suggestions are appreciated. Best, PA 2012/5/17 Abhishek Pratap Singh manu.i...@gmail.com Hi, For your question if HADOOP can be used without HDFS, the answer is Yes. Hadoop can be used with any kind of distributed file system. But I m not able to understand the problem statement clearly to advice my point of view. Are you processing text file and saving in distributed database?? Regards, Abhishek On Thu, May 17, 2012 at 1:46 PM, Pierre Antoine Du Bois De Naurois pad...@gmail.com wrote: We want to distribute processing of text files.. processing of large machine learning tasks, have a distributed database as we have big amount of data etc. The problem is that each VM can have up to 2TB of data (limitation of VM), and we have 20TB of data. So we have to distribute the processing, the database etc. But all those data will be in a shared huge central file system. We heard about myHadoop, but we are not sure why is that any different from Hadoop. If we run hadoop/mapreduce without using HDFS? is that an option? best, PA 2012/5/17 Mathias Herberts mathias.herbe...@gmail.com Hadoop does not perform well with shared storage and vms. The question should be asked first regarding what you're trying to achieve, not about your infra. On May 17, 2012 10:39 PM, Pierre Antoine Du Bois De Naurois pad...@gmail.com wrote: Hello, We have about 50 VMs and we want to distribute processing across them. However these VMs share a huge data storage system and thus their virtual HDD are all located in the same computer. Would Hadoop be useful for such configuration? Could we use hadoop without HDFS? so that we can retrieve and store everything in the same storage? Thanks, PA
Re: freeze a mapreduce job
Just a quick note... If your task is currently occupying a slot, the only way to release the slot is to kill the specific task. If you are using FS, you can move the task to another queue and/or you can lower the job's priority which will cause new tasks to spawn slower than other jobs so you will eventually free up the cluster. There isn't a way to 'freeze' or stop a job mid state. Is the issue that the job has a large number of slots, or is it an issue of the individual tasks taking a long time to complete? If its the latter, you will probably want to go to a capacity scheduler over the fair scheduler. HTH -Mike On May 11, 2012, at 6:08 AM, Harsh J wrote: I do not know about the per-host slot control (that is most likely not supported, or not yet anyway - and perhaps feels wrong to do), but the rest of the needs can be doable if you use schedulers and queues/pools. If you use FairScheduler (FS), ensure that this job always goes to a special pool and when you want to freeze the pool simply set the pool's maxMaps and maxReduces to 0. Likewise, control max simultaneous tasks as you wish, to constrict instead of freeze. When you make changes to the FairScheduler configs, you do not need to restart the JT, and you may simply wait a few seconds for FairScheduler to refresh its own configs. More on FS at http://hadoop.apache.org/common/docs/current/fair_scheduler.html If you use CapacityScheduler (CS), then I believe you can do this by again making sure the job goes to a specific queue, and when needed to freeze it, simply set the queue's maximum-capacity to 0 (percentage) or to constrict it, choose a lower, positive percentage value as you need. You can also refresh CS to pick up config changes by refreshing queues via mradmin. More on CS at http://hadoop.apache.org/common/docs/current/capacity_scheduler.html Either approach will not freeze/constrict the job immediately, but should certainly prevent it from progressing. Meaning, their existing running tasks during the time of changes made to scheduler config will continue to run till completion but further tasks scheduling from those jobs shall begin seeing effect of the changes made. P.s. A better solution would be to make your job not take as many days, somehow? :-) On Fri, May 11, 2012 at 4:13 PM, Rita rmorgan...@gmail.com wrote: I have a rather large map reduce job which takes few days. I was wondering if its possible for me to freeze the job or make the job less intensive. Is it possible to reduce the number of slots per host and then I can increase them overnight? tia -- --- Get your facts first, then you can distort them as you please.-- -- Harsh J
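A minimal sketch of the "lower the job's priority" option using the old MR1 JobClient API; the job id is hypothetical, and the pool/queue moves themselves are done in the scheduler configuration rather than in code:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class DemoteJob {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        RunningJob job = client.getJob(JobID.forName("job_201205110001_0042")); // hypothetical id
        // Already-running tasks keep their slots; new tasks are scheduled behind other jobs.
        job.setJobPriority("VERY_LOW");
      }
    }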
Re: freeze a mapreduce job
I haven't seen any. Haven't really had to test that... On May 11, 2012, at 9:03 AM, Shi Yu wrote: Is there any risk to suppress a job too long in FS?I guess there are some parameters to control the waiting time of a job (such as timeout ,etc.), for example, if a job is kept idle for more than 24 hours is there a configuration deciding kill/keep that job? Shi On 5/11/2012 6:52 AM, Rita wrote: thanks. I think I will investigate capacity scheduler. On Fri, May 11, 2012 at 7:26 AM, Michael Segelmichael_se...@hotmail.comwrote: Just a quick note... If your task is currently occupying a slot, the only way to release the slot is to kill the specific task. If you are using FS, you can move the task to another queue and/or you can lower the job's priority which will cause new tasks to spawn slower than other jobs so you will eventually free up the cluster. There isn't a way to 'freeze' or stop a job mid state. Is the issue that the job has a large number of slots, or is it an issue of the individual tasks taking a long time to complete? If its the latter, you will probably want to go to a capacity scheduler over the fair scheduler. HTH -Mike On May 11, 2012, at 6:08 AM, Harsh J wrote: I do not know about the per-host slot control (that is most likely not supported, or not yet anyway - and perhaps feels wrong to do), but the rest of the needs can be doable if you use schedulers and queues/pools. If you use FairScheduler (FS), ensure that this job always goes to a special pool and when you want to freeze the pool simply set the pool's maxMaps and maxReduces to 0. Likewise, control max simultaneous tasks as you wish, to constrict instead of freeze. When you make changes to the FairScheduler configs, you do not need to restart the JT, and you may simply wait a few seconds for FairScheduler to refresh its own configs. More on FS at http://hadoop.apache.org/common/docs/current/fair_scheduler.html If you use CapacityScheduler (CS), then I believe you can do this by again making sure the job goes to a specific queue, and when needed to freeze it, simply set the queue's maximum-capacity to 0 (percentage) or to constrict it, choose a lower, positive percentage value as you need. You can also refresh CS to pick up config changes by refreshing queues via mradmin. More on CS at http://hadoop.apache.org/common/docs/current/capacity_scheduler.html Either approach will not freeze/constrict the job immediately, but should certainly prevent it from progressing. Meaning, their existing running tasks during the time of changes made to scheduler config will continue to run till completion but further tasks scheduling from those jobs shall begin seeing effect of the changes made. P.s. A better solution would be to make your job not take as many days, somehow? :-) On Fri, May 11, 2012 at 4:13 PM, Ritarmorgan...@gmail.com wrote: I have a rather large map reduce job which takes few days. I was wondering if its possible for me to freeze the job or make the job less intensive. Is it possible to reduce the number of slots per host and then I can increase them overnight? tia -- --- Get your facts first, then you can distort them as you please.-- -- Harsh J
Re: Reduce Hangs at 66%
Well That was one of the things I had asked. ulimit -a says it all. But you have to do this for the users... hdfs, mapred, and hadoop (Which is why I asked about which flavor.) On May 3, 2012, at 7:03 PM, Raj Vishwanathan wrote: Keith What is the the output for ulimit -n? Your value for number of open files is probably too low. Raj From: Keith Thompson kthom...@binghamton.edu To: common-user@hadoop.apache.org Sent: Thursday, May 3, 2012 4:33 PM Subject: Re: Reduce Hangs at 66% I am not sure about ulimits, but I can answer the rest. It's a Cloudera distribution of Hadoop 0.20.2. The HDFS has 9 TB free. In the reduce step, I am taking keys in the form of (gridID, date), each with a value of 1. The reduce step just sums the 1's as the final output value for the key (It's counting how many people were in the gridID on a certain day). I have been running other more complicated jobs with no problem, so I'm not sure why this one is being peculiar. This is the code I used to execute the program from the command line (the source is a file on the hdfs): hadoop jar jarfile driver source /thompson/outputDensity/density1 The program then executes the map and gets to 66% of the reduce, then stops responding. The cause of the error seems to be: Error from attempt_201202240659_6432_r_00_1: java.io.IOException: The temporary job-output directory hdfs://analytix1:9000/thompson/outputDensity/density1/_temporary doesn't exist! I don't understand what the _temporary is. I am assuming it's something Hadoop creates automatically. On Thu, May 3, 2012 at 5:02 AM, Michel Segel michael_se...@hotmail.comwrote: Well... Lots of information but still missing some of the basics... Which release and version? What are your ulimits set to? How much free disk space do you have? What are you attempting to do? Stuff like that. Sent from a remote device. Please excuse any typos... Mike Segel On May 2, 2012, at 4:49 PM, Keith Thompson kthom...@binghamton.edu wrote: I am running a task which gets to 66% of the Reduce step and then hangs indefinitely. Here is the log file (I apologize if I am putting too much here but I am not exactly sure what is relevant): 2012-05-02 16:42:52,975 INFO org.apache.hadoop.mapred.JobTracker: Adding task (REDUCE) 'attempt_201202240659_6433_r_00_0' to tip task_201202240659_6433_r_00, for tracker 'tracker_analytix7:localhost.localdomain/127.0.0.1:56515' 2012-05-02 16:42:53,584 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201202240659_6433_m_01_0' has completed task_201202240659_6433_m_01 successfully. 2012-05-02 17:00:47,546 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201202240659_6432_r_00_0: Task attempt_201202240659_6432_r_00_0 failed to report status for 1800 seconds. Killing! 
2012-05-02 17:00:47,546 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202240659_6432_r_00_0' 2012-05-02 17:00:47,546 INFO org.apache.hadoop.mapred.JobTracker: Adding task (TASK_CLEANUP) 'attempt_201202240659_6432_r_00_0' to tip task_201202240659_6432_r_00, for tracker 'tracker_analytix4:localhost.localdomain/127.0.0.1:44204' 2012-05-02 17:00:48,763 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202240659_6432_r_00_0' 2012-05-02 17:00:48,957 INFO org.apache.hadoop.mapred.JobTracker: Adding task (REDUCE) 'attempt_201202240659_6432_r_00_1' to tip task_201202240659_6432_r_00, for tracker 'tracker_analytix5:localhost.localdomain/127.0.0.1:59117' 2012-05-02 17:00:56,559 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201202240659_6432_r_00_1: java.io.IOException: The temporary job-output directory hdfs://analytix1:9000/thompson/outputDensity/density1/_temporary doesn't exist! at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250) at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:240) at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:438) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416) at org.apache.hadoop.mapred.Child$4.run(Child.java:268) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) at org.apache.hadoop.mapred.Child.main(Child.java:262) 2012-05-02 17:00:59,903 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202240659_6432_r_00_1' 2012-05-02 17:00:59,906 INFO org.apache.hadoop.mapred.JobTracker: Adding task (REDUCE) 'attempt_201202240659_6432_r_00_2' to tip task_201202240659_6432_r_00, for tracker
Re: Changing the Java heap
Not sure of your question. Java child Heap size is independent of how files are split on HDFS. I suggest you look at Tom White's book on HDFS and how files are split in to blocks. Blocks are split on set sizes. 64MB by default. Your record boundaries are not necessarily on file block boundaries so one process may read the rest of the last record in block A and then complete reading it at the start of block B. A different task may start with block B and skip the first n bytes until it hits the start of a record. HTH -Mike On Apr 26, 2012, at 3:46 PM, Barry, Sean F wrote: Within my small 2 node cluster I set up my 4 core slave node to have 4 task trackers and I also limited my java heap size to -Xmx1024m Is there a possibility that when the data gets broken up that it will break it at a place in the file that is not a whitespace? Or is that already handled when the data on HDFS is broken up into blocks? -SB
Re: Feedback on real world production experience with Flume
Gee Edward, what about putting a link to a company website or your blog in your signature... ;-) Seriously one could also mention fuse, right? ;-) Sent from my iPhone On Apr 22, 2012, at 7:15 AM, Edward Capriolo edlinuxg...@gmail.com wrote: I think this is valid to talk about for example one need not need a decentralized collector if they can just write log directly to decentralized files in a decentralized file system. In any case it was not even a hard vendor pitch. It was someone describing how they handle centralized logging. It stated facts and it was informative. Lets face it, if fuse-mounting-hdfs or directly soft mounting NFS in a way that performs well many of the use cases for flume and scribe like tools would be gone. (not all but many) I never knew there was a rule that discussing alternative software on a mailing list. It seems like a closed minded thing. I also doubt the ASF would back a rule like that. Are we not allowed to talk about EMR or S3, or am I not even allowed to mention S3? Can flume run on ec2 and log to S3? (oops party foul I guess I cant ask that.) Edward On Sun, Apr 22, 2012 at 12:59 AM, Alexander Lorenz wget.n...@googlemail.com wrote: no. That is the Flume Open Source Mailinglist. Not a vendor list. NFS logging has nothing to do with decentralized collectors like Flume, JMS or Scribe. sent via my mobile device On Apr 22, 2012, at 12:23 AM, Edward Capriolo edlinuxg...@gmail.com wrote: It seems pretty relevant. If you can directly log via NFS that is a viable alternative. On Sat, Apr 21, 2012 at 11:42 AM, alo alt wget.n...@googlemail.com wrote: We decided NO product and vendor advertising on apache mailing lists! I do not understand why you'll put that closed source stuff from your employe in the room. It has nothing to do with flume or the use cases! -- Alexander Lorenz http://mapredit.blogspot.com On Apr 21, 2012, at 4:06 PM, M. C. Srivas wrote: Karl, since you did ask for alternatives, people using MapR prefer to use the NFS access to directly deposit data (or access it). Works seamlessly from all Linuxes, Solaris, Windows, AIX and a myriad of other legacy systems without having to load any agents on those machines. And it is fully automatic HA Since compression is built-in in MapR, the data gets compressed coming in over NFS automatically without much fuss. Wrt to performance, can get about 870 MB/s per node if you have 10GigE attached (of course, with compression, the effective throughput will surpass that based on how good the data can be squeezed). On Fri, Apr 20, 2012 at 3:14 PM, Karl Hennig khen...@baynote.com wrote: I am investigating automated methods of moving our data from the web tier into HDFS for processing, a process that's performed periodically. I am looking for feedback from anyone who has actually used Flume in a production setup (redundant, failover) successfully. I understand it is now being largely rearchitected during its incubation as Apache Flume-NG, so I don't have full confidence in the old, stable releases. The other option would be to write our own tools. What methods are you using for these kinds of tasks? Did you write your own or does Flume (or something else) work for you? I'm also on the Flume mailing list, but I wanted to ask these questions here because I'm interested in Flume _and_ alternatives. Thank you!
Re: Multiple data centre in Hadoop
I don't know of any open source solution in doing this... And yeah its something one can't talk about ;-) On Apr 19, 2012, at 4:28 PM, Robert Evans wrote: Where I work we have done some things like this, but none of them are open source, and I have not really been directly involved with the details of it. I can guess about what it would take, but that is all it would be at this point. --Bobby On 4/17/12 5:46 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote: Thanks bobby, I m looking for something like this. Now the question is what is the best strategy to do Hot/Hot or Hot/Warm. I need to consider the CPU and Network bandwidth, also needs to decide from which layer this replication should start. Regards, Abhishek On Mon, Apr 16, 2012 at 7:08 AM, Robert Evans ev...@yahoo-inc.com wrote: Hi Abhishek, Manu is correct about High Availability within a single colo. I realize that in some cases you have to have fail over between colos. I am not aware of any turn key solution for things like that, but generally what you want to do is to run two clusters, one in each colo, either hot/hot or hot/warm, and I have seen both depending on how quickly you need to fail over. In hot/hot the input data is replicated to both clusters and the same software is run on both. In this case though you have to be fairly sure that your processing is deterministic, or the results could be slightly different (i.e. No generating if random ids). In hot/warm the data is replicated from one colo to the other at defined checkpoints. The data is only processed on one of the grids, but if that colo goes down the other one can take up the processing from where ever the last checkpoint was. I hope that helps. --Bobby On 4/12/12 5:07 AM, Manu S manupk...@gmail.com wrote: Hi Abhishek, 1. Use multiple directories for *dfs.name.dir* *dfs.data.dir* etc * Recommendation: write to *two local directories on different physical volumes*, and to an *NFS-mounted* directory - Data will be preserved even in the event of a total failure of the NameNode machines * Recommendation: *soft-mount the NFS* directory - If the NFS mount goes offline, this will not cause the NameNode to fail 2. *Rack awareness* https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf On Thu, Apr 12, 2012 at 2:18 AM, Abhishek Pratap Singh manu.i...@gmail.comwrote: Thanks Robert. Is there a best practice or design than can address the High Availability to certain extent? ~Abhishek On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans ev...@yahoo-inc.com wrote: No it does not. Sorry On 4/11/12 1:44 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote: Hi All, Just wanted if hadoop supports more than one data centre. This is basically for DR purposes and High Availability where one centre goes down other can bring up. Regards, Abhishek -- Thanks Regards *Manu S* SI Engineer - OpenSource HPC Wipro Infotech Mob: +91 8861302855Skype: manuspkd www.opensourcetalk.co.in
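As a side note on the NameNode metadata advice quoted above (two local volumes plus a soft-mounted NFS directory), dfs.name.dir takes a comma-separated list; a sketch with hypothetical paths, though in practice this lives in hdfs-site.xml on the NameNode:

    import org.apache.hadoop.conf.Configuration;

    public class NameDirExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // NameNode image/edits written to two local volumes and one NFS mount;
        // losing any single copy does not lose the filesystem metadata.
        conf.set("dfs.name.dir", "/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn");
        System.out.println(conf.get("dfs.name.dir"));
      }
    }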
Re: Help me with architecture of a somewhat non-trivial mapreduce implementation
How 'large' or rather in this case small is your file? If you're on a default system, the block size is 64MB. So if your file is <= 64MB, you end up with 1 block, and you will only have 1 mapper. On Apr 19, 2012, at 10:10 PM, Sky wrote: Thanks for your reply. After I sent my email, I found a fundamental defect - in my understanding of how MR is distributed. I discovered that even though I was firing off 15 COREs, the map job - which is the most expensive part of my processing - was run only on 1 core. To start my map job, I was creating a single file with the following data: 1 storage:/root/1.manif.txt 2 storage:/root/2.manif.txt 3 storage:/root/3.manif.txt ... 4000 storage:/root/4000.manif.txt I thought that each of the available COREs would be assigned a map job from the top of that same file, one at a time, and as soon as one CORE was done, it would get the next map job. However, it looks like I need to split the file into that number of pieces. Now while that's clearly trivial to do, I am not sure how I can detect at runtime how many splits I need to do, and also how to deal with adding new CORES at runtime. Any suggestions? (it doesn't have to be a file, it can be a list, etc). This all would be much easier to debug if somehow I could get my log4j logs for my mappers and reducers. I can see log4j for my main launcher, but not sure how to enable it for mappers and reducers. Thx - Akash
example: File storage:/root/summary/RANDOMHOUSE_1999_2.01.txt contains: storage:/root/1.manif/1223.folder/2143.Ebook.ebk storage:/root/2.manif/2133.folder/5449.Ebook.ebk storage:/root/2.manif/2133.folder/5450.Ebook.ebk etc.. and File storage:/root/summary/PENGUIN_2001_3.12.txt contains: storage:/root/19.manif/2223.folder/4343.Ebook.ebk storage:/root/13.manif/9733.folder/2149.Ebook.ebk storage:/root/21.manif/3233.folder/1110.Ebook.ebk etc 4. finally, I also want to output statistics such that: publisher_year_ebook-version COUNT_OF_URLs PENGUIN_2001_3.12 250,111 RANDOMHOUSE_1999_2.01 11,322 etc Here is how I implemented: * My launcher gets list of MM manifests * My Mapper gets one manifest. --- It reads the manifest, within a WHILE loop, --- fetches each EBOOK, and obtain attributes from each ebook, --- updates the manifest for that ebook --- context.write(new Text(RANDOMHOUSE_1999_2.01), new Text(storage:/root/1.manif/1223.folder/2143.Ebook.ebk)) --- Once all ebooks in the manifest are read, it saves the updated Manifest, and exits * My Reducer gets the RANDOMHOUSE_1999_2.01 and a list of ebooks urls. --- It writes a new file storage:/root/summary/RANDOMHOUSE_1999_2.01.txt with all the storage urls for the ebooks --- It also does a
Re: Help me with architecture of a somewhat non-trivial mapreduce implementation
If the file is small enough you could read it into a java object like a list and write your own input format that takes a list object as its input and then lets you specify the number of mappers. On Apr 19, 2012, at 11:34 PM, Sky wrote: My file for the input to the mapper is very small - all it has is urls to a list of manifests. The task for the mappers is to fetch each manifest, then fetch files using the urls from the manifests and process them. Besides passing around lists of files, I am not really accessing the disk. It should be RAM, network, and CPU (unzip, parse xml, extract attributes). So is my only choice to break the input file and submit multiple files (if I have 15 cores, I should split the file with urls into 15 files? also how does it look in code?)? The two drawbacks are - some cores might finish early and stay idle, and I don't know how to deal with dynamically increasing/decreasing cores. Thx - Sky -Original Message- From: Michael Segel Sent: Thursday, April 19, 2012 8:49 PM To: common-user@hadoop.apache.org Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce implementation How 'large' or rather in this case small is your file? If you're on a default system, the block size is 64MB. So if your file is <= 64MB, you end up with 1 block, and you will only have 1 mapper. On Apr 19, 2012, at 10:10 PM, Sky wrote: Thanks for your reply. After I sent my email, I found a fundamental defect - in my understanding of how MR is distributed. I discovered that even though I was firing off 15 COREs, the map job - which is the most expensive part of my processing - was run only on 1 core. To start my map job, I was creating a single file with the following data: 1 storage:/root/1.manif.txt 2 storage:/root/2.manif.txt 3 storage:/root/3.manif.txt ... 4000 storage:/root/4000.manif.txt I thought that each of the available COREs would be assigned a map job from the top of that same file, one at a time, and as soon as one CORE was done, it would get the next map job. However, it looks like I need to split the file into that number of pieces. Now while that's clearly trivial to do, I am not sure how I can detect at runtime how many splits I need to do, and also how to deal with adding new CORES at runtime. Any suggestions? (it doesn't have to be a file, it can be a list, etc). This all would be much easier to debug if somehow I could get my log4j logs for my mappers and reducers. I can see log4j for my main launcher, but not sure how to enable it for mappers and reducers. Thx - Akash -Original Message- From: Robert Evans Sent: Thursday, April 19, 2012 2:08 PM To: common-user@hadoop.apache.org Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce implementation From what I can see your implementation seems OK, especially from a performance perspective. Depending on what storage: is, it is likely to be your bottleneck, not the hadoop computations. Because you are writing files directly instead of relying on Hadoop to do it for you, you may need to deal with error cases that Hadoop will normally hide from you, and you will not be able to turn on speculative execution. Just be aware that a map or reduce task may have problems in the middle, and be relaunched. So when you are writing out your updated manifest be careful to not replace the old one until the new one is completely ready and will not fail, or you may lose data.
You may also need to be careful in your reduce if you are writing directly to the file there too, but because it is not a read modify write, but just a write it is not as critical. --Bobby Evans On 4/18/12 4:56 PM, Sky USC sky...@hotmail.com wrote: Please help me architect the design of my first significant MR task beyond word count. My program works well. but I am trying to optimize performance to maximize use of available computing resources. I have 3 questions at the bottom. Project description in an abstract sense (written in java): * I have MM number of MANIFEST files available on storage:/root/1.manif.txt to 4000.manif.txt * Each MANIFEST in turn contains varilable number EE of URLs to EBOOKS (range could be 1 - 50,000 EBOOKS urls per MANIFEST) -- stored on storage:/root/1.manif/1223.folder/5443.Ebook.ebk So we are talking about millions of ebooks My task is to: 1. Fetch each ebook, and obtain a set of 3 attributes per ebook (example: publisher, year, ebook-version). 2. Update each of the EBOOK entry record in the manifest - with the 3 attributes (eg: ebook 1334 - publisher=aaa year=bbb, ebook-version=2.01) 3. Create a output file such that the named publisher_year_ebook-version contains a list of all ebook urls that met that criteria. example: File storage:/root/summary/RANDOMHOUSE_1999_2.01.txt contains: storage:/root/1.manif/1223.folder/2143.Ebook.ebk storage
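For the "split the manifest list across mappers" problem in this thread, one stock alternative to writing a custom list InputFormat is NLineInputFormat, which gives each map task a fixed number of input lines instead of one task per 64MB block. A sketch against the old 0.20 mapred API; the paths are hypothetical and IdentityMapper stands in for the poster's manifest-fetching mapper:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;

    public class ManifestDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ManifestDriver.class);
        conf.setJobName("manifest-fetch");
        // One input line (one manifest URL) per map task: 4000 lines -> 4000 map tasks,
        // scheduled across whatever slots exist, rather than one mapper per HDFS block.
        conf.setInputFormat(NLineInputFormat.class);
        conf.setInt("mapred.line.input.format.linespermap", 1);
        conf.setMapperClass(IdentityMapper.class);   // swap in the real manifest-fetching mapper
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path("/root/manifest-list.txt"));
        FileOutputFormat.setOutputPath(conf, new Path("/root/summary"));
        JobClient.runJob(conf);
      }
    }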
Re: start hadoop slave over WAN
Probably a timeout. Really, not a good idea to do this in the first place... Sent from my iPhone On Mar 30, 2012, at 12:35 PM, Ben Cuthbert bencuthb...@ymail.com wrote: Strange thing is the datanode in the remote location has a log zero bytes. So nothing there. Its strange it is like the master does and ssh, login, and then attempts to start it but nothing. Maybe there is a timeout? On 30 Mar 2012, at 18:22, kasi subrahmanyam wrote: Try checking the logs in the logs folder for the datanode.It might give some lead. Maybe there is a mismatch between the namespace iDs in the system and user itself while starting the datanode. On Fri, Mar 30, 2012 at 10:32 PM, Ben Cuthbert bencuthb...@ymail.comwrote: All We have a master in one region and we are trying to start a slave datanode in another region. When executing the scripts it looks to login to the remote host, but never starts the datanode. When executing hbase tho it does work. Is there a timeout or something with hadoop?
Re: where are my logging output files going to?
You don't want users actually running anything directly on the cluster. You would set up some machine to launch jobs. Essentially any sort of Linux machine where you can install Hadoop, but you don't run any jobs... Sent from my iPhone On Mar 28, 2012, at 3:30 AM, Jane Wayne jane.wayne2...@gmail.com wrote: what do you mean by an edge node? do you mean any node that is not the master node (or NameNode or JobTracker node)? On Wed, Mar 28, 2012 at 3:51 AM, Michel Segel michael_se...@hotmail.comwrote: First you really don't want to launch the job from the cluster but from an edge node. To answer your question, in a word, yes, you should have a consistent set of configuration files as possible, noting that overtime this may not be possible as hardware configs may change, Sent from a remote device. Please excuse any typos... Mike Segel On Mar 27, 2012, at 8:42 PM, Jane Wayne jane.wayne2...@gmail.com wrote: if i have a hadoop cluster of 10 nodes, do i have to modify the /hadoop/conf/log4j.properties files on ALL 10 nodes to be the same? currently, i ssh into the master node to execute a job. this node is the only place where i have modified the logj4.properties file. i notice that although my log files are being created, nothing is being written to them. when i test on cygwin, the logging works, however, when i go to a live cluster (i.e. amazon elastic mapreduce), the logging output on the master node no longer works. i wonder if logging is happening at each slave/task node? could someone explain logging or point me to the documentation discussing this issue?
Re: MR job launching is slower
Hi, First, it sounds like you have 2 6 core CPUs giving you 12 cores not 24. Even though the OS reports 24 cores that's the hyper threading and not real cores. This becomes an issue with respect to tuning. To answer your question ... You have a single 1TB HD. That's going to be a major bottleneck in terms of performance. You usually want to have at least 1 drive per core. With a 12 core box that's 12 spindles. How large is your hadoop job's jar? This gets pushed around to all of the nodes. Bigger jars take longer to process and handle. Having said that, the start up time isn't out of whack. It depends on what job you're launching and what you are doing within the job. Remember that the tasks have to report back to the JT. Do you have Ganglia up and running? You will probably see a high load on the CPUs and then a lot of Wait IOs. HTH -Mike On Mar 20, 2012, at 5:40 AM, praveenesh kumar wrote: I have 10 node cluster ( around 24 CPUs, 48 GB RAM, 1 TB HDD, 10 GB ethernet connection) After triggering any MR job, its taking like 3-5 seconds to launch ( I mean the time when I can see any MR job completion % on the screen). I know internally its trying to launch the job,intialize mappers, loading data etc. What I want to know - Is it a default/desired/expected hadoop behavior or there are ways in which I can decrease this startup time ? Also I feel like my hadoop jobs should run faster, but I am still not able to make it as fast as it should be according to me ? I did some tunning also, following are the parameters I am playing around these days but still I feel there are something missing that I can still use: dfs.block.size: mapred.compress.map.output mapred.map/reduce.tasks.speculative.execution mapred.tasktracker.map/reduce.tasks.maximum: mapred.child.java.opts io.sort.mb: io.sort.factor: mapred.reduce.parallel.copies: mapred.job.reuse.jvm.num.tasks: Thanks, Praveenesh
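For reference, the knobs listed above are ordinary MR1 job/site properties; a sketch of setting a few of them per job, where the values are purely illustrative rather than recommendations for this particular cluster:

    import org.apache.hadoop.mapred.JobConf;

    public class TuningExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setBoolean("mapred.compress.map.output", true);  // compress intermediate map output
        conf.setInt("io.sort.mb", 256);                       // map-side sort buffer (MB)
        conf.setInt("io.sort.factor", 25);                    // streams merged at once
        conf.setInt("mapred.reduce.parallel.copies", 10);     // parallel shuffle fetch threads
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);    // -1 = reuse task JVMs indefinitely
        conf.set("mapred.child.java.opts", "-Xmx1024m");      // per-task heap
        System.out.println(conf.get("io.sort.mb"));
      }
    }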
Re: Issue when starting services on CDH3
Are you running the init.d scripts as root and what is order of the services you want to start. Sent from my iPhone On Mar 15, 2012, at 11:22 AM, Manish Bhoge manishbh...@rocketmail.com wrote: Ys, I understand the order and I formatted namenode before starting services. As I suspect there may be ownership and an access issue. Not able to nail down issue exactly. I also have question why there are 2 routes to start services. When we have start-all.sh script then why need to go to init.d to start services?? Thank you, Manish Sent from my BlackBerry, pls excuse typo -Original Message- From: Manu S manupk...@gmail.com Date: Thu, 15 Mar 2012 21:43:26 To: common-user@hadoop.apache.org; manishbh...@rocketmail.com Reply-To: common-user@hadoop.apache.org Subject: Re: Issue when starting services on CDH3 Did you check the service status? Is it like dead, but pid exist? Did you check the ownership and permissions for the dfs.name.dir,dfs.data.dir,mapped.local.dir etc ? The order for starting daemons are like this: 1 namenode 2 datanode 3 jobtracker 4 tasktracker Did you format the namenode before starting? On Mar 15, 2012 9:31 PM, Manu S manupk...@gmail.com wrote: Dear manish Which daemons are not starting? On Mar 15, 2012 9:21 PM, Manish Bhoge manishbh...@rocketmail.com wrote: I have CDH3 installed in standalone mode. I have install all hadoop components. Now when I start services (namenode,secondary namenode,job tracker,task tracker) I can start gracefully from /usr/lib/hadoop/ ./bin/start-all.sh. But when start the same servises from /etc/init.d/hadoop-0.20-* then I unable to start. Why? Now I want to start Hue also which is in init.d that also I couldn't start. Here I suspect authentication issue. Because all the services in init.d are under root user and root group. Please suggest I am stuck here. I tried hive and it seems it running fine. Thanks Manish. Sent from my BlackBerry, pls excuse typo
Re: Hadoop pain points?
What? The lack of documentation is what made Hadoop, really HBase, a lot of fun:-) You know what they say... Not guts, no glory... I'm sorry, while I agree w Harsh, I just don't want to sound like some old guy talking about how when they were young, they had to walk in chest high snow, in a blizzard, uphill (both ways)to and from school ... And how you newbies have it so much better... ;-P Sent from my iPhone On Mar 2, 2012, at 6:42 PM, Russell Jurney russell.jur...@gmail.com wrote: +2 Russell Jurney http://datasyndrome.com On Mar 2, 2012, at 4:38 PM, Mohit Anchlia mohitanch...@gmail.com wrote: +1 On Fri, Mar 2, 2012 at 4:09 PM, Harsh J ha...@cloudera.com wrote: Since you ask about anything in general, when I forayed into using Hadoop, my biggest pain was lack of documentation clarity and completeness over the MR and DFS user APIs (and other little points). It would be nice to have some work done to have one example or semi-example for every single Input/OutputFormat, Mapper/Reducer implementations, etc. added to the javadocs. I believe examples and snippets help out a ton (tons more than explaining just behavior) to new devs. On Fri, Mar 2, 2012 at 9:45 PM, Kunaal kunalbha...@alumni.cmu.edu wrote: I am doing a general poll on what are the most prevalent pain points that people run into with Hadoop? These could be performance related (memory usage, IO latencies), usage related or anything really. The goal is to look for what areas this platform could benefit the most in the near future. Any feedback is much appreciated. Thanks, Kunal. -- Harsh J
Re: Hadoop Datacenter Setup
If you are going this route why not net boot the nodes in the cluster? Sent from my iPhone On Jan 30, 2012, at 8:17 PM, Patrick Angeles patrickange...@gmail.com wrote: Hey Aaron, I'm still skeptical when it comes to flash drives, especially as pertains to Hadoop. The write cycle limit is impractical to make them usable for dfs.data.dir and mapred.local.dir, and as you pointed out, you can't use them for logs either. If you put HADOOP_LOG_DIR in /mnt/d0, you will still have to shut down the TT and DN in order to replace the drive. So you may as well just carve out 100GB from that drive and put your root filesystem there. I'd say that unless you're running some extremely CPU-heavy workloads, you should consider getting more than 3 drives per node. Most shops get 6-12 drives per node (with dual quad or hex core processors). Then you can sacrifice one of the drives for swap and the OS. I'd keep the RegionServer heap at 12GB or under to mitigate long GC pauses (the bigger the heap, the longer the eventual full GC). Finally, you can run Hive on the same cluster as HBase, just be wary of load spikes due to MR jobs and configure properly. You don't want a large Hive query to knock out your RegionServers thereby causing cascading failures. - P On Mon, Jan 30, 2012 at 6:44 PM, Aaron Tokhy aaron.to...@resonatenetworks.com wrote: I forgot to add: Are there use cases for using a swap partition for Hadoop nodes if our combined planned heap size is not expected to go over 24GB for any particular node type? I've noticed that if HBase starts to GC, it will pause for unreasonable amounts of time if old pages get swapped to disk, causing the regionserver to crash (which we've mitigated by setting vm.swappiness=5). Our slave node template will have a 1 GB heap Task Tracker, a 1 GB heap Data Node and a 12-16GB heap RegionServer. We assume the OS memory overhead is 1 GB. We added another 1 GB for combined Java VM overhead across services, which comes up to be around a max of 16-20GB used. This gives us around 4-8GB for tasks that would work with HBase. We may also use Hive on the same cluster for queries. On 01/30/2012 05:40 PM, Aaron Tokhy wrote: Hi, Our group is trying to set up a prototype for what will eventually become a cluster of ~50 nodes. Anyone have experiences with a stateless Hadoop cluster setup using this method on CentOS? Are there any caveats with a read-only root file system approach? This would save us from having to keep a root volume on every system (whether it is installed on a USB thumb drive, or a RAID 1 of bootable / partitions). http://citethisbook.net/Red_**Hat_Introduction_to_Stateless_**Linux.htmlhttp://citethisbook.net/Red_Hat_Introduction_to_Stateless_Linux.html We would like to keep the OS root file system separate from the Hadoop filesystem(s) for maintenance reasons (we can hot swap disks while the system is running) We were also considering installing the root filesystem on USB flash drives, making it persistent yet separate. However we would identify and turn off anything that would cause excess writes to the root filesystem given the limited number of USB flash drive write cycles (keep IO writes to the root filesystem to a minimum). We would do this by storing the Hadoop logs on the same filesystem/drive as what we specify in dfs.data.dir/dfs.name.dir. 
In the end we would have something like this: USB (MS DOS partition table + 1 ext2/3/4 partition) /dev/sda /dev/sda1mounted as /(possibly read-only) /dev/sda2mounted as /var(read-write) /dev/sda3mounted as /tmp(read-write) Hadoop Disks (no partition table or GPT since these are 3TB disks) /dev/sdb/mnt/d0 /dev/sdc/mnt/d1 /dev/sdd/mnt/d2 /mnt/d0 would contain all Hadoop logs. Hadoop configuration files would still reside on / Any issues with such a setup? Are there better ways of achieving this?
Re: Send Data and Message to all nodes
Unless I'm missing something, it sounds like the OP wants to chain jobs where the results from one job are the input to another... Of course it's Sun morning and I haven't had my first cup of coffee so I could be misinterpreting the OP's question. If the OP wanted to send the data to each node and use it as a lookup table, his initial output is on HDFS so he could just open the file and read it in to memory in Mapper.setup(). Note: if the file is too big, then you probably wouldn't want to use distributed cache any way... Sent from my iPhone On Jan 28, 2012, at 7:11 PM, Ravi Prakash ravihad...@gmail.com wrote: Take a look at distributed cache for distributing data to all nodes. I'm not sure what you mean by messages. The MR programming paradigm is different from MPI. http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html#DistributedCache On Sat, Jan 28, 2012 at 5:52 AM, Oliaei oli...@gmail.com wrote: Hi, I want to run a MR procedure under Hadoop and then send some messages data to all of nodes and after that run anther MR. What's the easiest way for sending data to all or some nodes? Or Is there any way to do that under Hadoop without using other frameworks? Regards, Oliaei oli...@gmail.com -- View this message in context: http://old.nabble.com/Send-Data-and-Message-to-all-nodes-tp33219535p33219535.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
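A minimal sketch of the "read the first job's output in Mapper.setup()" idea, assuming that output is small, tab-separated, and at a known (hypothetical) path; for large lookup data this in-memory approach would not be appropriate, as noted above:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
      private final Map<String, String> lookup = new HashMap<String, String>();

      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        // Load the previous job's (small) output from HDFS into memory once per task.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path prev = new Path("/jobs/first-pass/part-r-00000");   // hypothetical output file
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(prev)));
        String line;
        while ((line = in.readLine()) != null) {
          String[] kv = line.split("\t", 2);
          if (kv.length == 2) lookup.put(kv[0], kv[1]);
        }
        in.close();
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String hit = lookup.get(value.toString());
        if (hit != null) context.write(value, new Text(hit));
      }
    }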
Re: Connect to HDFS running on a different Hadoop-Version
BigInsights? ... Ok, I'll be nice ... :-) Ok, so of I understand your question, you want to use a single HDFS file system to be used by different 'Hadoop' frameworks ? (derivatives) First, it doesn't make sense. I mean it really doesn't make any sense. Second.. I don't think it would be possible except in the rare case that the two flavors of Hadoop were from the same code stream and similar release level. As a hypothetical example, Oracle forks their own distro from Cloudera but makes relatively few changes under the hood. But getting back to the first point... Not a good idea when you considers that it violates the KISS principle to design. IMHO, you would be better off w two clusters using distcp. Sent from my iPhone On Jan 25, 2012, at 5:38 AM, Romeo Kienzler ro...@ormium.de wrote: Dear List, we're trying to use a central HDFS storage in order to be accessed from various other Hadoop-Distributions. Do you think this is possible? We're having trouble, but not related to different RPC-Versions. When trying to access a Cloudera CDH3 Update 2 (cdh3u2) HDFS from BigInsights 1.3 we're getting this error: Bad connection to FS. Command aborted. Exception: Call to localhost.localdomain/127.0.0.1:50070 failed on local exception: java.io.EOFException java.io.IOException: Call to localhost.localdomain/127.0.0.1:50070 failed on local exception: java.io.EOFException at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142) at org.apache.hadoop.ipc.Client.call(Client.java:1110) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) at $Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:111) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:213) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:180) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89) at com.ibm.biginsights.hadoop.patch.PatchedDistributedFileSystem.initialize(PatchedDistributedFileSystem.java:19) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1514) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:111) at org.apache.hadoop.fs.FsShell.init(FsShell.java:82) at org.apache.hadoop.fs.FsShell.run(FsShell.java:1785) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.fs.FsShell.main(FsShell.java:1939) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:815) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:724) But we've already replaced the client hadoop-common.jar's with the Cloudera ones. Please note also that we're getting an EOFException and not an RPC.VersionMismatch. FsShell.java: try { init(); } catch (RPC.VersionMismatch v) { System.err.println(Version Mismatch between client and server + ... command aborted.); return exitCode; } catch (IOException e) { System.err.println(Bad connection to FS. command aborted.); System.err .println(Bad connection to FS. Command aborted. 
Exception: + e.getLocalizedMessage()); e.printStackTrace(); return exitCode; } Any ideas?
Re: Connect to HDFS running on a different Hadoop-Version
Alex, I said I would be nice and hold my tongue when it comes to IBM and their IM pillar products... :-) You could write a client that talks to two different hadoop versions but then you would be using hftp which is what you have under the hood in distcp... But that doesn't seem to be what he wants to do... I can only imagine why he is asking this question... ;-) Sent from my iPhone On Jan 25, 2012, at 7:32 AM, alo alt wget.n...@googlemail.com wrote: Insight is a IBM related product, based on an fork of hadoop I think. The mixing of totally different stacks make no sense. And will not work, I guess. - Alex -- Alexander Lorenz http://mapredit.blogspot.com On Jan 25, 2012, at 1:12 PM, Harsh J wrote: Hello Romeo, Inlineā¦ On Wed, Jan 25, 2012 at 4:07 PM, Romeo Kienzler ro...@ormium.de wrote: Dear List, we're trying to use a central HDFS storage in order to be accessed from various other Hadoop-Distributions. The HDFS you've setup, what 'distribution' is that from? You will have to use that particular version's jar across all client applications you use, else you'll run into RPC version incompatibilities. Do you think this is possible? We're having trouble, but not related to different RPC-Versions. It should be possible _most of the times_ by replacing jars at the client end to use the one that runs your cluster, but there may be minor API incompatibilities between certain versions that can get in the way. Purely depends on your client application and its implementation. If it sticks to using the publicly supported APIs, you are mostly fine. When trying to access a Cloudera CDH3 Update 2 (cdh3u2) HDFS from BigInsights 1.3 we're getting this error: BigInsights runs off IBM's own patched Hadoop sources if I am right, and things can get a bit tricky there. See the following points: Bad connection to FS. Command aborted. Exception: Call to localhost.localdomain/127.0.0.1:50070 failed on local exception: java.io.EOFException java.io.IOException: Call to localhost.localdomain/127.0.0.1:50070 failed on local exception: java.io.EOFException This is surely an RPC issue. The call tries to read off a field, but gets no response, EOFs and dies. We have more descriptive error messages with the 0.23 version onwards, but the problem here is that your IBM client jar is not the same as your cluster's jar. The mixture won't work. com.ibm.biginsights.hadoop.patch.PatchedDistributedFileSystem.initialize(PatchedDistributedFileSystem.java:19) ^^ This is what am speaking of. Your client (BigInsights? Have not used it reallyā¦) is using an IBM jar with their supplied 'PatchDistributedFileSystem', and that is probably incompatible with the cluster's HDFS RPC protocols. I do not know enough about IBM's custom stuff to know for sure it would work if you replace it with your clusters' jar. But we've already replaced the client hadoop-common.jar's with the Cloudera ones. Apparently not. Your strace shows that com.ibm.* classes are still being pulled. My guess is that BigInsights would not work with anything non IBM, but I have not used it to know for sure. If they have a user community, you can ask there if there is a working way to have BigInsights run against Apache/CDH/etc. distributions. For CDH specific questions, you may ask at https://groups.google.com/a/cloudera.org/group/cdh-user/topics instead of the Apache lists here. -- Harsh J Customer Ops. Engineer, Cloudera
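A sketch of the "client that talks to both versions over hftp" route mentioned above: hftp is a read-only HTTP filesystem, so the copy reads through it and writes with the destination cluster's native client. Hostnames, ports, and paths are hypothetical:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class CrossVersionCopy {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read from the remote cluster over hftp (HTTP, read-only, version tolerant)...
        FileSystem src = FileSystem.get(URI.create("hftp://old-nn.example.com:50070"), conf);
        // ...and write through the local cluster's own RPC client.
        FileSystem dst = FileSystem.get(URI.create("hdfs://new-nn.example.com:8020"), conf);
        FileUtil.copy(src, new Path("/data/in"), dst, new Path("/data/in"), false, conf);
      }
    }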
Re: Problems with timeout when a Hadoop job generates a large number of key-value pairs
Steve, Ok, first your client connection to the cluster is a non-issue. If you go in to /etc/Hadoop/conf That supposed to be a little h but my iPhone knows what's best... Look and see what you have set for your bandwidth... I forget which parameter but there are only a couple that deal with bandwidth. I think it's set to 1mb or 10mb by default. You need to up it to 100-200mb if you're on a 1 GB network. That would solve your balancing issue. See if that helps... Sent from my iPhone On Jan 20, 2012, at 4:57 PM, Steve Lewis lordjoe2...@gmail.com wrote: On Fri, Jan 20, 2012 at 12:18 PM, Michel Segel michael_se...@hotmail.comwrote: Steve, If you want me to debug your code, I'll be glad to set up a billable contract... ;-) What I am willing to do is to help you to debug your code.. The code seems to work well for small input files and is basically a standard sample. Did you time how long it takes in the Mapper.map() method? The reason I asked this is to first confirm that you are failing within a map() method. It could be that you're just not updating your status... The map method starts out running very fast - generateSubstrings, the only interesting part, runs in milliseconds. The only other thing the mapper does is context.write, which SHOULD update status. You said that you are writing many output records for a single input. So let's take a look at your code. Are all writes of the same length? Meaning that in each iteration of Mapper.map() you will always write K number of rows? Because in my sample the input strings are the same length - every call to the mapper will write the same number of records. If so, ask yourself why some iterations are taking longer and longer? I believe the issue may relate to local storage getting filled and Hadoop taking a LOT of time to rebalance the output. Assuming the string length is the same on each map there is no reason for some iterations to be longer than others. Note: I'm assuming that the time for each iteration is taking longer than the previous... I assume so as well since in my cluster the first 50% of mapping goes pretty fast. Or am I missing something? How do I get timing of map iterations?? -Mike Sent from a remote device. Please excuse any typos... Mike Segel On Jan 20, 2012, at 11:16 AM, Steve Lewis lordjoe2...@gmail.com wrote: We have been having problems with mappers timing out after 600 sec when the mapper writes many more, say thousands of records for every input record - even when the code in the mapper is small and fast. I have no idea what could cause the system to be so slow and am reluctant to raise the 600 sec limit without understanding why there should be a timeout when all MY code is very fast. I am enclosing a small sample which illustrates the problem. It will generate a 4GB text file on hdfs if the input file does not exist or is not at least that size and this will take some time (hours in my configuration) - then the code is essentially wordcount but instead of finding and emitting words - the mapper emits all substrings of the input data - this generates a much larger output data and number of output records than wordcount generates. Still, the amount of data emitted is no larger than other data sets I know Hadoop can handle. All mappers on my 8 node cluster eventually timeout after 600 sec - even though I see nothing in the code which is even a little slow and suspect that any slow behavior is in the called Hadoop code. 
This is similar to a problem we have in bioinformatics where a colleague saw timeouts on his 50 node cluster. I would appreciate any help from the group. Note - if you have a text file at least 4 GB the program will take that as an input without trying to create its own file. /* */ import org.apache.hadoop.conf.*; import org.apache.hadoop.fs.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapreduce.*; import org.apache.hadoop.mapreduce.lib.input.*; import org.apache.hadoop.mapreduce.lib.output.*; import org.apache.hadoop.util.*; import java.io.*; import java.util.*; /** * org.systemsbiology.hadoop.SubstringGenerator * * This illustrates an issue we are having where a mapper generating a much larger volume of * data and number of records times out even though the code is small, simple and fast * * NOTE!!! as written the program will generate a 4GB file in hdfs with good input data - * this is done only if the file does not exist but may take several hours. It will only be * done once. After that the failure is fairly fast * * What this will do is count unique Substrings of lines of length * between MIN_SUBSTRING_LENGTH and MAX_SUBSTRING_LENGTH by generating all * substrings and then using the word count algorithm * What is interesting is that the
Re: Problems with timeout when a Hadoop job generates a large number of key-value pairs
That's the one ... Sent from my iPhone On Jan 20, 2012, at 6:28 PM, Paul Ho p...@walmart.com wrote: I think the balancing bandwidth property you are looking for is in hdfs-site.xml: <property> <name>dfs.balance.bandwidthPerSec</name> <value>402653184</value> </property> Set the value that makes most sense for your NIC. But I thought this is only for balancing.
Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec
But Steve, it is your code... :-) Here is a simple test... Set your code up where the run fails... Add a simple timer to see how long you spend in the Mapper.map() method. Only print out the time if it's greater than, let's say, 500 seconds... The other thing is to update a dynamic counter in Mapper.map(). This would force a status update to be sent to the JT. Also you don't give a lot of detail... Are you writing out to an HBase table??? HTH -Mike On Jan 18, 2012, at 6:21 PM, Steve Lewis wrote: 1) I do a lot of progress reporting 2) Why would the job succeed when the only change in the code is if(NumberWrites++ % 100 == 0) context.write(key,value); comment out the test allowing full writes and the job fails Since every write is a report I assume that something in the write code or other hadoop code for dealing with output is failing. I do increment a counter for every write or in the case of the above code potential write What I am seeing is that wherever the timeout occurs it is not in a place where I am capable of inserting more reporting On Wed, Jan 18, 2012 at 4:01 PM, Leonardo Urbina lurb...@mit.edu wrote: Perhaps you are not reporting progress throughout your task. If you happen to run a large enough job you hit the default timeout mapred.task.timeout (that defaults to 10 min). Perhaps you should consider reporting progress in your mapper/reducer by calling progress() on the Reporter object. Check tip 7 of this link: http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/ Hope that helps, -Leo Sent from my phone On Jan 18, 2012, at 6:46 PM, Steve Lewis lordjoe2...@gmail.com wrote: I KNOW it is a task timeout - what I do NOT know is WHY merely cutting the number of writes causes it to go away. It seems to imply that some context.write operation or something downstream from that is taking a huge amount of time and that is all hadoop internal code - not mine so my question is why should increasing the number and volume of writes cause a task to time out On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez t...@supertom.com wrote: Sounds like mapred.task.timeout? The default is 10 minutes. http://hadoop.apache.org/common/docs/current/mapred-default.html Thanks, Tom On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis lordjoe2...@gmail.com wrote: The map tasks fail timing out after 600 sec. I am processing one 9 GB file with 16,000,000 records. Each record (think of it as a line) generates hundreds of key value pairs. The job is unusual in that the output of the mapper in terms of records or bytes is orders of magnitude larger than the input. I have no idea what is slowing down the job except that the problem is in the writes. If I change the job to merely bypass a fraction of the context.write statements the job succeeds.
This is one map task that failed and one that succeeded - I cannot understand how a write can take so long or what else the mapper might be doing. JOB FAILED WITH TIMEOUT: Parser: TotalProteins 90,103, NumberFragments 10,933,089. FileSystemCounters: HDFS_BYTES_READ 67,245,605, FILE_BYTES_WRITTEN 444,054,807. Map-Reduce Framework: Combine output records 10,033,499, Map input records 90,103, Spilled Records 10,032,836, Map output bytes 3,520,182,794, Combine input records 10,844,881, Map output records 10,933,089. Same code but fewer writes, JOB SUCCEEDED: Parser: TotalProteins 90,103, NumberFragments 206,658,758. FileSystemCounters: FILE_BYTES_READ 111,578,253, HDFS_BYTES_READ 67,245,607, FILE_BYTES_WRITTEN 220,169,922. Map-Reduce Framework: Combine output records 4,046,128, Map input records 90,103, Spilled Records 4,046,128, Map output bytes 662,354,413, Combine input records 4,098,609, Map output records 2,066,588. Any bright ideas? -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com
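Below is a minimal sketch of the two diagnostics suggested in this thread: timing each Mapper.map() call and bumping a dynamic counter so a status report goes back to the JobTracker on every write. It assumes the new (org.apache.hadoop.mapreduce) API; the class name, the counter group and the substring loop are hypothetical stand-ins for the actual job.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TimedSubstringMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private static final int MAX_SUBSTRING_LENGTH = 10;   // hypothetical limit

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        long start = System.currentTimeMillis();
        String line = value.toString();
        for (int i = 0; i < line.length(); i++) {
            int end = Math.min(line.length(), i + MAX_SUBSTRING_LENGTH);
            for (int j = i + 1; j <= end; j++) {
                context.write(new Text(line.substring(i, j)), ONE);
                // A counter increment is reported back to the framework, which
                // keeps the task from looking idle between writes.
                context.getCounter("Debug", "substringWrites").increment(1);
            }
        }
        long elapsed = System.currentTimeMillis() - start;
        if (elapsed > 500L * 1000L) {   // only log map() calls longer than ~500 seconds
            System.err.println("map() took " + elapsed + " ms for key " + key);
        }
    }
}

If the timer never fires but tasks still die at 600 seconds, the time is being spent downstream of map(), in the map-side sort/spill path, which would match the very large Spilled Records count in the failed run above rather than anything in the user code.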
Re: Does hadoop installations need to be at same locations in cluster ?
Sure, You could do that, but in doing so, you will make your life a living hell. Literally. Think about it... You will have to manually manage each node's config files... So if something goes wrong you will have a hard time diagnosing the issue. Why make life harder? Why not just do the simple thing and make all of your DN the same? Sent from my iPhone On Dec 23, 2011, at 6:51 AM, praveenesh kumar praveen...@gmail.com wrote: When installing hadoop on slave machines, do we have to install hadoop at the same location on each machine? Can we have the hadoop installation at different locations on different machines in the same cluster? If yes, what things do we have to take care of in that case? Thanks, Praveenesh
Re: Hadoop configuration
Class project due? Sorry, second set of questions on setting up a 2 node cluster... Sent from my iPhone On Dec 22, 2011, at 3:25 AM, Humayun kabir humayun0...@gmail.com wrote: someone please help me to configure hadoop (core-site.xml, hdfs-site.xml, mapred-site.xml etc.). please provide some examples. it is badly needed, because i run a 2 node cluster. when i run the wordcount example it gives the result: too much fetch failure.
RE: Does hadoop installations need to be at same locations in cluster ?
Ok, Here's the thing... 1) When building the cluster, you want to be consistent. 2) Location of $HADOOP_HOME is configurable. So you can place it anywhere. Putting the software in two different locations isn't a good idea because you now have to set it up with a unique configuration per node. It would be faster and make your life a lot easier to put the software in the same location on *all* machines. So my suggestion would be to bite the bullet and rebuild your cluster. HTH -Mike Date: Fri, 23 Dec 2011 19:47:45 +0530 Subject: Re: Does hadoop installations need to be at same locations in cluster ? From: praveen...@gmail.com To: common-user@hadoop.apache.org What I mean to say is, does hadoop internally assume that all installations on each node need to be in the same location? I had hadoop installed in different locations on 2 different nodes. I configured the hadoop config files to be a part of the same cluster. But when I started hadoop on the master, I saw it was also searching for the hadoop startup scripts in the same location as on the master. Do we have any workaround in this kind of situation or do I have to reinstall hadoop in the same location as the master? Thanks, Praveenesh On Fri, Dec 23, 2011 at 6:26 PM, Michael Segel michael_se...@hotmail.com wrote: Sure, You could do that, but in doing so, you will make your life a living hell. Literally. Think about it... You will have to manually manage each node's config files... So if something goes wrong you will have a hard time diagnosing the issue. Why make life harder? Why not just do the simple thing and make all of your DN the same? Sent from my iPhone On Dec 23, 2011, at 6:51 AM, praveenesh kumar praveen...@gmail.com wrote: When installing hadoop on slave machines, do we have to install hadoop at the same location on each machine? Can we have the hadoop installation at different locations on different machines in the same cluster? If yes, what things do we have to take care of in that case? Thanks, Praveenesh
RE: More cores Vs More Nodes ?
Tom, Look, I've said this before and I'm going to say it again. Your knowledge of Hadoop is purely academic. It may be ok to talk to C level execs who visit the San Jose IM Lab or in Markham, but when you give answers on issues you don't have first hand practical experience, you end up doing more harm than good. The problem is that too many people blindly except what they see on the web as fact when its not always accurate and may not suit their needs. I've lost count on the number of hours I've spent in meetings trying to undo the damage cause by someone saying ... but FB does it this way...therefore that's how we should do it. Now Michael St.Ack is a pretty smart guy. He knows his shit. He's extremely credible. However when he says that FB does something a specific way, that is because FB has certain requirements and the solution works for them. It doesn't mean that it will be the best solution for your customer/client. And Tom, if we pull out your business card, you have a nice fancy title with IBM. So you instantly have some credibility. Unfortunately, you're no St.Ack. (I'd put a smile face but I'm actually trying to be serious.) Even in this post, you continue to go down the wrong path. Unfortunately I don't have time to lecture you on why what you said is wrong and that your thoughts on cluster design are way off base. Oh and I tease you because frankly, you deserve it. I have to apologize to everyone on the list, but in the past, you failed to actually stop and take the hint that maybe you need to rethink your views on Hadoop. That had you had practical experience setting up actual clusters (Not EC2 clusters) you would have the necessary understanding of what can go wrong and how to fix it. If I get time, I'll have to find my copy of Up Front by Bill Maudlin. There's a cartoon that really fits you. Later To: common-user@hadoop.apache.org Subject: RE: More cores Vs More Nodes ? From: tdeut...@us.ibm.com Date: Wed, 14 Dec 2011 11:40:51 -0800 Your eagerness to insult is throwing you off track here Michael. For example, the workload profile of a cluster doing heavy NLP is very different than one doing serving as a destination for large scale application/web logs. Ditto for PC risk modeling vs smart meter use cases, etc etc...Those are not general purpose clusters. You may - and should I'd say - have the NLP use cases in a common analytics environment (internal cloud model) for sharing of methods/skills, but putting orthogonal use cases on that cluster is not inherently a best practice. How those clusters should be built does vary, and no it is not uncommon to have focused use cases like that. If you know it is going to be a general purpose cluster then do build it in a balanced spec. Tom Deutsch Program Director Information Management Big Data Technologies IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com
RE: More cores Vs More Nodes ?
Sorry, But having read the thread, I am going to have to say that this is definitely a silly question. NOTE THE FOLLOWING: Silly questions are not a bad thing. I happen to ask them all the time. ;-) Here's why I say it's a silly question... Hadoop is a cost effective solution when you build out 'commodity' servers. Now here's the rub. Commodity servers means something different to each person, and I don't want to get in to a debate on its definition. When building out a cluster, too many people gloss over the complexity. 1U vs 2U in box size. Do you go with a 1/2 size MB or a full size MB? How many disks per node. How much memory. Physical plant limitations. (Available rack space, costs if this is going in to a colo...) Power consumption, budget... At a client, back in 2009, our first cluster was built on whatever hardware we could get. It was 5 blade servers w SCSI/SAS 2.5" disks where we split each blade so we could have 10 nodes. Yeah, it was a mistake and a royal pain. But we got the cluster up and could do some simple PoCs. But we then came up with our reference architecture for further PoCs and development. We built out the DN w 8 cores, 32GB, and 4 x 2TB 3.5" drives. Why? Because based on our constraints, this gave us the optimal combination of price and performance. Note: We knew we would leave some performance on the table. It was a conscious decision to leave some performance on the table so that we could maximize the number of nodes to fit within our budget. We chose 2TB drives because at the time they offered the best price/performance ratio. Today, that may be different. We chose 32GB because at the time it was the sweet spot in memory prices. Today w 3 channel memory it looks like 36GB is the sweet spot. Of course YMMV. (It could be 48GB...) Moving forward, I would reconsider the design because the price points on hardware have changed. That's going to be your driving factor. You want to look at 64 Core boxes, then you need 256GB of memory. Think of how many disks you have to add. (64-128 disks) Now then ask yourself is this a commodity box? Now price that box out. Then price out how many 8 core 1U boxes you can buy. Kind of puts it in to perspective, doesn't it? ;-) The reason why I call this a 'silly question' is that you're attempting to look at your cluster by focusing on only one variable. This is not to say that it's a bad question, because it forces you to realize that there are definitely lots of other options that you have to consider. HTH -Mike Date: Tue, 13 Dec 2011 20:25:17 -0600 Subject: Re: More cores Vs More Nodes ? From: airb...@gmail.com To: common-user@hadoop.apache.org Hi Brad This is a really interesting experiment. I am curious why you did not use 2 cores on each machine but 32 nodes. That would make the number of CPU cores in the two groups equal. Chen On Tue, Dec 13, 2011 at 7:15 PM, Brad Sarsfield b...@bing.com wrote: Hi Prashant, In each case I had a single tasktracker per node. I oversubscribed the total tasks per tasktracker/node by 1.5 x # of cores. So for the 64 core allocation comparison: In A: 8 cores; Each machine had a single tasktracker with 8 map / 4 reduce slots for 12 task slots total per machine x 8 machines (including head node) In B: 2 cores; Each machine had a single tasktracker with 2 map / 1 reduce slots for 3 slots total per machine x 29 machines (including head node which was running 8 cores) The experiment was done in a cloud hosted environment running a set of VMs.
~Brad -Original Message- From: Prashant Kommireddi [mailto:prash1...@gmail.com] Sent: Tuesday, December 13, 2011 9:46 AM To: common-user@hadoop.apache.org Subject: Re: More cores Vs More Nodes ? Hi Brad, how many tasktrackers did you have on each node in both cases? Thanks, Prashant Sent from my iPhone On Dec 13, 2011, at 9:42 AM, Brad Sarsfield b...@bing.com wrote: Praveenesh, Your question is not naïve; in fact, optimal hardware design can ultimately be a very difficult question to answer on what would be better. If you made me pick one without much information I'd go for more machines. But... It all depends; and there is no right answer :) More machines +May run your workload faster +Will give you a higher degree of reliability protection from node / hardware / hard drive failure. +More aggregate IO capabilities - capex / opex may be higher than allocating more cores More cores +May run your workload faster +More cores may allow for more tasks to run on the same machine +More cores/tasks may reduce network contention and increase task to task data flow performance. Notice May run your workload faster is in both; as it can be very workload dependent. My Experience: I did a recent experiment and found that given the same number of cores (64) with the exact
RE: More cores Vs More Nodes ?
Aw Tommy, Actually no. You really don't want to do this. If you actually ran a cluster and worked in the real world, you would find that if you purposely build a cluster for one job, there will be a mandate that some other group needs to use the cluster and that their job has different performance issues and your cluster is now suboptimal for their jobs... Perhaps you meant that you needed to think about the purpose of the cluster? That is, do you want to minimize the nodes but maximize the disk space per node and use the cluster as your backup cluster? (Assuming that you are considering your DR and BCP in your design.) The problem with your answer is that a job has a specific meaning within the Hadoop world. You should have asked what is the purpose of the cluster. I agree w Brad, that it depends ... But the factors which will impact your cluster design are more along the lines of the purpose of the cluster and then the budget along with your IT constraints. IMHO it's better to avoid building purpose built clusters. You end up not being able to recycle the hardware in to new clusters easily. But hey what do I know? ;-) To: common-user@hadoop.apache.org Subject: RE: More cores Vs More Nodes ? From: tdeut...@us.ibm.com Date: Tue, 13 Dec 2011 09:46:49 -0800 It also helps to know the profile of your job in how you spec the machines. So in addition to Brad's response you should consider if you think your jobs will be more storage or compute oriented. Tom Deutsch Program Director Information Management Big Data Technologies IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Brad Sarsfield b...@bing.com 12/13/2011 09:41 AM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org common-user@hadoop.apache.org cc Subject RE: More cores Vs More Nodes ? Praveenesh, Your question is not naïve; in fact, optimal hardware design can ultimately be a very difficult question to answer on what would be better. If you made me pick one without much information I'd go for more machines. But... It all depends; and there is no right answer :) More machines +May run your workload faster +Will give you a higher degree of reliability protection from node / hardware / hard drive failure. +More aggregate IO capabilities - capex / opex may be higher than allocating more cores More cores +May run your workload faster +More cores may allow for more tasks to run on the same machine +More cores/tasks may reduce network contention and increase task to task data flow performance. Notice May run your workload faster is in both; as it can be very workload dependent. My Experience: I did a recent experiment and found that given the same number of cores (64) with the exact same network / machine configuration; A: I had 8 machines with 8 cores B: I had 28 machines with 2 cores (and 1x8 core head node) B was able to outperform A by 2x using teragen and terasort. These machines were running in a virtualized environment; where some of the IO capabilities behind the scenes were being regulated to 400Mbps per node when running in the 2 core configuration vs 1Gbps on the 8 core. So I would expect the non-throttled scenario to work even better. ~Brad -Original Message- From: praveenesh kumar [mailto:praveen...@gmail.com] Sent: Monday, December 12, 2011 8:51 PM To: common-user@hadoop.apache.org Subject: More cores Vs More Nodes ? Hey Guys, So I have a very naive question in my mind regarding Hadoop cluster nodes ?
more cores or more nodes - Shall I spend money on going from 2 to 4 core machines, or spend money on buying more nodes with fewer cores, e.g. 2 machines of 2 cores each? Thanks, Praveenesh
RE: More cores Vs More Nodes ?
Brian, I think you missed my point. The moment you go and design a cluster for a specific job, you end up getting fscked because there's another group who wants to use the shared resource for their job which could be orthogonal to the original purpose. It happens every day. This is why you have to ask if the cluster is being built for a specific purpose. Meaning answering the question 'Which of the following best describes your cluster: a) PoC b) Development c) Pre-prod d) Production e) Secondary/Backup' Note that sizing the cluster is a different matter. Meaning if you know you need a PB of storage, you're going to design the cluster differently because once you get to a certain size, you have to recognize that your clusters are going to have lots of disk, and require 10GbE just for the storage. Number of cores would be less of an issue, however again look at pricing. 2 socket 8 core Xeon MBs are currently at an optimal price point. And again this goes back to the point I was trying to make. You need to look beyond the number of cores as a determining factor. You go too small, you're going to take a hit because of the price/performance curve. (Remember that you have to consider Machine Room real estate. 100 2 core boxes take up much more space than 25 8 core boxes) If you go to the other extreme... a 64 core giant SMP box: $ for $, you could build out an 8 node cluster for less money. Beyond that, you really, really don't want to build a custom cluster for a specific job unless you know that you're going to be running that specific job or set of jobs (24x7x365) [And yes, I came across such a use case...] HTH -Mike From: bbock...@cse.unl.edu Subject: Re: More cores Vs More Nodes ? Date: Wed, 14 Dec 2011 07:41:25 -0600 To: common-user@hadoop.apache.org Actually, there are varying degrees here. If you have a successful project, you will find other groups at your door wanting to use the cluster too. Their jobs might be different from the original use case. However, if you don't understand the original use case (CPU heavy or storage heavy? is a great beginning question), your original project won't be successful. Then there will be no follow-up users because you failed. So, you want to have a reasonably general-purpose cluster, but make sure it matches well with the type of jobs. As an example, we had one group who required an estimated CPU-millennia per byte of data... they needed a general purpose cluster for a certain value of general purpose. Brian
RE: More cores Vs More Nodes ?
Tommy, Again, I think you need to really have some real world experience before you make generalizations like that. Sorry, but at a client, we put 6 different groups' applications in production. Without going in to detail the jobs in production were orthogonal to one another. The point is that were we to build our cluster optimized to one job we would have been screwed. Oh wait, I forgot that you worked for IBM and they would love to sell you more hardware and consulting to improve the situation... (I kee-id, I kee-id) Now Seriously, The point of this discussion is that you really, really don't want to build the cluster optimized for a single job. The only time you want to do that is if you have a job or set of jobs that you plan on running every day 24x7 and the job takes the entire cluster. Yes, such jobs do exist. However they are highly irregular and definitely not the norm. One of the other pain points is that developers have to get used to the cluster as a shared resource to be used between different teams. This helps to defer the costs including maintenance. So as a shared resource, development and production, you need to build out a box that handles everything equally. Had you attended our session at Hadoop World, not only would you have learned this... (Don't tune the cluster to the application, but tune the application to the cluster) I would have also poked fun of you in person. ;-) We also talked about avoiding the internet myths and 'truisms'. Unless you've had your hands dirty and at customer's sites you're going to find the real world is a different place. ;-) But hey! What do I know? To: common-user@hadoop.apache.org Subject: RE: More cores Vs More Nodes ? From: tdeut...@us.ibm.com Date: Wed, 14 Dec 2011 07:56:30 -0800 Putting aside any smarmy responses for a moment - sorry that job(s) wasn't understood as equating to purpose. If you are building a general purpose sandbox then I think we all agree on building a balanced general purpose cluster. But if you have production use cases in mind then you darn well better try to understand how the cluster will be used/stressed so you don't end up with a hardware spec that doesn't match how the cluster is actually used. If you can't profile a production use case as to how it will stress the cluster that is a huge warning sign as to project risk. If you are tearing down and re-purposing a cluster that was implemented to support a production use case then the planning failed. Tom Deutsch Program Director Information Management Big Data Technologies IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com
RE: Regarding loading a big XML file to HDFS
Just wanted to address this: Basically in my mapreduce program i am expecting a complete XML as my input. i have a CustomReader (for XML) in my mapreduce job configuration. My main confusion is, if the namenode distributes data to DataNodes, there is a chance that a part of the xml can go to one data node and the other half can go to another datanode. If that is the case, will my custom XMLReader in the mapreduce be able to combine it (as mapreduce reads data locally only). Please help me on this? if you can not do anything parallel here, make your input split size to cover the complete file size. also configure the block size to cover the complete file size. In this case, only one mapper and reducer will be spawned for the file. But here you won't get any parallel processing advantage. You can do this in parallel. You need to write a custom input format class. (Which is what you're already doing...) Let's see if I can explain this correctly. You have an XML record split across block A and block B. Your map reduce job will instantiate a task per block. So in the mapper processing block A, you read and process the XML records... when you get to the last record, which is only partly in A, mapper A will continue on to block B and continue reading the last record. Then it stops. In the mapper for block B, the reader will skip and not process data until it sees the start of a record. So you end up getting all of your XML records processed (no duplication) and done in parallel. Does that make sense? -Mike Date: Tue, 22 Nov 2011 03:08:20 + From: mahesw...@huawei.com Subject: RE: Regarding loading a big XML file to HDFS To: common-user@hadoop.apache.org; core-u...@hadoop.apache.org Also I am surprised at how you are writing the mapreduce application here. Map and reduce will work with key value pairs. From: Uma Maheswara Rao G Sent: Tuesday, November 22, 2011 8:33 AM To: common-user@hadoop.apache.org; core-u...@hadoop.apache.org Subject: RE: Regarding loading a big XML file to HDFS __ From: hari708 [hari...@gmail.com] Sent: Tuesday, November 22, 2011 6:50 AM To: core-u...@hadoop.apache.org Subject: Regarding loading a big XML file to HDFS Hi, I have a big file consisting of XML data. The XML is not represented as a single line in the file. if we stream this file using the ./hadoop dfs -put command to a hadoop directory, how does the distribution happen? HDFS will divide the blocks based on your block size configured for the file. Basically in my mapreduce program i am expecting a complete XML as my input. i have a CustomReader (for XML) in my mapreduce job configuration. My main confusion is, if the namenode distributes data to DataNodes, there is a chance that a part of the xml can go to one data node and the other half can go to another datanode. If that is the case, will my custom XMLReader in the mapreduce be able to combine it (as mapreduce reads data locally only). Please help me on this? if you can not do anything parallel here, make your input split size to cover the complete file size. also configure the block size to cover the complete file size. In this case, only one mapper and reducer will be spawned for the file. But here you won't get any parallel processing advantage. -- View this message in context: http://old.nabble.com/Regarding-loading-a-big-XML-file-to-HDFS-tp32871900p32871900.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
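As a concrete illustration of the simpler of the two options above (Uma's one-mapper-per-file fallback), here is a minimal sketch assuming the new mapreduce API; the class name is hypothetical. The parallel approach Mike describes instead needs a custom record reader that reads past the end of its split to finish the last record and, on every split after the first, skips forward to the first record start.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Refuses to split files, so a single mapper reads each XML file end to end and
// no record can be cut in half by a split boundary. The price is that you lose
// parallelism within a file, as noted above.
public class WholeXmlFileInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}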
RE: Matrix multiplication in Hadoop
Ok Mike, First I admire that you are studying Hadoop. To answer your question... not well. Might I suggest that if you want to learn Hadoop, you try and find a problem which can easily be broken in to a series of parallel tasks where there is minimal communication requirements between each task? No offense, but if I could make a parallel... what you're asking is akin to taking a normalized relational model and trying to run it as is in HBase. Yes it can be done. But not the best use of resources. To: common-user@hadoop.apache.org CC: common-user@hadoop.apache.org Subject: Re: Matrix multiplication in Hadoop From: mspre...@us.ibm.com Date: Fri, 18 Nov 2011 12:39:00 -0500 That's also an interesting question, but right now I am studying Hadoop and want to know how well dense MM can be done in Hadoop. Thanks, Mike From: Michel Segel michael_se...@hotmail.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Date: 11/18/2011 12:34 PM Subject:Re: Matrix multiplication in Hadoop Is Hadoop the best tool for doing large matrix math. Sure you can do it, but, aren't there better tools for these types of problems? Sent from a remote device. Please excuse any typos... Mike Segel
RE: mapr common library?
Not sure what you mean. But w MapR, you have everything under the /opt/mapr/hadoop/hadoop* tree. There you'll see the core library and also the other ./contrib, etc. directories. Sorry I'm not more specific... I'm going from memory. HTH -Mike Date: Wed, 19 Oct 2011 20:06:50 -0700 Subject: mapr common library? From: alexgauthie...@gmail.com To: common-user@hadoop.apache.org Is there such a thing somewhere? I have the basic nPath, lucene-like search processing but am looking for ETL like transformations, a typical weblog processor or clickstream. Anything beyond wordcount would be appreciated :) GodSpeed. Alex http://twitter.com/#!/A23Corp
RE: Can we replace namenode machine with some other machine ?
I agree w Steve except on one thing... RAID 5 Bad. RAID 10 (1+0) good. Sorry this goes back to my RDBMs days where RAID 5 will kill your performance and worse... Date: Thu, 22 Sep 2011 11:28:39 +0100 From: ste...@apache.org To: common-user@hadoop.apache.org Subject: Re: Can we replace namenode machine with some other machine ? On 22/09/11 05:42, praveenesh kumar wrote: Hi all, Can we replace our namenode machine later with some other machine. ? Actually I got a new server machine in my cluster and now I want to make this machine as my new namenode and jobtracker node ? Also Does Namenode/JobTracker machine's configuration needs to be better than datanodes/tasktracker's ?? 1. I'd give it lots of RAM - holding data about many files, avoiding swapping, etc. 2. I'd make sure the disks are RAID5, with some NFS-mounted FS that the secondary namenode can talk to. avoids risk of loss of the index, which, if it happens, renders your filesystem worthless. If I was really paranoid I'd have twin raid controllers with separate connections to disk arrays in separate racks, as [Jiang2008] shows that interconnect problems on disk arrays can be higher than HDD failures. 3. if your central switches are at 10 GbE, consider getting a 10GbE NIC and hooking it up directly -this stops the network being the bottleneck, though it does mean the server can have a lot more packets hitting it, so putting more load on it. 4. Leave space for a second CPU and time for GC tuning. JT's are less important; they need RAM but use HDFS for storage. If your cluster is small, NN and JT can be run locally. If you do this, set up DNS to have two hostnames to point to same network address. Then if you ever split them off, everyone whose bookmark says http://jobtracker won't notice Either way: the NN and the JT are the machines whose availability you care about. The rest is just a source of statistics you can look at later. -Steve [Jiang2008] Are disks the dominant contributor for storage failures?: A comprehensive study of storage subsystem failure characteristics. ACM Transactions on Storage.
RE: Can we replace namenode machine with some other machine ?
Well you could do RAID 1 which is just mirroring. I don't think you need to do any raid 0 or raid 5 (striping) to get better performance. Also if you're using a 1U box, you just need 2 SATA drives internal and then NFS mount a drive from your SN for your backup copy... Date: Thu, 22 Sep 2011 17:18:55 +0100 From: ste...@apache.org To: common-user@hadoop.apache.org Subject: Re: Can we replace namenode machine with some other machine ? On 22/09/11 17:13, Michael Segel wrote: I agree w Steve except on one thing... RAID 5 Bad. RAID 10 (1+0) good. Sorry this goes back to my RDBMs days where RAID 5 will kill your performance and worse... sorry, I should have said RAID =5. The main thing is you don't want the NN data lost. ever
RE: risks of using Hadoop
Tom, Normally someone who has a personal beef with someone will take it offline and deal with it. Clearly manners aren't your strong point... unfortunately making me respond to you in public. Since you asked, no, I don't have any beefs with IBM. In fact, I happen to have quite a few friends within IBM's IM pillar. (although many seem to taking Elvis' advice and left the building...) What I do have a problem with is you and your response to the posts in this thread. Its bad enough that you really don't know what you're talking about. But this is compounded by the fact that your posts end with your job title seems to indicate that you are a thought leader from a well known, brand name company. So unlike some schmuck off the street, because of your job title, someone may actually pay attention to you and take what you say at face value. The issue at hand is that the OP wanted to know the risks so that he can address them to give his pointy haired stake holders a warm fuzzy feeling. SPOF isn't a risk, but a point of FUD that is constantly being brought out by people who have an alternative that they wanted to promote. Brian pretty much put it in to perspective. You attempted to correct him, and while Brian was polite, I'm not. Why? Because I happen to know of enough people who still think that what BS IBM trots out must be true and taken at face value. I think you're more concerned with making an appearance than you are with anyone having a good experience. No offense, but again, you're not someone who has actual hands on experience so you're not in position to give advice. I don't know to write what you say out of being arrogant, but I have to wonder if you actually paid attention in your SSM class. Raising FUD and non issues as risk doesn't help anyone promote Hadoop, regardless of the vendor. What it does is cause the stakeholders reason to pause. Overstating risks can cause just as much harm as over promising results. Again, its Sales 101. Perhaps you're still trying to convert these folks off Hadoop on to IBM's DB2? No wait, that was someone else... and it wasn't Hadoop, it was Informix. (Sorry to the list, that was an inside joke that probably went over Tom's head, but for someone's benefit.) To help drill the point of the issue home... 1) Look at MapR, an IBM competitor who's derivative already solves this SPOF problem. 2) Look at how to set up a cluster (Apache, HortonWorks, Cloudera) where you can mitigate this by your node configuration along with simple sysadmin tricks like NFS mounting a drive from a different machine within the cluster (Preferably a different rack for a back up.) 3) Think about your backup and recovery of your Name Node's files. There's more, and I would encourage you to actually talk to a professional before giving out advice. ;-) HTH -Mike PS. My last PS talked about the big power switch in a switch box in the machine room that cuts the power. (When its a lever, do you really need to tell someone that its not a light switch? And I guess you could padlock it too) Seriously, there is more risk to data loss and corruption based on luser issues than there is of a SPOF (NN failure). To: common-user@hadoop.apache.org Subject: RE: risks of using Hadoop From: tdeut...@us.ibm.com Date: Wed, 21 Sep 2011 06:20:53 -0700 I am truly sorry if at some point in your life someone dropped an IBM logo on your head and it left a dent - but you are being a jerk. 
Right after you were engaging in your usual condescension a person from Xerox posted on the very issue you were blowing off. Things happen. To any system. I'm not knocking Hadoop - and frankly making sure new users have a good experience based on the real things that need to be aware of / manage is in everyone's interests here to grow the footprint. Please take note that no where in here have I ever said anything to discourage Hadoop deployments/use or anything that is vendor specific. Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Michael Segel michael_se...@hotmail.com 09/20/2011 02:52 PM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject RE: risks of using Hadoop Tom, I think it is arrogant to parrot FUD when you've never had your hands dirty in any real Hadoop environment. So how could your response reflect the operational realities of running a Hadoop cluster? What Brian was saying was that the SPOF is an over played FUD trump card. Anyone who's built clusters will have mitigated the risks of losing the NN. Then there's MapR... where you don't have a SPOF. But again that's a derivative of Apache Hadoop. (Derivative isn't a bad thing...) You're right that you need
RE: risks of using Hadoop
Kobina The points 1 and 2 are definitely real risks. SPOF is not. As I pointed out in my mini-rant to Tom was that your end users / developers who use the cluster can do more harm to your cluster than a SPOF machine failure. I don't know what one would consider a 'long learning curve'. With the adoption of any new technology, you're talking at least 3-6 months based on the individual and the overall complexity of the environment. Take anyone who is a strong developer, put them through Cloudera's training, plus some play time, and you've shortened the learning curve. The better the java developer, the easier it is for them to pick up Hadoop. I would also suggest taking the approach of hiring a senior person who can cross train and mentor your staff. This too will shorten the runway. HTH -Mike Date: Wed, 21 Sep 2011 17:02:45 +0100 Subject: Re: risks of using Hadoop From: kobina.kwa...@gmail.com To: common-user@hadoop.apache.org Jignesh, Will your point 2 still be valid if we hire very experienced Java programmers? Kobina. On 20 September 2011 21:07, Jignesh Patel jign...@websoft.com wrote: @Kobina 1. Lack of skill set 2. Longer learning curve 3. Single point of failure @Uma I am curious to know about .20.2 is that stable? Is it same as the one you mention in your email(Federation changes), If I need scaled nameNode and append support, which version I should choose. Regarding Single point of failure, I believe Hortonworks(a.k.a Yahoo) is updating the Hadoop API. When that will be integrated with Hadoop. If I need -Jignesh On Sep 17, 2011, at 12:08 AM, Uma Maheswara Rao G 72686 wrote: Hi Kobina, Some experiences which may helpful for you with respective to DFS. 1. Selecting the correct version. I will recommend to use 0.20X version. This is pretty stable version and all other organizations prefers it. Well tested as well. Dont go for 21 version.This version is not a stable version.This is risk. 2. You should perform thorough test with your customer operations. (of-course you will do this :-)) 3. 0.20x version has the problem of SPOF. If NameNode goes down you will loose the data.One way of recovering is by using the secondaryNameNode.You can recover the data till last checkpoint.But here manual intervention is required. In latest trunk SPOF will be addressed bu HDFS-1623. 4. 0.20x NameNodes can not scale. Federation changes included in latest versions. ( i think in 22). this may not be the problem for your cluster. But please consider this aspect as well. 5. Please select the hadoop version depending on your security requirements. There are versions available for security as well in 0.20X. 6. If you plan to use Hbase, it requires append support. 20Append has the support for append. 0.20.205 release also will have append support but not yet released. Choose your correct version to avoid sudden surprises. Regards, Uma - Original Message - From: Kobina Kwarko kobina.kwa...@gmail.com Date: Saturday, September 17, 2011 3:42 am Subject: Re: risks of using Hadoop To: common-user@hadoop.apache.org We are planning to use Hadoop in my organisation for quality of servicesanalysis out of CDR records from mobile operators. We are thinking of having a small cluster of may be 10 - 15 nodes and I'm preparing the proposal. my office requires that i provide some risk analysis in the proposal. thank you. On 16 September 2011 20:34, Uma Maheswara Rao G 72686 mahesw...@huawei.comwrote: Hello, First of all where you are planning to use Hadoop? 
Regards, Uma - Original Message - From: Kobina Kwarko kobina.kwa...@gmail.com Date: Saturday, September 17, 2011 0:41 am Subject: risks of using Hadoop To: common-user common-user@hadoop.apache.org Hello, Please can someone point some of the risks we may incur if we decide to implement Hadoop? BR, Isaac.
RE: risks of using Hadoop
Tom, I think it is arrogant to parrot FUD when you've never had your hands dirty in any real Hadoop environment. So how could your response reflect the operational realities of running a Hadoop cluster? What Brian was saying was that the SPOF is an over played FUD trump card. Anyone who's built clusters will have mitigated the risks of losing the NN. Then there's MapR... where you don't have a SPOF. But again that's a derivative of Apache Hadoop. (Derivative isn't a bad thing...) You're right that you need to plan accordingly, however from risk perspective, this isn't a risk. In fact, I believe Tom White's book has a good layout to mitigate this and while I have First Ed, I'll have to double check the second ed to see if he modified it. Again, the point Brian was making and one that I agree with is that the NN as a SPOF is an overblown 'risk'. You have a greater chance of data loss than you do of losing your NN. Probably the reason why some of us are a bit irritated by the SPOF reference to the NN is that its clowns who haven't done any work in this space, pick up on the FUD and spread it around. This makes it difficult for guys like me from getting anything done because we constantly have to go back and reassure stake holders that its a non-issue. With respect to naming vendors, I did name MapR outside of Apache because they do have their own derivative release that improves upon the limitations found in Apache's Hadoop. -Mike PS... There's this junction box in your machine room that has this very large on/off switch. If pulled down, it will cut power to your cluster and you will lose everything. Now would you consider this a risk? Sure. But is it something you should really lose sleep over? Do you understand that there are risks and there are improbable risks? To: common-user@hadoop.apache.org Subject: RE: risks of using Hadoop From: tdeut...@us.ibm.com Date: Tue, 20 Sep 2011 12:48:05 -0700 No worries Michael - it would be stretch to see any arrogance or disrespect in your response. Kobina has asked a fair question, and deserves a response that reflects the operational realities of where we are. If you are looking at doing large scale CDR handling - which I believe is the use case here - you need to plan accordingly. Even you use the term mitigate - which is different than prevent. Kobina needs an understanding of that they are looking at. That isn't a pro/con stance on Hadoop, it is just reality and they should plan accordingly. (Note - I'm not the one who brought vendors into this - which doesn't strike me as appropriate for this list) Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Michael Segel michael_se...@hotmail.com 09/17/2011 07:37 PM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject RE: risks of using Hadoop Gee Tom, No disrespect, but I don't believe you have any personal practical experience in designing and building out clusters or putting them to the test. Now to the points that Brian raised.. 1) SPOF... it sounds great on paper. Some FUD to scare someone away from Hadoop. But in reality... you can mitigate your risks by setting up raid on your NN/HM node. You can also NFS mount a copy to your SN (or whatever they're calling it these days...) Or you can go to MapR which has redesigned HDFS which removes this problem. But with your Apache Hadoop or Cloudera's release, losing your NN is rare. 
Yes it can happen, but not your greatest risk. (Not by a long shot) 2) Data Loss. You can mitigate this as well. Do I need to go through all of the options and DR/BCP planning? Sure there's always a chance that you have some Luser who does something brain dead. This is true of all databases and systems. (I know I can probably recount some of IBM's Informix and DB2 having data loss issues. But that's a topic for another time. ;-) I can't speak for Brian, but I don't think he's trivializing it. In fact I think he's doing a fine job of level setting expectations. And if you talk to Ted Dunning of MapR, I'm sure he'll point out that their current release does address points 3 and 4 again making their risks moot. (At least if you're using MapR) -Mike Subject: Re: risks of using Hadoop From: tdeut...@us.ibm.com Date: Sat, 17 Sep 2011 17:38:27 -0600 To: common-user@hadoop.apache.org I disagree Brian - data loss and system down time (both potentially non-trival) should not be taken lightly. Use cases and thus availability requirements do vary, but I would not encourage anyone to shrug them off as overblown, especially as Hadoop become more production oriented in utilization
RE: Using HBase for real time transaction
Since Tom isn't technical... ;-) The short answer is No. HBase is not capable of being a transactional database because it doesn't support transactions. Nor is HBase ACID compliant. Having said that, yes you can use HBase to serve data in real time. HTH -Mike Subject: Re: Using HBase for real time transaction From: jign...@websoft.com Date: Tue, 20 Sep 2011 17:25:17 -0400 To: common-user@hadoop.apache.org Tom, Let me reword: can HBase be used as a transactional database (i.e. as a replacement for mysql)? The requirement is to have real time read and write operations. I mean as soon as data is written the user should see the data (here the data should be written in Hbase). -Jignesh On Sep 20, 2011, at 5:11 PM, Tom Deutsch wrote: Real-time means different things to different people. Can you share your latency requirements from the time the data is generated to when it needs to be consumed, or how you are thinking of using Hbase in the overall flow? Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Jignesh Patel jign...@websoft.com 09/20/2011 12:57 PM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject Using HBase for real time transaction We are exploring the possibility of using HBase for real time transactions. Is that possible? -Jignesh
RE: Using HBase for real time transaction
Date: Tue, 20 Sep 2011 15:05:31 -0700 Subject: Re: Using HBase for real time transaction From: jdcry...@apache.org To: common-user@hadoop.apache.org While HBase isn't ACID-compliant, it does have some guarantees: http://hbase.apache.org/acid-semantics.html J-D I think there has to be some clarification. The OP was asking about a mySQL replacement. HBase will never be an RDBMS replacement. No Transactions means no way of doing OLTP. It's the wrong tool for that type of work. Sure I know I can kludge something but it's not worth the effort. Choose a better tool like a real database... e.g. Informix. Recognize what HBase is and what it is not. This doesn't mean you can't take in or deliver data in real time, it can. So if you want to use it in a real time manner, sure. Note that like with other databases, you will have to do some work to handle real time data. I guess you would have to provide a specific use case on what you want to achieve in order to know if it's a good fit. HTH -Mike On Tue, Sep 20, 2011 at 2:56 PM, Michael Segel michael_se...@hotmail.com wrote: Since Tom isn't technical... ;-) The short answer is No. HBase is not capable of being a transactional database because it doesn't support transactions. Nor is HBase ACID compliant. Having said that, yes you can use HBase to serve data in real time. HTH -Mike
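For what it is worth, here is a minimal, hedged sketch (the table, family and qualifier names are hypothetical) using the HBase client API of that era. It shows the kind of guarantee HBase does offer, an atomic check-and-put on a single row, and also where the guarantees stop: there are no multi-row or multi-table transactions, which is why it is not a drop-in replacement for mysql-style OLTP.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleRowAtomicity {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "accounts");   // hypothetical table

        // The Put is applied only if the current value of cf:balance is "100";
        // the check and the write are atomic, but only within this one row.
        Put put = new Put(Bytes.toBytes("user-42"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("balance"), Bytes.toBytes("90"));
        boolean applied = table.checkAndPut(Bytes.toBytes("user-42"),
                Bytes.toBytes("cf"), Bytes.toBytes("balance"),
                Bytes.toBytes("100"), put);

        System.out.println("update applied: " + applied);
        table.close();
    }
}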
RE: risks of using Hadoop
Gee Tom, No disrespect, but I don't believe you have any personal practical experience in designing and building out clusters or putting them to the test. Now to the points that Brian raised.. 1) SPOF... it sounds great on paper. Some FUD to scare someone away from Hadoop. But in reality... you can mitigate your risks by setting up raid on your NN/HM node. You can also NFS mount a copy to your SN (or whatever they're calling it these days...) Or you can go to MapR which has redesigned HDFS which removes this problem. But with your Apache Hadoop or Cloudera's release, losing your NN is rare. Yes it can happen, but not your greatest risk. (Not by a long shot) 2) Data Loss. You can mitigate this as well. Do I need to go through all of the options and DR/BCP planning? Sure there's always a chance that you have some Luser who does something brain dead. This is true of all databases and systems. (I know I can probably recount some of IBM's Informix and DB2 having data loss issues. But that's a topic for another time. ;-) I can't speak for Brian, but I don't think he's trivializing it. In fact I think he's doing a fine job of level setting expectations. And if you talk to Ted Dunning of MapR, I'm sure he'll point out that their current release does address points 3 and 4 again making their risks moot. (At least if you're using MapR) -Mike Subject: Re: risks of using Hadoop From: tdeut...@us.ibm.com Date: Sat, 17 Sep 2011 17:38:27 -0600 To: common-user@hadoop.apache.org I disagree Brian - data loss and system down time (both potentially non-trival) should not be taken lightly. Use cases and thus availability requirements do vary, but I would not encourage anyone to shrug them off as overblown, especially as Hadoop become more production oriented in utilization. --- Sent from my Blackberry so please excuse typing and spelling errors. - Original Message - From: Brian Bockelman [bbock...@cse.unl.edu] Sent: 09/17/2011 05:11 PM EST To: common-user@hadoop.apache.org Subject: Re: risks of using Hadoop On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote: Hi Kobina, Some experiences which may helpful for you with respective to DFS. 1. Selecting the correct version. I will recommend to use 0.20X version. This is pretty stable version and all other organizations prefers it. Well tested as well. Dont go for 21 version.This version is not a stable version.This is risk. 2. You should perform thorough test with your customer operations. (of-course you will do this :-)) 3. 0.20x version has the problem of SPOF. If NameNode goes down you will loose the data.One way of recovering is by using the secondaryNameNode.You can recover the data till last checkpoint.But here manual intervention is required. In latest trunk SPOF will be addressed bu HDFS-1623. 4. 0.20x NameNodes can not scale. Federation changes included in latest versions. ( i think in 22). this may not be the problem for your cluster. But please consider this aspect as well. With respect to (3) and (4) - these are often completely overblown for many Hadoop use cases. If you use Hadoop as originally designed (large scale batch data processing), these likely don't matter. If you're looking at some of the newer use cases (low latency stuff or time-critical processing), or if you architect your solution poorly (lots of small files), these issues become relevant. Another case where I see folks get frustrated is using Hadoop as a plain old batch system; for non-data workflows, it doesn't measure up against specialized systems. 
You really want to make sure that Hadoop is the best tool for your job. Brian
RE: risks of using Hadoop
Risks? Well if you come to Hadoop World in Nov, we actually have a presentation that might help reduce some of your initial risks. There are always risks when starting a new project. Regardless of the underlying technology, you have costs associated with failure and unless you can level set expectations you'll increase your odds of failure. Best advice... don't listen to sales critters or marketing folks. ;-) [Right Tom?] They have an agenda. ;-) Date: Fri, 16 Sep 2011 20:11:20 +0100 Subject: risks of using Hadoop From: kobina.kwa...@gmail.com To: common-user@hadoop.apache.org Hello, Please can someone point some of the risks we may incur if we decide to implement Hadoop? BR, Isaac.
RE: Hadoop multi tier backup
Matthew, the short answer is hire a consultant to work with you on your DR/BCP strategy. :-) Short of that... you have a couple of things... Your back-up cluster, is it in the same site? (What happens when site goes down?) Are you planning to make your back up cluster and main cluster homogenous? By this I mean if your main cluster has 1PB of disk w 4x2TB or 4x3TB drives, will your backup cluster have the same configuration? (You may want to consider asymmetry in designing your clusters) So your backup cluster has fewer nodes but more drives per node. You also have to look at your data. Are your data sets small and discrete? If so, you could probably back them up to tape, (snapshots) , just in case of human error and you didn't catch it in time and the error gets propagated to your backup cluster. I haven't played with fuse, so I don't know if there are any performance issues, but on a back up cluster, I don't think its much of an issue. From: matthew.go...@monsanto.com To: common-user@hadoop.apache.org; cdh-u...@cloudera.org Subject: Hadoop multi tier backup Date: Tue, 30 Aug 2011 16:54:07 + All, We were discussing how we would backup our data from the various environments we will have and I was hoping someone could chime in with previous experience in this. My primary concern about our cluster is that we would like to be able to recover anything within the last 60 days so having full backups both on tape and through distcp is preferred. Out initial thoughts can be seen in the jpeg attached but just in case any of you are weary of attachments it can also be summarized below: Prod Cluster --DistCp-- On-site Backup cluster with Fuse mount point running NetBackup daemon --NetBackup-- Media Server -- Tape One of our biggest grey areas so far is how do most people accomplish incremental backups? Our thought was to tie this into our NetBackup configuration as this can be done for other connectors but we do not see anything for HDFS yet. Thanks, Matt This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled to receive such information. If you have received this e-mail in error, please notify the sender immediately. Please delete it and all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited. All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of Viruses or other Malware. Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying this e-mail or any attachment. The information contained in this email may be subject to the export control laws and regulations of the United States, potentially including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of Treasury, Office of Foreign Asset Controls (OFAC). As a recipient of this information you are obligated to comply with all applicable U.S. export laws and regulations.
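For what it's worth, the distcp leg of that pipeline can also be driven from Java rather than the shell. Below is a minimal sketch, not a definitive recipe: it assumes the 0.20-era org.apache.hadoop.tools.DistCp class (from the Hadoop tools jar) with its Configuration constructor, and the two cluster URIs are placeholders. The -update flag only copies files that are missing or changed on the target, which is the usual building block for incremental distcp runs; HDFS itself has no change log to drive a NetBackup-style incremental, so most people approximate it by partitioning data by date and copying only the new directories.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.util.ToolRunner;

public class BackupCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String src = "hdfs://prod-nn:8020/data";    // placeholder: production cluster
        String dst = "hdfs://backup-nn:8020/data";  // placeholder: on-site backup cluster
        // -update skips files that are already present and unchanged on the target
        int rc = ToolRunner.run(new DistCp(conf), new String[] { "-update", src, dst });
        System.exit(rc);
    }
}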
RE: Hadoop--store a sequence file in distributed cache?
Sofia, I was about to say that if your file is already on hdfs, you should just be able to open it. But as I type this, I have this thing kicking me in the back of the head reminding me that you may not be able to access the hdfs file at the same time someone else is accessing it? (Going from memory, is there an exclusive lock on the file when you open it in HDFS?) If not, you can just use your file. If so, you will need to use distributed cache which copies a copy of the file to some place local on each node running the task. Within your task you need to query the distributed cache for your file and get the path to the file so you can open it. Depending on the size of your index... which can get large, you need to open the file once and just reset to the beginning of the file. My suggestion is to consider putting your RTree into HBase. So HBase contains your index. Date: Sat, 13 Aug 2011 03:02:32 -0700 From: geosofie_...@yahoo.com Subject: Re: Hadoop--store a sequence file in distributed cache? To: common-user@hadoop.apache.org Good morning, I am a little confused, I have to say. A summury of the project first: I want to examine how an Rtree on HDFS would speed up spatial queries like point/range queries, that normally target a very small part of the original input. I have built my Rtree on HDFS, and now I need to answer queries using it. I thought I could make an MR Job that takes as input a text file where each line is a query (for example we have 2 queries). To answer the queries efficiently, I need to check some information about the root nodes of the tree, which are stored in R files (R=the #reducers of the previous job). These files are small in size and are read from every mapper, thus the idea of distributed cache fits, right? I have built an ArrayList during setup() to avoid opening all the files in distributed cache, and open only 3-4 of them for example. I agree, though, that opening and closing these files so many times is an important overhead. I think however, that opening these files from HDFS rather than distributed cache would be even worse, since the file accessing operations in HDFS are much more expensive than accessing files locally. Thank you all for your response, I would be glad to have more feedback. Sofia From: GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Friday, August 12, 2011 7:05 PM Subject: RE: Hadoop--store a sequence file in distributed cache? Sofia correct me if I am wrong, but Mike I think this thread was about using the output of a previous job, in this case already in sequence file format, as in memory join data for another job. Side note: does anyone know what the rule of thumb on file size is when using the distributed cache vs just reading from HDFS (join data not binary files)? I always thought that having a setup phase on a mapper read directly from HDFS was a asking for trouble and that you should always distribute to each node but I am hearing more and more people say to just read directly from HDFS for larger file sizes to avoid the IO cost of the distributed cache. Matt -Original Message- From: Ian Michael Gumby [mailto:michael_se...@hotmail.com] Sent: Friday, August 12, 2011 10:54 AM To: common-user@hadoop.apache.org Subject: RE: Hadoop--store a sequence file in distributed cache? This whole thread doesn't make a lot of sense. 
If your first m/r job creates the sequence files, which you then use as input files to your second job, you don't need to use the distributed cache, since the output of the first m/r job is going to be in HDFS. (Dino is correct on that account.) Sofia replied saying that she needed to open and close the sequence file to access the data in each Mapper.map() call. Without knowing more about the specific app, Ashook is correct that you could read the file in Mapper.setup() and then access it in memory. Joey is correct that you can put anything in the distributed cache, but you don't want to put an HDFS file into the distributed cache. The distributed cache is a tool for taking something from your job and distributing it to each task tracker node as a local object. It does have a bit of overhead. A better example is if you're distributing binary objects that you want on each node, e.g. a C++ .so file that you want to call from within your Java m/r. If you're not using all of the data in the sequence file, what about using HBase? From: ash...@clearedgeit.com To: common-user@hadoop.apache.org Date: Fri, 12 Aug 2011 09:06:39 -0400 Subject: RE: Hadoop--store a sequence file in distributed cache? If you are looking for performance gains, then possibly reading these files once during the setup() call in your Mapper and storing them in some data structure like a Map or a List will
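To make the read-once suggestion concrete, here is a minimal sketch using the 0.20 mapreduce API. The path and the Text key/value types are assumptions (Sofia's actual R-tree record types weren't posted); the point is simply that the small root-node file is opened from HDFS once in setup() and kept in memory, instead of being reopened in every map() call.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class QueryMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> rootNodes = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        Path rootFile = new Path("/user/sofia/rtree/part-r-00000"); // placeholder path
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, rootFile, conf);
        Text key = new Text();
        Text value = new Text();
        while (reader.next(key, value)) {   // the file is small, so read it all once
            rootNodes.put(key.toString(), value.toString());
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable offset, Text query, Context context)
            throws IOException, InterruptedException {
        // each input line is one query; answer it against the in-memory root index
        // (lookup logic omitted; it depends entirely on the R-tree layout)
    }
}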
RE: Speed up node under replicated block during decommission
Just a thought... The really quick and dirty thing to do is to turn off the node. Within 10 minutes the node looks down to the JT and NN, so it gets marked as down. Run an fsck and it will show the files as under-replicated, and the cluster will then re-replicate at the faster speed to rebalance. (100MB/sec should be ok on a 1GbE link.) Then you can drop the next node... much faster than trying to decommission the node. It's not the best way to do it, but it works. From: ha...@cloudera.com Date: Fri, 12 Aug 2011 22:38:08 +0530 Subject: Re: Speed up node under replicated block during decommission To: common-user@hadoop.apache.org It could be that your process has hung because a particular resident block (file) requires a very large replication factor, and your remaining # of nodes is less than that value. This is a genuine reason for a hang (but must be fixed). The process usually waits until there are no under-replicated blocks, so I'd use fsck to check if any such ones are present and setrep them to a lower value. On Fri, Aug 12, 2011 at 9:28 PM, jonathan.hw...@accenture.com wrote: Hi All, I'm trying to decommission a data node from my cluster. I put the data node in the /usr/lib/hadoop/conf/dfs.hosts.exclude list and restarted the name nodes. The under-replicated blocks are starting to replicate, but it's going at a very slow pace. For 1 TB of data it takes over 1 day to complete. We changed the settings as below to try to increase the replication rate. Added this to hdfs-site.xml on all the nodes on the cluster and restarted the data nodes and name node processes.
<property>
  <!-- 100Mbit/s -->
  <name>dfs.balance.bandwidthPerSec</name>
  <value>131072000</value>
</property>
Speed didn't seem to pick up. Do you know what may be happening? Thanks! Jonathan -- Harsh J
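If fsck does turn up a file whose replication factor is higher than what the shrinking cluster can satisfy, the setrep suggestion above can also be done programmatically. A minimal sketch only; the path and the new factor of 2 are placeholders, not a recommendation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // lower the file to 2 replicas so decommissioning isn't stuck waiting on a
        // replication factor the remaining nodes can never reach
        boolean ok = fs.setReplication(new Path("/data/overreplicated/file"), (short) 2);
        System.out.println("setReplication returned " + ok);
    }
}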
RE: Moving Files to Distributed Cache in MapReduce
Yeah, I'll write something up and post it on my web site. Definitely not InfoQ stuff, but a simple tip and tricks stuff. -Mike Subject: Re: Moving Files to Distributed Cache in MapReduce From: a...@apache.org Date: Sun, 31 Jul 2011 19:21:14 -0700 To: common-user@hadoop.apache.org We really need to build a working example to the wiki and add a link from the FAQ page. Any volunteers? On Jul 29, 2011, at 7:49 PM, Michael Segel wrote: Here's the meat of my post earlier... Sample code on putting a file on the cache: DistributedCache.addCacheFile(new URI(path+MyFileName,conf)); Sample code in pulling data off the cache: private Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration()); boolean exitProcess = false; int i=0; while (!exit){ fileName = localFiles[i].getName(); if (fileName.equalsIgnoreCase(model.txt)){ // Build your input file reader on localFiles[i].toString() exitProcess = true; } i++; } Note that this is SAMPLE code. I didn't trap the exit condition if the file isn't there and you go beyond the size of the array localFiles[]. Also I set exit to false because its easier to read this as Do this loop until the condition exitProcess is true. When you build your file reader you need the full path, not just the file name. The path will vary when the job runs. HTH -Mike From: michael_se...@hotmail.com To: common-user@hadoop.apache.org Subject: RE: Moving Files to Distributed Cache in MapReduce Date: Fri, 29 Jul 2011 21:43:37 -0500 I could have sworn that I gave an example earlier this week on how to push and pull stuff from distributed cache. Date: Fri, 29 Jul 2011 14:51:26 -0700 Subject: Re: Moving Files to Distributed Cache in MapReduce From: rogc...@ucdavis.edu To: common-user@hadoop.apache.org jobConf is deprecated in 0.20.2 I believe; you're supposed to be using Configuration for that On Fri, Jul 29, 2011 at 1:59 PM, Mohit Anchlia mohitanch...@gmail.comwrote: Is this what you are looking for? http://hadoop.apache.org/common/docs/current/mapred_tutorial.html search for jobConf On Fri, Jul 29, 2011 at 1:51 PM, Roger Chen rogc...@ucdavis.edu wrote: Thanks for the response! However, I'm having an issue with this line Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf); because conf has private access in org.apache.hadoop.configured On Fri, Jul 29, 2011 at 11:18 AM, Mapred Learn mapred.le...@gmail.com wrote: I hope my previous reply helps... On Fri, Jul 29, 2011 at 11:11 AM, Roger Chen rogc...@ucdavis.edu wrote: After moving it to the distributed cache, how would I call it within my MapReduce program? On Fri, Jul 29, 2011 at 11:09 AM, Mapred Learn mapred.le...@gmail.com wrote: Did you try using -files option in your hadoop jar command as: /usr/bin/hadoop jar jar name main class name -files absolute path of file to be added to distributed cache input dir output dir On Fri, Jul 29, 2011 at 11:05 AM, Roger Chen rogc...@ucdavis.edu wrote: Slight modification: I now know how to add files to the distributed file cache, which can be done via this command placed in the main or run class: DistributedCache.addCacheFile(new URI(/user/hadoop/thefile.dat), conf); However I am still having trouble locating the file in the distributed cache. *How do I call the file path of thefile.dat in the distributed cache as a string?* I am using Hadoop 0.20.2 On Fri, Jul 29, 2011 at 10:26 AM, Roger Chen rogc...@ucdavis.edu wrote: Hi all, Does anybody have examples of how one moves files from the local filestructure/HDFS to the distributed cache in MapReduce? 
A Google search turned up examples in Pig but not MR. -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center
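Since the archive above mangled the string literals in the earlier sample (and the loop variable), here is a cleaned-up, compilable version of the same idea, assuming the Hadoop 0.20 new-API Mapper; the file name model.txt and the HDFS path are placeholders. The driver registers the HDFS file with the distributed cache, and the task finds its local copy by name and opens it via its full local path.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

    // driver side: register an HDFS file with the distributed cache
    public static void addModelFile(Configuration conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/user/hadoop/model.txt"), conf);
    }

    // task side: locate the local copy by name, then open it with the full local path
    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private BufferedReader modelReader;

        @Override
        protected void setup(Context context) throws IOException {
            Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (localFiles == null) {
                throw new IOException("nothing found in the distributed cache");
            }
            for (Path p : localFiles) {
                if (p.getName().equalsIgnoreCase("model.txt")) {
                    modelReader = new BufferedReader(new FileReader(p.toString()));
                    break;
                }
            }
            if (modelReader == null) {
                throw new IOException("model.txt not found in the distributed cache");
            }
        }
    }
}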
RE: Hadoop cluster network requirement
Yeah, what he said. It's never a good idea. Forget about losing a NN or a rack; think about just losing connectivity between data centers. (It happens more than you think.) Your entire cluster in both data centers goes down. Boom! It's a bad design. You're better off running two different clusters. Is anyone really trying to sell this as a design? That's even more scary. Subject: Re: Hadoop cluster network requirement From: a...@apache.org Date: Sun, 31 Jul 2011 20:28:53 -0700 To: common-user@hadoop.apache.org; saq...@margallacomm.com On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote: Thanks, I'm independently doing some digging into Hadoop networking requirements and had a couple of quick follow-ups. Could I have some specific info on why different data centers cannot be supported for master node and data node comms? Also, what may be the benefits/use cases for such a scenario? Most people who try to put the NN and DNs in different data centers are trying to achieve disaster recovery: one file system in multiple locations. That isn't the way HDFS is designed and it will end in tears. There are multiple problems: 1) no guarantee that one block replica will be in each data center (thereby defeating the whole purpose!) 2) assuming one can work out problem 1, during a network break the NN will lose contact with one half of the DNs, causing a massive network replication storm 3) if one is using MR on top of this HDFS, the shuffle will likely kill the network in between (making MR performance pretty dreadful) and is going to cause delays for the DN heartbeats 4) I don't even want to think about rebalancing. ... and I'm sure a lot of other problems I'm forgetting at the moment. So don't do it. If you want disaster recovery, set up two completely separate HDFSes and run everything in parallel.
RE: Moving Files to Distributed Cache in MapReduce
I could have sworn that I gave an example earlier this week on how to push and pull stuff from distributed cache. Date: Fri, 29 Jul 2011 14:51:26 -0700 Subject: Re: Moving Files to Distributed Cache in MapReduce From: rogc...@ucdavis.edu To: common-user@hadoop.apache.org jobConf is deprecated in 0.20.2 I believe; you're supposed to be using Configuration for that On Fri, Jul 29, 2011 at 1:59 PM, Mohit Anchlia mohitanch...@gmail.comwrote: Is this what you are looking for? http://hadoop.apache.org/common/docs/current/mapred_tutorial.html search for jobConf On Fri, Jul 29, 2011 at 1:51 PM, Roger Chen rogc...@ucdavis.edu wrote: Thanks for the response! However, I'm having an issue with this line Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf); because conf has private access in org.apache.hadoop.configured On Fri, Jul 29, 2011 at 11:18 AM, Mapred Learn mapred.le...@gmail.com wrote: I hope my previous reply helps... On Fri, Jul 29, 2011 at 11:11 AM, Roger Chen rogc...@ucdavis.edu wrote: After moving it to the distributed cache, how would I call it within my MapReduce program? On Fri, Jul 29, 2011 at 11:09 AM, Mapred Learn mapred.le...@gmail.com wrote: Did you try using -files option in your hadoop jar command as: /usr/bin/hadoop jar jar name main class name -files absolute path of file to be added to distributed cache input dir output dir On Fri, Jul 29, 2011 at 11:05 AM, Roger Chen rogc...@ucdavis.edu wrote: Slight modification: I now know how to add files to the distributed file cache, which can be done via this command placed in the main or run class: DistributedCache.addCacheFile(new URI(/user/hadoop/thefile.dat), conf); However I am still having trouble locating the file in the distributed cache. *How do I call the file path of thefile.dat in the distributed cache as a string?* I am using Hadoop 0.20.2 On Fri, Jul 29, 2011 at 10:26 AM, Roger Chen rogc...@ucdavis.edu wrote: Hi all, Does anybody have examples of how one moves files from the local filestructure/HDFS to the distributed cache in MapReduce? A Google search turned up examples in Pig but not MR. -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center
RE: Moving Files to Distributed Cache in MapReduce
Here's the meat of my post earlier... Sample code on putting a file on the cache: DistributedCache.addCacheFile(new URI(path+MyFileName,conf)); Sample code in pulling data off the cache: private Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration()); boolean exitProcess = false; int i=0; while (!exit){ fileName = localFiles[i].getName(); if (fileName.equalsIgnoreCase(model.txt)){ // Build your input file reader on localFiles[i].toString() exitProcess = true; } i++; } Note that this is SAMPLE code. I didn't trap the exit condition if the file isn't there and you go beyond the size of the array localFiles[]. Also I set exit to false because its easier to read this as Do this loop until the condition exitProcess is true. When you build your file reader you need the full path, not just the file name. The path will vary when the job runs. HTH -Mike From: michael_se...@hotmail.com To: common-user@hadoop.apache.org Subject: RE: Moving Files to Distributed Cache in MapReduce Date: Fri, 29 Jul 2011 21:43:37 -0500 I could have sworn that I gave an example earlier this week on how to push and pull stuff from distributed cache. Date: Fri, 29 Jul 2011 14:51:26 -0700 Subject: Re: Moving Files to Distributed Cache in MapReduce From: rogc...@ucdavis.edu To: common-user@hadoop.apache.org jobConf is deprecated in 0.20.2 I believe; you're supposed to be using Configuration for that On Fri, Jul 29, 2011 at 1:59 PM, Mohit Anchlia mohitanch...@gmail.comwrote: Is this what you are looking for? http://hadoop.apache.org/common/docs/current/mapred_tutorial.html search for jobConf On Fri, Jul 29, 2011 at 1:51 PM, Roger Chen rogc...@ucdavis.edu wrote: Thanks for the response! However, I'm having an issue with this line Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf); because conf has private access in org.apache.hadoop.configured On Fri, Jul 29, 2011 at 11:18 AM, Mapred Learn mapred.le...@gmail.com wrote: I hope my previous reply helps... On Fri, Jul 29, 2011 at 11:11 AM, Roger Chen rogc...@ucdavis.edu wrote: After moving it to the distributed cache, how would I call it within my MapReduce program? On Fri, Jul 29, 2011 at 11:09 AM, Mapred Learn mapred.le...@gmail.com wrote: Did you try using -files option in your hadoop jar command as: /usr/bin/hadoop jar jar name main class name -files absolute path of file to be added to distributed cache input dir output dir On Fri, Jul 29, 2011 at 11:05 AM, Roger Chen rogc...@ucdavis.edu wrote: Slight modification: I now know how to add files to the distributed file cache, which can be done via this command placed in the main or run class: DistributedCache.addCacheFile(new URI(/user/hadoop/thefile.dat), conf); However I am still having trouble locating the file in the distributed cache. *How do I call the file path of thefile.dat in the distributed cache as a string?* I am using Hadoop 0.20.2 On Fri, Jul 29, 2011 at 10:26 AM, Roger Chen rogc...@ucdavis.edu wrote: Hi all, Does anybody have examples of how one moves files from the local filestructure/HDFS to the distributed cache in MapReduce? A Google search turned up examples in Pig but not MR. -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center
RE: Hadoop upgrade Java version
Yeah... you can do that... I haven't tried to mix/match different releases within a cluster, although I suspect I could without any problems, but I don't want to risk it. Until we have a problem, or until we expand our clouds with a batch of new nodes, I like to follow the mantra... if it aint broke, don't fix it. (I would suggest if / when you upgrade your java that you bounce the cloud. Even with a rolling restart, you have to plan for it...) Date: Mon, 18 Jul 2011 22:54:54 -0500 Subject: RE: Hadoop upgrade Java version From: jshrini...@gmail.com To: common-user@hadoop.apache.org We are using Oracle JDK 6 update 26 and have not observed any problems so far. EA of JDK 6 update 27 is available now. We are planning to move to update 27 when the GA release is made available. -Shrinivas On Jul 18, 2011 7:52 PM, Michael Segel michael_se...@hotmail.com wrote: Any release after _21 seems to work fine. CC: highpoint...@gmail.com; common-user@hadoop.apache.org From: john.c.st...@gmail.com Subject: Re: Hadoop upgrade Java version Date: Mon, 18 Jul 2011 19:37:02 -0600 To: common-user@hadoop.apache.org We're using u26 without any problems. On Jul 18, 2011, at 4:45 PM, highpointe highpoint...@gmail.com wrote: So uhm yeah. Thanks for the Informica commercial. Now back to my original question. Anyone have a suggestion on what version of Java I should be using with the latest Hadoop release. Sent from my iPhone On Jul 18, 2011, at 11:26 AM, high pointe highpoint...@gmail.com wrote: We are in the process of upgrading to the most current version of Hadoop. At the same time we are in need of upgrading Java. We are currently running u17. I have read elsewhere that u21 or up is the best route to go. Currently the version is u26. Has anyone gone all the way to u26 with or without issues? Thanks for the help.
RE: Which release to use?
EMC has inked a deal with MapRTech to resell their release and support services for MapRTech. Does this mean that they are going to stop selling their own release on Greenplum? Maybe not in the near future, however, a Greenplum appliance may not get the customer transaction that their reselling of MapR will generate. It sounds like they are hedging their bets and are taking an 'IBM' approach. Subject: RE: Which release to use? Date: Mon, 18 Jul 2011 08:30:59 -0500 From: jeff.schm...@shell.com To: common-user@hadoop.apache.org Steve, I read your blog nice post - I believe EMC is selling the Greenplumb solution as an appliance - Cheers - Jeffery -Original Message- From: Steve Loughran [mailto:ste...@apache.org] Sent: Friday, July 15, 2011 4:07 PM To: common-user@hadoop.apache.org Subject: Re: Which release to use? On 15/07/2011 18:06, Arun C Murthy wrote: Apache Hadoop is a volunteer driven, open-source project. The contributors to Apache Hadoop, both individuals and folks across a diverse set of organizations, are committed to driving the project forward and making timely releases - see discussion on hadoop-0.23 with a raft newer features such as HDFS Federation, NextGen MapReduce and plans for HA NameNode etc. As with most successful projects there are several options for commercial support to Hadoop or its derivatives. However, Apache Hadoop has thrived before there was any commercial support (I've personally been involved in over 20 releases of Apache Hadoop and deployed them while at Yahoo) and I'm sure it will in this new world order. We, the Apache Hadoop community, are committed to keeping Apache Hadoop 'free', providing support to our users and to move it forward at a rapid rate. Arun makes a good point which is that the Apache project depends on contributions from the community to thrive. That includes -bug reports -patches to fix problems -more tests -documentation improvements: more examples, more on getting started, troubleshooting, etc. If there's something lacking in the codebase, and you think you can fix it, please do so. Helping with the documentation is a good start, as it can be improved, and you aren't going to break anything. Once you get into changing the code, you'll end up working with the head of whichever branch you are targeting. The other area everyone can contribute on is testing. Yes, Y! and FB can test at scale, yes, other people can test large clusters too -but nobody has a network that looks like yours but you. And Hadoop does care about network configurations. Testing beta and release candidate releases in your infrastructure, helps verify that the final release will work on your site, and you don't end up getting all the phone calls about something not working
RE: Which release to use?
Tom, I'm not sure that you're really honoring the purpose and approach of this list. I mean on the one hand, you're not under any obligation to respond or participate on the list. And I can respect that. You're not in an SD role so you're not 'customer facing' and not used to having to deal with these types of questions. On the other, you're not being free with your information. So when this type of question comes up, it becomes very easy to discount IBM as a release or source provider for commercial support. Without information, I'm afraid that I may have to make recommendations to my clients that may be out of date. There is even some speculation from analysts that recent comments from IBM are more of an indication that IBM is still not ready for prime time. I'm sorry you're not in a position to detail your offering. Maybe by September you might be ready and then talk to our CHUG? -Mike To: common-user@hadoop.apache.org Subject: Re: Which release to use? From: tdeut...@us.ibm.com Date: Sat, 16 Jul 2011 10:29:55 -0700 Hi Rita - I want to make sure we are honoring the purpose/approach of this list. So you are welcome to ping me for information, but let's take this discussion off the list at this point. Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Rita rmorgan...@gmail.com 07/16/2011 08:53 AM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject Re: Which release to use? I am curious about the IBM product BigInishgts. Where can we download it? It seems we have to register to download it? On Fri, Jul 15, 2011 at 12:38 PM, Tom Deutsch tdeut...@us.ibm.com wrote: One quick clarification - IBM GA'd a product called BigInsights in 2Q. It faithfully uses the Hadoop stack and many related projects - but provides a number of extensions (that are compatible) based on customer requests. Not appropriate to say any more on this list, but the info on it is all publically available. Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Michael Segel michael_se...@hotmail.com 07/15/2011 07:58 AM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject RE: Which release to use? Unfortunately the picture is a bit more confusing. Yahoo! is now HortonWorks. Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release. So those selling commercial support are: *Cloudera *HortonWorks *MapRTech *EMC (reselling MapRTech, but had announced their own) *IBM (not sure what they are selling exactly... still seems like smoke and mirrors...) *DataStax So while you can use the Apache release, it may not make sense for your organization to do so. (Said as I don the flame retardant suit...) The issue is that outside of HortonWorks which is stating that they will support the official Apache release, everything else is a derivative work of Apache's Hadoop. From what I have seen, Cloudera's release is the closest to the Apache release. Like I said, things are getting interesting. HTH -- --- Get your facts first, then you can distort them as you please.--
RE: Which release to use?
Well that's CDH3. :-) And yes, that's because up until the past month... other releases didn't exist w commercial support. Now there are more players as we look at the movement from leading edge to mainstream adopters. Subject: RE: Which release to use? Date: Mon, 18 Jul 2011 14:30:39 -0500 From: jeff.schm...@shell.com To: common-user@hadoop.apache.org Most people are using CH3 - if you need some features from another distro use that - http://www.cloudera.com/hadoop/ I wonder if the Cloudera people realize that CH3 was a pretty happening punk band back in the day (if not they do now = ) http://en.wikipedia.org/wiki/Channel_3_%28band%29 cheers - Jeffery Schmitz Projects and Technology 3737 Bellaire Blvd Houston, Texas 77001 Tel: +1-713-245-7326 Fax: +1 713 245 7678 Email: jeff.schm...@shell.com Intergalactic Proton Powered Electrical Tentacled Advertising Droids! -Original Message- From: Michael Segel [mailto:michael_se...@hotmail.com] Sent: Monday, July 18, 2011 2:10 PM To: common-user@hadoop.apache.org Subject: RE: Which release to use? Tom, I'm not sure that you're really honoring the purpose and approach of this list. I mean on the one hand, you're not under any obligation to respond or participate on the list. And I can respect that. You're not in an SD role so you're not 'customer facing' and not used to having to deal with these types of questions. On the other, you're not being free with your information. So when this type of question comes up, it becomes very easy to discount IBM as a release or source provider for commercial support. Without information, I'm afraid that I may have to make recommendations to my clients that may be out of date. There is even some speculation from analysts that recent comments from IBM are more of an indication that IBM is still not ready for prime time. I'm sorry you're not in a position to detail your offering. Maybe by September you might be ready and then talk to our CHUG? -Mike To: common-user@hadoop.apache.org Subject: Re: Which release to use? From: tdeut...@us.ibm.com Date: Sat, 16 Jul 2011 10:29:55 -0700 Hi Rita - I want to make sure we are honoring the purpose/approach of this list. So you are welcome to ping me for information, but let's take this discussion off the list at this point. Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Rita rmorgan...@gmail.com 07/16/2011 08:53 AM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject Re: Which release to use? I am curious about the IBM product BigInishgts. Where can we download it? It seems we have to register to download it? On Fri, Jul 15, 2011 at 12:38 PM, Tom Deutsch tdeut...@us.ibm.com wrote: One quick clarification - IBM GA'd a product called BigInsights in 2Q. It faithfully uses the Hadoop stack and many related projects - but provides a number of extensions (that are compatible) based on customer requests. Not appropriate to say any more on this list, but the info on it is all publically available. Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Michael Segel michael_se...@hotmail.com 07/15/2011 07:58 AM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject RE: Which release to use? Unfortunately the picture is a bit more confusing. Yahoo! is now HortonWorks. 
Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release. So those selling commercial support are: *Cloudera *HortonWorks *MapRTech *EMC (reselling MapRTech, but had announced their own) *IBM (not sure what they are selling exactly... still seems like smoke and mirrors...) *DataStax So while you can use the Apache release, it may not make sense for your organization to do so. (Said as I don the flame retardant suit...) The issue is that outside of HortonWorks which is stating that they will support the official Apache release, everything else is a derivative work of Apache's Hadoop. From what I have seen, Cloudera's release is the closest to the Apache release. Like I said, things are getting interesting. HTH -- --- Get your facts first, then you can distort them as you please.--
RE: Which release to use?
Date: Mon, 18 Jul 2011 18:19:38 -0700 Subject: Re: Which release to use? From: mcsri...@gmail.com To: common-user@hadoop.apache.org Mike, Just a minor inaccuracy in your email. Here's setting the record straight: 1. MapR directly sells their distribution of Hadoop. Support is from MapR. 2. EMC also sells the MapR distribution, for use on any hardware. Support is from EMC worldwide. 3. EMC also sells a Hadoop appliance, which has the MapR distribution specially built for it. Support is from EMC. 4. MapR also has a free, unlimited, unrestricted version called M3, which has the same 2-5x performance, management and stability improvements, and includes NFS. It is not crippleware, and the unlimited, unrestricted, free use does not expire on any date. Hope that clarifies what MapR is doing. thanks regards, Srivas. Srivas, I'm sorry, I thought I was being clear in that I was only addressing EMC and not MapR directly. I was responding to post about EMC selling a Greenplum appliance. I wanted to point out that EMC will resell MapR's release along with their own (EMC) support. The point I was trying to make was that with respect to derivatives of Hadoop, I believe that MapR has a more compelling story than either EMC or DataStax. IMHO replacing Java HDFS w either GreenPlum or Cassandra has a limited market. When a company is going to look at a M/R solution cost and performance are going to be at the top of the list. MapR isn't cheap but if you look at the features in M5, if they work, then you have a very compelling reason to look at their release. Some of the people I spoke to when I was in Santa Clara were in the beta program. They indicated that MapR did what they claimed. Things are definitely starting to look interesting. -Mike On Mon, Jul 18, 2011 at 11:33 AM, Michael Segel michael_se...@hotmail.comwrote: EMC has inked a deal with MapRTech to resell their release and support services for MapRTech. Does this mean that they are going to stop selling their own release on Greenplum? Maybe not in the near future, however, a Greenplum appliance may not get the customer transaction that their reselling of MapR will generate. It sounds like they are hedging their bets and are taking an 'IBM' approach. Subject: RE: Which release to use? Date: Mon, 18 Jul 2011 08:30:59 -0500 From: jeff.schm...@shell.com To: common-user@hadoop.apache.org Steve, I read your blog nice post - I believe EMC is selling the Greenplumb solution as an appliance - Cheers - Jeffery -Original Message- From: Steve Loughran [mailto:ste...@apache.org] Sent: Friday, July 15, 2011 4:07 PM To: common-user@hadoop.apache.org Subject: Re: Which release to use? On 15/07/2011 18:06, Arun C Murthy wrote: Apache Hadoop is a volunteer driven, open-source project. The contributors to Apache Hadoop, both individuals and folks across a diverse set of organizations, are committed to driving the project forward and making timely releases - see discussion on hadoop-0.23 with a raft newer features such as HDFS Federation, NextGen MapReduce and plans for HA NameNode etc. As with most successful projects there are several options for commercial support to Hadoop or its derivatives. However, Apache Hadoop has thrived before there was any commercial support (I've personally been involved in over 20 releases of Apache Hadoop and deployed them while at Yahoo) and I'm sure it will in this new world order. We, the Apache Hadoop community, are committed to keeping Apache Hadoop 'free', providing support to our users and to move it forward at a rapid rate. 
Arun makes a good point which is that the Apache project depends on contributions from the community to thrive. That includes -bug reports -patches to fix problems -more tests -documentation improvements: more examples, more on getting started, troubleshooting, etc. If there's something lacking in the codebase, and you think you can fix it, please do so. Helping with the documentation is a good start, as it can be improved, and you aren't going to break anything. Once you get into changing the code, you'll end up working with the head of whichever branch you are targeting. The other area everyone can contribute on is testing. Yes, Y! and FB can test at scale, yes, other people can test large clusters too -but nobody has a network that looks like yours but you. And Hadoop does care about network configurations. Testing beta and release candidate releases in your infrastructure, helps verify that the final release will work on your site, and you don't end up getting all the phone calls about something not working
RE: Hadoop upgrade Java version
Any release after _21 seems to work fine. CC: highpoint...@gmail.com; common-user@hadoop.apache.org From: john.c.st...@gmail.com Subject: Re: Hadoop upgrade Java version Date: Mon, 18 Jul 2011 19:37:02 -0600 To: common-user@hadoop.apache.org We're using u26 without any problems. On Jul 18, 2011, at 4:45 PM, highpointe highpoint...@gmail.com wrote: So uhm yeah. Thanks for the Informica commercial. Now back to my original question. Anyone have a suggestion on what version of Java I should be using with the latest Hadoop release. Sent from my iPhone On Jul 18, 2011, at 11:26 AM, high pointe highpoint...@gmail.com wrote: We are in the process of upgrading to the most current version of Hadoop. At the same time we are in need of upgrading Java. We are currently running u17. I have read elsewhere that u21 or up is the best route to go. Currently the version is u26. Has anyone gone all the way to u26 with or without issues? Thanks for the help.
RE: Which release to use?
Well I'm sort of curious as to what is in the 'free' version which differentiates from the Apache release? Earlier you wrote that IBM was faithful to the Apache release, plus a few 'extras'. (I think I can find your exact quote and I'm sorry I'm paraphrasing your statements.) This begs two questions... 1) What is IBM providing to your 'customers' to justify the uplift or premium for IBM's brand name. 2) If your release includes components which are not part of the Apache release, is it Apache's Hadoop? or considered a derivative? The interesting thing about #2 is that I don't know if or what represents Hadoop. I mean if you take an earlier release of Hadoop like 20.2 where current is 20.203 and apply a subset of patches that are Apache committed, is this not Apache Hadoop or a derivative work since you are not 100% at the latest release. Note: This is a broader question than just what IBM is releasing but what is meant by saying Hadoop or derived from Hadoop. Clearly DataStax and MapR are derivatives. Cloudera? This goes back to the OP's question 'Which release to use?'... And I have to apologize if I seem a bit suspect on what IBM has to say. When IBM first entered with an announced Hadoop release it was only for 32bit JVM and only on IBM's JVM. The last I heard, IBM's upsell was a configuration tool, which if anyone has built more than one Cloud/Cluster, its pretty much worthless. So it would be interesting to see what IBM is really offering in this space. HTH -Mike Subject: Re: Which release to use? From: tdeut...@us.ibm.com Date: Sun, 17 Jul 2011 14:07:20 -0600 To: common-user@hadoop.apache.org There are two release levels - one is free but most of our customers want our additional engineering so they use Enterprise Edition (which is not free). Happy to answer questions off list. --- Sent from my Blackberry so please excuse typing and spelling errors. - Original Message - From: Steve Loughran [ste...@apache.org] Sent: 07/17/2011 08:34 PM CET To: common-user@hadoop.apache.org Subject: Re: Which release to use? On 16/07/2011 16:53, Rita wrote: I am curious about the IBM product BigInishgts. Where can we download it? It seems we have to register to download it? I think you have to pay to use it
RE: Which release to use?
Unfortunately the picture is a bit more confusing. Yahoo! is now HortonWorks. Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release. So those selling commercial support are: *Cloudera *HortonWorks *MapRTech *EMC (reselling MapRTech, but had announced their own) *IBM (not sure what they are selling exactly... still seems like smoke and mirrors...) *DataStax So while you can use the Apache release, it may not make sense for your organization to do so. (Said as I don the flame retardant suit...) The issue is that outside of HortonWorks which is stating that they will support the official Apache release, everything else is a derivative work of Apache's Hadoop. From what I have seen, Cloudera's release is the closest to the Apache release. Like I said, things are getting interesting. HTH -Mike From: ev...@yahoo-inc.com To: common-user@hadoop.apache.org Date: Fri, 15 Jul 2011 07:35:45 -0700 Subject: Re: Which release to use? Adarsh, Yahoo! no longer has its own distribution of Hadoop. It has been merged into the 0.20.2XX line so 0.20.203 is what Yahoo is running internally right now, and we are moving towards 0.20.204 which should be out soon. I am not an expert on Cloudera so I cannot really map its releases to the Apache Releases, but their distro is based off of Apache Hadoop with a few bug fixes and maybe a few features like append added in on top of it, but you need to talk to Cloudera about the exact details. For the most part they are all very similar. You need to think most about support, there are several companies that can sell you support if you want/need it. You also need to think about features vs. stability. The 0.20.203 release has been tested on a lot of machines by many different groups, but may be missing some features that are needed in some situations. --Bobby On 7/14/11 11:49 PM, Adarsh Sharma adarsh.sha...@orkash.com wrote: Hadoop releases are issued time by time. But one more thing related to hadoop usage, There are so many providers that provides the distribution of Hadoop ; 1. Apache Hadoop 2. Cloudera 3. Yahoo etc. Which distribution is best among them on production usage. I think Cloudera's is best among them. Best Regards, Adarsh Owen O'Malley wrote: On Jul 14, 2011, at 4:33 PM, Teruhiko Kurosaka wrote: I'm a newbie and I am confused by the Hadoop releases. I thought 0.21.0 is the latest greatest release that I should be using but I noticed 0.20.203 has been released lately, and 0.21.X is marked unstable, unsupported. Should I be using 0.20.203? Yes, I apologize for confusing release numbering, but the best release to use is 0.20.203.0. It includes security, job limits, and many other improvements over 0.20.2 and 0.21.0. Unfortunately, it doesn't have the new sync support so it isn't suitable for using with HBase. Most large clusters use a separate version of HDFS for HBase. -- Owen
RE: Which release to use?
See, I knew there was something that I forgot. It all goes back to the question ... 'which release to use'... 2 years ago it was a very simple decision. Now, not so much. :-) And while Arun and Ownen work for a vendor, I do not and I try to follow each company and their offering. As Hadoop goes mainstream, the question of which vendor to choose gets interesting. Just like in the 90's during the database vendor wars, it looks like the vendor who has the best sales force and PR will win. (Not necessarily the best product.) JMHO -Mike Date: Fri, 15 Jul 2011 16:25:55 -0500 Subject: Re: Which release to use? From: markkerz...@gmail.com To: common-user@hadoop.apache.org Steve, this is so well said, do you mind if I repeat it here, http://shmsoft.blogspot.com/2011/07/hadoop-commercial-support-options.html Thank you, Mark On Fri, Jul 15, 2011 at 4:00 PM, Steve Loughran ste...@apache.org wrote: On 15/07/2011 15:58, Michael Segel wrote: Unfortunately the picture is a bit more confusing. Yahoo! is now HortonWorks. Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release. So those selling commercial support are: *Cloudera *HortonWorks *MapRTech *EMC (reselling MapRTech, but had announced their own) *IBM (not sure what they are selling exactly... still seems like smoke and mirrors...) *DataStax + Amazon, indirectly, that do their own derivative work of some release of Hadoop (which version is it based on?) I've used 0.21, which was the first with the new APIs and, with MRUnit, has the best test framework. For my small-cluster uses, it worked well. (oh, and I didn't care about security)
Re: Deduplication Effort in Hadoop
You don't have dupes because the key has to be unique. Sent from my Palm Pre on AT&T On Jul 14, 2011 11:00 AM, jonathan.hw...@accenture.com <jonathan.hw...@accenture.com> wrote: Hi All, In databases you can define primary keys to ensure no duplicate data gets loaded into the system. Let's say I have around 1 billion records flowing into my system every day and some of these are repeated data (same records). I can use 2-3 columns in the record to match and look for duplicates. What is the best strategy for de-duplication? The duplicated records should only appear within the last 2 weeks. I want a fast way to get the data into the system without much delay. Can HBase or Hive help in any way? Thanks! Jonathan
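To make the 'the key has to be unique' point concrete with HBase: if the row key is built from the two or three columns you would otherwise match on, re-loading the same record simply overwrites the same row, so the table is de-duplicated by construction. A rough sketch only, using the old HTable client API; the table name, column family and field values are invented for the example. The two-week window could then come from a TTL on the column family rather than an explicit purge job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;

public class DedupLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "records");

        // pretend these three fields came out of one incoming record
        String custId = "12345", txnId = "A-678", day = "2011-07-14";

        // row key = hash of the natural-key columns, so duplicates collapse onto one row
        byte[] rowKey = Bytes.toBytes(MD5Hash.getMD5AsHex(Bytes.toBytes(custId + "|" + txnId + "|" + day)));

        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("rest of the record here"));
        table.put(put);
        table.close();
    }
}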
RE: Performance Tuning
Matthew, I understood that Juan was talking about a 2 socket quad core box. We run boxes with the e5500 (xeon quad core ) chips. Linux sees these as 16 cores. Our data nodes are 32GB Ram w 4 x 2TB SATA. Its a pretty basic configuration. What I was saying was that if you consider 1 core for each TT, DN and RS jobs, thats 3 out of the 8 physical cores, leaving you 5 cores or 10 'hyperthread cores'. So you could put up 10 m/r slots on the machine. Note that on the main tasks (TT, DN, RS) I dedicate the physical core. Of course your mileage may vary if you're doing non-standard or normal things. A good starting point is 6 mappers and 4 reducers. And of course YMMV depending on if you're using MapR's release, Cloudera, and if you're running HBase or something else on the cluster. From our experience... we end up getting disk I/O bound first, and then network or memory becomes the next constraint. Really the xeon chipsets are really good. HTH -Mike From: matthew.go...@monsanto.com To: common-user@hadoop.apache.org Subject: RE: Performance Tunning Date: Tue, 28 Jun 2011 14:46:40 + Mike, I'm not really sure I have seen a community consensus around how to handle hyper-threading within Hadoop (although I have seen quite a few articles that discuss it). I was assuming that when Juan mentioned they were 4-core boxes that he meant 4 physical cores and not HT cores. I was more stating that the starting point should be 1 slot per thread (or hyper-threaded core) but obviously reviewing the results from ganglia, or any other monitoring solution, will help you come up with a more concrete configuration based on the load. My brain might not be working this morning but how did you get the 10 slots again? That seems low for an 8 physical core box but somewhat overextending for a 4 physical core box. Matt -Original Message- From: im_gu...@hotmail.com [mailto:im_gu...@hotmail.com] On Behalf Of Michel Segel Sent: Tuesday, June 28, 2011 7:39 AM To: common-user@hadoop.apache.org Subject: Re: Performance Tunning Matt, You have 2 threads per core, so your Linux box thinks an 8 core box has16 cores. In my calcs, I tend to take a whole core for TT DN and RS and then a thread per slot so you end up w 10 slots per node. Of course memory is also a factor. Note this is only a starting point.you can always tune up. Sent from a remote device. Please excuse any typos... Mike Segel On Jun 27, 2011, at 11:11 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote: Per node: 4 cores * 2 processes = 8 slots Datanode: 1 slot Tasktracker: 1 slot Therefore max of 6 slots between mappers and reducers. Below is part of our mapred-site.xml. The thing to keep in mind is the number of maps is defined by the number of input splits (which is defined by your data) so you only need to worry about setting the maximum number of concurrent processes per node. In this case the property you want to hone in on is mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. Keep in mind there are a LOT of other tuning improvements that can be made but it requires an strong understanding of your job load. 
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
</configuration>
RE: Any reason Hadoop logs cant be directed to a separate filesystem?
Yes, and its called using cron and writing a simple ksh script to clear out any files that are older than 15 days. There may be another way, but that's really the easiest. Date: Thu, 23 Jun 2011 02:44:48 +0530 From: jagaran_...@yahoo.co.in Subject: Re: Any reason Hadoop logs cant be directed to a separate filesystem? To: common-user@hadoop.apache.org Hi, Can I limit the log file duration ? I want to keep files for last 15 days only. Regards, Jagaran From: Jack Craig jcr...@carrieriq.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Wed, 22 June, 2011 2:00:23 PM Subject: Re: Any reason Hadoop logs cant be directed to a separate filesystem? Thx to both respondents. Note i've not tried this redirection as I have only production grids available. Our grids are growing and with them, log volume. As until now that log location has been in the same fs as the grid data, so running out of space due log bloat is a growing problem. From your replies, sounds like I can relocate my logs, Cool! But now the tough question, if i set up a too small partition and it runs out of space, will my grid become unstable if hadoop can no longer write to its logs? Thx again, jackc... Jack Craig, Operations CarrierIQ.comhttp://CarrierIQ.com 1200 Villa Ct, Suite 200 Mountain View, CA. 94041 650-625-5456 On Jun 22, 2011, at 1:09 PM, Harsh J wrote: Jack, I believe the location can definitely be set to any desired path. Could you tell us the issues you face when you change it? P.s. The env var is used to set the config property hadoop.log.dir internally. So as long as you use the regular scripts (bin/ or init.d/ ones) to start daemons, it would apply fine. On Thu, Jun 23, 2011 at 1:32 AM, Jack Craig jcr...@carrieriq.commailto:jcr...@carrieriq.com wrote: Hi Folks, In the hadoop-env.sh, we find, ... # Where log files are stored. $HADOOP_HOME/logs by default. # export HADOOP_LOG_DIR=${HADOOP_HOME}/logs is there any reason this location could not be a separate filesystem on the name node? Thx, jackc... Jack Craig, Operations CarrierIQ.comhttp://CarrierIQ.com 1200 Villa Ct, Suite 200 Mountain View, CA. 94041 650-625-5456 -- Harsh J
RE: Hadoop Cluster Multi-datacenter
PWC now getting in to Hadoop? Interesting Sanjeev, the simple short answer is that you don't create a cloud that spans a data center. Bad design. You build two clusters one per data center. To: common-user@hadoop.apache.org Subject: Hadoop Cluster Multi-datacenter From: sanjeev.ta...@us.pwc.com Date: Mon, 6 Jun 2011 22:07:51 -0700 Hello, I wanted to know if anyone has any tips or tutorials on howto install the hadoop cluster on multiple datacenters Do you need ssh connectivity between the nodes across these data centers? Thanks in advance for any guidance you can provide. __ The information transmitted, including any attachments, is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited, and all liability arising therefrom is disclaimed. If you received this in error, please contact the sender and delete the material from any computer. PricewaterhouseCoopers LLP is a Delaware limited liability partnership. This communication may come from PricewaterhouseCoopers LLP or one of its subsidiaries.
RE: Why inter-rack communication in mapreduce slow?
Chris, I've gone back through the thread and here's Elton's initial question... On 06/06/11 08:22, elton sky wrote: hello everyone, As I don't have experience with big-scale clusters, I cannot figure out why the inter-rack communication in a mapreduce job is significantly slower than intra-rack. I saw the Cisco Catalyst 4900 series switch can reach up to 320 Gbps forwarding capacity. Connected to 48 nodes with 1 Gbps Ethernet each, there should not be much contention at the switch, should there? Elton's question deals with why connections within the same switch are faster than connections that traverse a set of switches. The issue isn't so much one of the fabric within the switch itself, but the width of the connection between the two switches. If you have 40 Gbps (each direction) on a switch and you want it to communicate seamlessly with machines on the next switch, you have to be able to bond four 10 GbE ports together. (Note: there's a bit more to it, but that's the general idea.) You're going to have a significant slowdown on communication between nodes that are on different racks because of the bandwidth limitations on the ports used to connect the switches, and not the 'fabric' within the switch itself. To your point, you can monitor your jobs and see how much of your work is being done by 'data local' tasks. In one job we had 519 tasks started, of which 482 were 'data local'. So for ~93% of the tasks we didn't have an issue with any network latency. And for the remaining ~7% of tasks, you have to consider what percentage would have pulled their data across a 'trunk'. So yes, network latency isn't going to be a huge factor in terms of improving overall efficiency. However, that's just for Hadoop. What happens when you run HBase? ;-) (You can have more network traffic during a m/r job.) HTH -Mike
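To put rough, purely illustrative numbers on that: 48 nodes at 1 Gbps each can offer up to 48 Gbps toward their top-of-rack switch, but if the trunk to the next rack is, say, two bonded 10 GbE ports (20 Gbps), cross-rack flows are oversubscribed roughly 2.4:1 at the trunk even though the 320 Gbps switching fabric is nowhere near saturated. Intra-rack traffic never touches that trunk, which is exactly the gap Elton was asking about.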
RE: Dynamic Data Sets
James, If I understand correctly, you get a set of immutable attributes plus a state which can change. If you wanted to use HBase... I'd say create a unique identifier for your immutable attributes, then store the unique id, timestamp, and state, assuming that you're really interested in looking at the state change over time. So what you end up with is one table of immutable attributes, with a unique key, and then another table where you use the same unique key and create columns whose names are timestamps and whose values are the state. HTH -Mike Date: Wed, 13 Apr 2011 18:12:58 -0700 Subject: Dynamic Data Sets From: selek...@yahoo.com To: common-user@hadoop.apache.org I have a requirement where I have large sets of incoming data into a system I own. A single unit of data in this set has a set of immutable attributes + state attached to it. The state is dynamic and can change at any time. What is the best way to run analytical queries on data of such nature? One way is to maintain this data in a separate store, take a snapshot at a point in time, and then import it into HDFS for analysis using Hadoop Map-Reduce. I do not see this approach scaling, since moving data is obviously expensive. If I were to directly maintain this data as Sequence Files in HDFS, how would updates work? I am new to Hadoop/HDFS, so any suggestions/critique is welcome. I know that HBase works around this problem through multi-version concurrency control techniques. Is that the only option? Are there any alternatives? Also note that all aggregation and analysis I want to do is time based, i.e. sum of x on pivot y over a day, 2 days, week, month, etc. For such use cases, is it advisable to use HDFS directly or to use systems built on top of Hadoop like Hive or HBase?
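A rough sketch of the two-table layout Mike suggests, driven through the HBase shell. The table, family, row, and value names are all invented for illustration:

# One table for the immutable attributes, one for state-over-time.
hbase shell <<'EOF'
create 'entity_attrs', 'a'
create 'entity_state', 's'
# Immutable attributes, keyed by the unique id.
put 'entity_attrs', 'unit-00042', 'a:model',  'X100'
put 'entity_attrs', 'unit-00042', 'a:region', 'us-west'
# State changes: the column qualifier is the timestamp, the value is the state.
put 'entity_state', 'unit-00042', 's:1302740000000', 'ACTIVE'
put 'entity_state', 'unit-00042', 's:1302826400000', 'DEGRADED'
EOF

Because column qualifiers sort lexicographically within a family, fixed-width epoch-millisecond qualifiers come back in time order when you read a row of 'entity_state', which is what the time-based aggregation needs; 'entity_attrs' stays write-once.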
RE: live/dead node problem
Rita, When the NameNode doesn't see a heartbeat for about 10 minutes, it recognizes that the node is down. Per the Hadoop online documentation: Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased. I was trying to find out if there's an hdfs-site parameter that could be set to decrease this time period, but wasn't successful. HTH -Mike Date: Tue, 29 Mar 2011 08:13:43 -0400 Subject: live/dead node problem From: rmorgan...@gmail.com To: common-user@hadoop.apache.org Hello All, Is there a parameter or procedure to check more aggressively for a live/dead node? Despite my killing the hadoop process, I see the node listed as active for more than 10 minutes on the Live Nodes page. Fortunately, the last contact increments. Using branch-0.21, 0985326 -- --- Get your facts first, then you can distort them as you please.--
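For what it's worth, a hedged sketch of the settings usually cited for this timeout. The NameNode treats a DataNode as dead after roughly 2 * the heartbeat recheck interval + 10 * the heartbeat interval, which with the defaults (5 minutes and 3 seconds) is about 10.5 minutes. The property names vary between releases (heartbeat.recheck.interval on 0.20/1.x-derived code, dfs.namenode.heartbeat.recheck-interval on later releases), so verify against your version; the values below are examples only, not recommendations:

<!-- hdfs-site.xml sketch: shrink the dead-node detection window. -->
<property>
  <!-- heartbeat.recheck.interval on older releases -->
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>60000</value> <!-- milliseconds -->
</property>
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value> <!-- seconds -->
</property>

With a 60-second recheck interval and 3-second heartbeats, the formula above works out to about 2.5 minutes before a node is marked dead.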
RE: changing node's rack
This may be weird, but I could have sworn that the script is called repeatedly. One simple test would be to change the rack-aware script and print a message when the script is called. Then change the script and see if it catches the change without restarting the cluster. -Mike From: tdunn...@maprtech.com Date: Sat, 26 Mar 2011 15:50:58 -0700 Subject: Re: changing node's rack To: common-user@hadoop.apache.org CC: rmorgan...@gmail.com I think that the namenode remembers the rack. Restarting the datanode doesn't make it forget. On Sat, Mar 26, 2011 at 7:34 AM, Rita rmorgan...@gmail.com wrote: What is the best way to change the rack of a node? I have tried the following: Killed the datanode process. Changed the rackmap file so the node and IP address entry reflect the new rack, and did a '-refreshNodes'. Restarted the datanode. But it seems the datanode keeps getting registered to the old rack. -- --- Get your facts first, then you can distort them as you please.--
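A minimal topology-script sketch, mainly to make Mike's test concrete. The rackmap path, the logger line, and the exact wiring are assumptions; on the 0.20/0.21 line the script is configured via the topology.script.file.name property in core-site.xml:

#!/bin/bash
# rack-topology.sh -- invoked by the NameNode with one or more IPs/hostnames;
# it must print one rack path per argument.
RACKMAP=/etc/hadoop/rackmap.txt          # hypothetical "host rack" file, e.g. "10.1.1.12 /rack1"
logger "topology script called for: $*"  # Mike's test: watch syslog to see when and how often it runs
for host in "$@"; do
  rack=$(awk -v h="$host" '$1 == h {print $2}' "$RACKMAP")
  echo "${rack:-/default-rack}"
done

If the logger line keeps firing for a datanode after you edit the rackmap file, you know the script is being re-run and the stale rack is coming from the NameNode's memory rather than from the script.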
RE: CDH and Hadoop
Rita, Short answer... Cloudera's release is free, and they also offer a support contract if you want support from them. Cloudera provides sources, but most people use yum (Red Hat/CentOS) to download an already-built release. Should you use it? Depends on what you want to do. If your goal is to get up and running with Hadoop and then focus on *using* Hadoop/HBase/Hive/Pig/etc., then it makes sense. If your goal is to do a deep dive into Hadoop and get your hands dirty mucking around with the latest and greatest in trunk? Then no. You're better off building your own off the official Apache release. Many companies choose Cloudera's release for the following reasons: * Paid support is available. * Companies focus on using a tech, not developing the tech, so Cloudera does the heavy lifting while client companies focus on 'USING' Hadoop. * Cloudera's release makes sure that the versions in the release work together. That is, when you download CDH3B4, you get a version of Hadoop that will work with the included versions of HBase, Hive, etc. And no, it's never a good idea to try to mix and match Hadoop from different environments and versions in a cluster. (I think it will barf on you.) Does that help? -Mike Date: Wed, 23 Mar 2011 10:29:16 -0400 Subject: CDH and Hadoop From: rmorgan...@gmail.com To: common-user@hadoop.apache.org I have been wondering if I should use CDH (http://www.cloudera.com/hadoop/) instead of the standard Hadoop distribution. What do most people use? Is CDH free? Do they provide the tars, or do they provide source code that I simply compile? Can I have some data nodes on CDH and the rest on regular Hadoop? I am asking this because so far I have noticed a serious bug (IMO) in the decommissioning process ( http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201103.mbox/%3cAANLkTikPKGt5zw1QGLse+LPzUDP7Mom=ty_mxfcuo...@mail.gmail.com%3e ) -- --- Get your facts first, then you can distort them as you please.--
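For reference, a hedged sketch of what "use yum" looked like for CDH3. The package names below are from memory of the CDH3 repositories, so double-check them against Cloudera's install docs:

# After adding Cloudera's CDH3 yum repository on each node:
yum install hadoop-0.20                                   # common libraries and client
yum install hadoop-0.20-namenode hadoop-0.20-jobtracker   # on the master(s)
yum install hadoop-0.20-datanode hadoop-0.20-tasktracker  # on the worker nodes

The "don't mix and match" point applies here too: every node in a cluster should be installed from the same repository and version.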