Re: Best Practices for Upgrading Hadoop Version?

2012-05-30 Thread Chris Smith
Michael Noll has a good description of the upgrade process here:
http://www.michael-noll.com/blog/2011/08/23/performing-an-hdfs-upgrade-of-an-hadoop-cluster/

It may not quite reflect the versions of Hadoop you plan to upgrade between,
but it has some good pointers.

Chris

On 30 May 2012 09:12, ramon@accenture.com wrote:

 Hi,

   I did this upgrade on a similar cluster some weeks ago. I used the
 following method, with all commands run as the Hadoop daemons' process
 owner (a command sketch follows the list):

 *   Stop the cluster.
 *   Start only HDFS with: start-dfs.sh -upgrade
 *   At this point the migration has started.
 *   You can check the status with: hadoop dfsadmin -upgradeProgress status
 *   Now you can access files for reading.
 *   If you find any issue, you can roll back the migration with:
     start-dfs.sh -rollback
 *   If everything seems OK, you can mark the upgrade as finalized:
     hadoop dfsadmin -finalizeUpgrade
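
 A minimal command sketch of the sequence above (assuming 0.20/1.0-era
 scripts on the PATH of the HDFS daemon owner; adjust names for your
 distribution):

     stop-all.sh                              # make sure nothing is running
     start-dfs.sh -upgrade                    # start HDFS and begin the metadata upgrade
     hadoop dfsadmin -upgradeProgress status  # repeat until it reports the upgrade is complete
     hadoop fsck / -files -blocks             # read-only sanity check of the filesystem
     # if something looks wrong:
     #   stop-dfs.sh && start-dfs.sh -rollback
     # once you are happy (there is no going back after this):
     hadoop dfsadmin -finalizeUpgrade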


 -Original Message-
 From: Eli Finkelshteyn [mailto:iefin...@gmail.com]
 Sent: Tuesday, 29 May 2012 20:29
 To: common-user@hadoop.apache.org
 Subject: Best Practices for Upgrading Hadoop Version?

 Hi,
 I'd like to upgrade my Hadoop cluster from version 0.20.2-CDH3B4 to 1.0.3.
 I'm running a pretty small cluster of just 4 nodes, and it's not really
 being used by too many people at the moment, so I'm OK if things get dirty
 or it goes offline for a bit. I was looking at the tutorial at
 http://wiki.apache.org/hadoop/Hadoop_Upgrade, but it seems either outdated
 or missing information. Namely, from what I've noticed so far, it doesn't
 specify which user any of the commands should be run as. Since I'm sure
 this is something a lot of people have needed to do, is there a better
 tutorial somewhere for upgrading the Hadoop version in general?

 Eli

 




Re: Moving blocks from a datanode

2012-05-22 Thread Chris Smith
M,

See http://wiki.apache.org/hadoop/FAQ, section 3.6: "I want to make a large
cluster smaller by taking out a bunch of nodes simultaneously. How can this
be done?"

This explains how to decommission nodes by moving the data off the existing
node.  It's fairly easy and painless (just add the node name to the
slaves.exclude file and notify DFS), and once the data is off the node you
could swap out the disks and then reintroduce the node into the cluster
with larger drives (removing the node name from slaves.exclude).
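
A hedged sketch of that round trip (the exclude file is whatever your
dfs.hosts.exclude property points at; the hostname and path below are only
examples):

    # on the namenode, as the HDFS superuser
    echo "datanode05.example.com" >> /etc/hadoop/slaves.exclude
    hadoop dfsadmin -refreshNodes   # namenode starts replicating the node's blocks elsewhere
    hadoop dfsadmin -report         # wait until the node shows as Decommissioned
    # swap the disks, remove the hostname from the exclude file, then:
    hadoop dfsadmin -refreshNodes   # the node rejoins as a normal datanode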

Chris

On 17 May 2012 02:55, Mayuran Yogarajah
mayuran.yogara...@casalemedia.com wrote:

 Our cluster has several nodes which have smaller disks than other nodes and
 as a result fill up quicker.

 I am looking to move data off these nodes and onto the others.



 Here is what I am planning to do:

 1)  On the nodes with smaller disks, set dfs.datanode.du.reserved to a
 larger value

 2)  Restart data nodes

 3)  Run balancer



 Will this have the desired effect?

 If there is a better way to accomplish this please let me know.



 Thanks,

 M
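
 For reference, a minimal sketch of the plan above (the reserved value is
 per volume and in bytes; 10 GB here is only an illustration):

     # step 1: on the small-disk nodes, raise dfs.datanode.du.reserved in
     #         conf/hdfs-site.xml, e.g. to 10737418240 (~10 GB per volume)
     # step 2: restart each affected datanode
     hadoop-daemon.sh stop datanode && hadoop-daemon.sh start datanode
     # step 3: run the balancer until nodes are within 5% of the cluster average
     start-balancer.sh -threshold 5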




Re: collecting CPU, mem, iops of hadoop jobs

2012-01-03 Thread Chris Smith
Have a look at OpenTSDB (http://opentsdb.net/overview.html): it does not
have the same down-sampling issue as Ganglia, and it stores the metrics
in HBase, which makes the data easier to access and process.
It's also pretty easy to add your own metrics.
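
To give a flavour of how easy that is, the TSD accepts plain-text "put"
lines over a TCP socket (the metric name, tags, host and port below are
only examples):

    # push one data point to a TSD
    echo "put hadoop.jobs.running $(date +%s) 42 cluster=prod host=$(hostname)" \
      | nc tsd.example.com 4242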

Another useful utility is 'collectl'
(http://collectl.sourceforge.net/), which I tend to leave running in
the background on each node, collecting, storing and managing machine
metrics locally - it's very lightweight.  When I have an issue that
requires a metric I forgot to capture with Ganglia, I usually find it
in the 'collectl' logs - as long as I get to the logs before they roll,
usually after a week.  It also doesn't have the down-sampling issue, but
it doesn't automatically aggregate the data to a central database.
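
A rough sketch of how I run it (flag spellings from memory, so check the
collectl man page; the log directory and file name are just examples):

    # record CPU, disk and network detail to rolling log files
    collectl -sCDN -f /var/log/collectl &
    # later, play a recorded file back for the window of interest
    collectl -p /var/log/collectl/node01-20120101.raw.gz --from 09:00 --thru 10:00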

Regards,

Chris

On 21 December 2011 01:20, Arun C Murthy a...@hortonworks.com wrote:
 Go ahead and open a MR jira (would appreciate a patch too! ;) ).

 thanks,
 Arun

 On Dec 20, 2011, at 2:55 PM, Patai Sangbutsarakum wrote:

 Thanks again Arun, you've saved me again. :-)

 This is a great starting point for CPU and possibly memory.

 For the IOPS, I'd just like to ask whether the tasktracker/datanode
 collects the number, or whether we should dig into the OS level, e.g.
 /proc/PID_OF_tt/io (hope this makes sense).

 -P

 On Tue, Dec 20, 2011 at 1:22 PM, Arun C Murthy a...@hortonworks.com wrote:
 Take a look at the JobHistory files produced for each job.

 With 0.20.205 you get CPU (slot millis).
 With 0.23 (alpha quality) you get CPU and JVM metrics (GC etc.). I believe 
 you also get Memory, but not IOPS.

 Arun

 On Dec 20, 2011, at 1:11 PM, Patai Sangbutsarakum wrote:

 Thanks for the reply, but I don't think the metrics exposed to Ganglia are
 what I am really looking for.

 What I am looking for is something like this (but not limited to):

 Job__
 CPU time: 10204 sec.   -- aggregated from all tasknodes
 IOPS: 2344             -- aggregated from all datanodes
 MEM: 30G               -- aggregated

 etc,

 Job_aaa_bbb
 CPU time:
 IOPS:
 MEM:

 Sorry for ambiguous question.
 Thanks

 On Tue, Dec 20, 2011 at 12:47 PM, He Chen airb...@gmail.com wrote:
 You may need Ganglia. It is cluster-monitoring software.

 On Tue, Dec 20, 2011 at 2:44 PM, Patai Sangbutsarakum 
 silvianhad...@gmail.com wrote:

 Hi Hadoopers,

 We're running Hadoop 0.20 on CentOS 5.5. I am trying to find a way to
 collect the CPU time, memory usage and IOPS of each Hadoop job.
 What would be a good starting point? A document? An API?

 Thanks in advance
 -P





Re: Distributed sorting using Hadoop

2011-11-29 Thread Chris Smith
Madhu,

Try working your way through the MapReduce tutorial here:
http://hadoop.apache.org/common/docs/r0.20.205.0/mapred_tutorial.html#Example%3A+WordCount+v1.0
It covers most of the concepts you require to do a distributed sort.

Search for the word "combiner" in the tutorial to understand how to combine
results on the mapper side and so reduce cross-cluster traffic.

Also work your way through several of the tutorials and videos on
working with Hadoop - Google is your friend here.

Another good source on the general algorithms is Jimmy Lin's book,
referenced on this page:
http://www.umiacs.umd.edu/~jimmylin/book.html
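
If you just want to see a distributed sort run before writing your own, the
examples jar shipped with Hadoop includes one (the jar name varies by
release; the 0.20.205 naming is assumed below):

    # generate some random SequenceFile input, then sort it across the cluster
    hadoop jar $HADOOP_HOME/hadoop-examples-0.20.205.0.jar randomwriter /sort/in
    hadoop jar $HADOOP_HOME/hadoop-examples-0.20.205.0.jar sort /sort/in /sort/out
    # the maps emit key/value pairs unchanged; the framework's shuffle does
    # the sorting by key on the way to the reducers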

Regards,

Chris

On 26 November 2011 13:05, madhu_sushmi madhu_sus...@yahoo.com wrote:

 Hi,
 I need to implement distributed sorting using Hadoop. I am quite new to
 Hadoop and I am getting confused. If I want to implement merge sort, what
 should my map and reduce be doing? Should all the sorting happen on the
 reduce side?

 Please help. This is an urgent requirement. Please guide me.




Re: Running more than one secondary namenode

2011-10-12 Thread Chris Smith
Jorn,

If you've configured the Name Node to replicate its fsimage and edit log
to both NFS and the Secondary Name Node, and you regularly back up the
fsimage and edit logs, you would do better investing time in understanding
exactly how the Name Node builds up its internal database and how it
applies its edit logs; 'read the code, Luke'.
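
As a minimal sketch of that setup (0.20-era property names; paths and
hostname are examples), the replication is just a matter of listing more
than one directory, and the current image can also be pulled over HTTP:

    # in the site configuration:
    #   dfs.name.dir      = /data/1/dfs/name,/mnt/nfs/dfs/name   (local copy plus NFS copy)
    #   fs.checkpoint.dir = /data/1/dfs/namesecondary            (secondary's checkpoint dir)
    # periodic off-cluster backup of the current fsimage:
    curl -o fsimage.backup "http://namenode.example.com:50070/getimage?getimage=1"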

Then, if you really want to be prepared, you can produce some test
scenarios by applying a corruption (one that the Name Node can't handle
automatically) to the fsimage or edit logs on a sacrificial system (a VM?)
and see if you can recover from it.  That way, if you ever get hit with a
Name Node corruption you'll be in a much better place to recover most or
all of your data.

Even with the best setup, corruption can happen if you hit a 'corner case'
scenario.

Chris

On 12 October 2011 08:50, Jorn Argelo - Ephorus jorn.arg...@ephorus.com wrote:
 Hi all,



 I was wondering if there are any (technical) issues with running two
 secondary namenodes on two separate servers rather than running just
 one. Since basically everything stands or falls with a consistent
 snapshot of the namenode fsimage, I was considering running two secondary
 namenodes for additional resilience. Has this been done before, or am I
 being too paranoid? Are there any caveats with doing this?



 Thanks,



 Jorn




Re: Block Size

2011-09-29 Thread Chris Smith
On 29 September 2011 18:39, lessonz less...@q.com wrote:
 I'm new to Hadoop, and I'm trying to understand the implications of a 64M
 block size in the HDFS. Is there a good reference that enumerates the
 implications of this decision and its effects on files stored in the system
 as well as map-reduce jobs?

 Thanks.


Good explanation of HDFS here:
http://hadoop.apache.org/common/docs/current/hdfs_design.html

In a nutshell, MapReduce moves the computation to the node that hosts
the data (block).

As there is an overhead in the startup/teardown of each task, you want to
make sure it has a reasonable amount of data to process, hence the
default block size of 64MB. Quite a few users run at larger block sizes,
either because it's more efficient for their algorithms or to reduce
the overhead on the Name Node (more blocks = more metadata to hold in
its in-memory database).
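
If you want to experiment, the block size can also be set per file at write
time (dfs.block.size is the 0.20-era property name, in bytes; the file and
paths below are examples):

    # write a file with a 128 MB block size instead of the 64 MB default
    hadoop fs -D dfs.block.size=134217728 -put bigfile.dat /user/me/bigfile.dat
    # see how it was split into blocks
    hadoop fsck /user/me/bigfile.dat -files -blocks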

Hope that helps.

Chris


Re: Why inter-rack communication in mapreduce slow?

2011-06-06 Thread Chris Smith
Elton,

Rapleaf's blog has an interesting posting on their experience that's
worth a read:  
http://blog.rapleaf.com/dev/2010/08/26/analyzing-some-interesting-networks-for-mapreduce-clusters/

And if you want to get an idea of the interaction between CPU, disk
and network, there's nothing like a picture: see slide 9 in this deck, of
a very simple Terasort Map/Reduce job.  Obviously the real world is
very different, but individual Map/Reduce jobs follow a similar
pattern.

Even doubling the node network performance in this simple example
would not get you much performance improvement as the job is CPU bound
for 50% of the time and only uses the network for roughly 10% of the
remaining time.

Chris

On 6 June 2011 16:42, Michael Segel michael_se...@hotmail.com wrote:

 Well the problem is pretty basic.

 Take your typical 1 GbE switch with 42 ports.
 Each port is capable of doing 1 Gb/s in each direction across the switch's
 fabric.
 Depending on your hardware, that's a fabric of roughly 40 Gb/s, shared.

 Depending on your hardware, you are usually using 1 or maybe 2 ports to
 'trunk' to your network's back plane. (To keep this simple, let's just say
 that it's a 1-2 GbE 'trunk' to your next rack.)
 So you end up with 1 Gb/s of traffic from each node trying to communicate
 with another node on the next rack.  So if that's 20 nodes per rack and
 they all want to communicate... you end up with 20 Gb/s (each direction)
 trying to fit through a 1-2 Gb/s pipe.
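
 Putting numbers on that oversubscription (a back-of-the-envelope sketch,
 not a benchmark):

     # 20 nodes each pushing 1 Gb/s toward a 2 Gb/s inter-rack trunk
     nodes=20; node_gbps=1; trunk_gbps=2
     echo "oversubscription $(( nodes * node_gbps / trunk_gbps )):1"   # prints 10:1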

 Think of rush hour in Chicago, or worse, rush hour in Atlanta where people
 don't know how to drive. :-P

 The quick fix... spend the $8-10K per switch to get a ToR switch with
 10+ GbE uplink capabilities (usually 4 ports). Then you have at least
 10 GbE of uplink per rack.

 JMHO

 -Mike



 To: common-user@hadoop.apache.org
 Subject: Re: Why inter-rack communication in mapreduce slow?
 Date: Mon, 6 Jun 2011 11:00:05 -0400
 From: dar...@ontrenet.com


 IMO, that's right, because map/reduce/hadoop was originally designed for
 that kind of text-processing purpose (i.e. few stages, low dependency,
 highly parallel).

 It's when one tries to solve general-purpose algorithms of modest
 complexity that map/reduce gets into I/O churning problems.

 On Mon, 6 Jun 2011 23:58:53 +1000, elton sky eltonsky9...@gmail.com
 wrote:
  Hi John,
 
  Because for map tasks, the job tracker tries to assign them to local data
  nodes, so there's not much network traffic.
  Then the only potential issue will be, as you said, the reducers, which
  copy data from all maps.
  So in other words, if the application only creates small intermediate
  output, e.g. grep or wordcount, this jam between racks is not likely to
  happen, is it?
 
 
  On Mon, Jun 6, 2011 at 11:40 PM, John Armstrong
  john.armstr...@ccri.comwrote:
 
  On Mon, 06 Jun 2011 09:34:56 -0400, dar...@ontrenet.com wrote:
   Yeah, that's a good point.
  
   I wonder, though, what the load on the tracker nodes (ports et al.)
   would be if an inter-rack fibre switch at tens of Gb/s is getting
   maxed out.
  
   Seems to me that if there is that much traffic being migrated across
   racks, the tracker node (or whatever node it is) would overload
   first?
 
  It could happen, but I don't think it always would.  For example, the
  tracker is on rack A; sees that the best place to put reducer R is on
  rack B; sees the reducer still needs a few hellabytes from mapper M on
  rack C; tells M to send data to R; switches on B and C get throttled,
  leaving A free to handle other things.
 
  In fact, it almost makes me wonder if an ideal setup is not only to
  have each of the main control daemons on their own nodes, but to put
  THOSE nodes on their own rack and keep all the data elsewhere.
 



Re: tips and tools to optimize cluster

2011-05-24 Thread Chris Smith
Worth a look at OpenTSDB ( http://opentsdb.net/ ) as it doesn't lose
precision on the historical data.
It also has some neat tricks around the collection and display of data.

Another useful tool is 'collectl' ( http://collectl.sourceforge.net/ ),
a lightweight Perl script that captures and compresses the metrics,
manages its metrics data files, and then filters and presents the metrics
as requested.

I find collectl lightweight and useful enough that I set it up to capture
everything and then leave it running in the background on most systems I
build, because when you need the measurement data the event is usually in
the past and difficult to reproduce.  With collectl running I have a week
to recognise the event and analyse/save the relevant data file(s); the
data files are approx. 21MB/node/day gzipped.

With a little bit of bash, awk or perl scripting you can convert the
collectl output into a form easily loadable into Pig.  Pig also has User
Defined Functions (UDFs) that can import the Hadoop job history, so with
some Pig Latin you can marry your infrastructure metrics with your job
metrics; a bit like the cluster eating its own dog food.
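
As an illustration only (collectl's exact column layout depends on the
options used, so treat the flags and field numbers as placeholders), the
conversion can be as small as an awk one-liner plus a Pig LOAD:

    # flatten a collectl playback into tab-separated date, time, cpu-user, cpu-sys
    collectl -p node01-20110524.raw.gz -sc -P | \
      awk '{ print $1 "\t" $2 "\t" $3 "\t" $4 }' > cpu.tsv
    hadoop fs -put cpu.tsv /metrics/cpu.tsv
    # then in Pig:
    #   metrics = LOAD '/metrics/cpu.tsv' AS (d:chararray, t:chararray, usr:int, sys:int);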

BTW, watch out for a little gotcha with Ganglia.  It doesn't seem to
report the full JVM metrics via gmond, although if you output the JVM
metrics to file you get a record for each JVM on the node.  I haven't
looked into it in detail yet, but it looks like Ganglia only reports the
last JVM record in each batch. Anyone else seen this?

Chris

On 24 May 2011 01:48, Tom Melendez t...@supertom.com wrote:
 Hi Folks,

 I'm looking for tips, tricks and tools to get at node utilization to
 optimize our cluster.  I want to answer questions like:
 - what nodes ran a particular job?
 - how long did it take for those nodes to run the tasks for that job?
 - how/why did Hadoop pick those nodes to begin with?

 More detailed questions like
 - how much memory did the task for the job use on that node?
 - average CPU load on that node during the task run

 And more aggregate questions like:
 - are some nodes favored more than others?
 - utilization averages (generally, how many cores on that node are in use, 
 etc.)

 There are plenty more that I'm not asking, but you get the point?  So,
 what are you guys using for this?

 I see some mentions of Ganglia, so I'll definitely look into that.
 Anything else?  Anything you're using to monitor in real-time (like a
 'top' across the nodes or something like that)?

 Any info or war-stories greatly appreciated.

 Thanks,

 Tom



Re: the question of hadoop

2010-09-08 Thread Chris Smith
2010/9/6 褚 鵬兵 chu_pengb...@hotmail.com:

 Hi, my Hadoop friends. I have three questions about Hadoop:

 1. The speed between the datanodes. With terabytes of data on one
 datanode, data transfers from one datanode to another; if that speed is
 bad, I think Hadoop will be slow. I have heard of the gNet architecture
 in Greenplum - what about Hadoop? Is SAS storage plus Gigabit Ethernet
 the best answer?

 2. The GUI tool. There is a Hive web tool in Hadoop, but it is not enough
 for our business work - it is too simple. If Hadoop + Hive is designed to
 be a DWH, then how should users work with it? Via a CGI/command tool, or
 via a newly developed web GUI tool?

 3. A 5-computer Hadoop cluster vs. 1 computer running SQL Server 2000.
 The 5 Hadoop computers: Celeron 2.66G, 1G memory, Ethernet, namenode +
 secondarynamenode + 3 datanodes. The 1 SQL Server 2000 computer: Celeron
 2.66G, 1G memory. I ran the same select operation over the same 100M of
 data: the 5-computer Hadoop cluster took 2 mins 30 secs, and the single
 SQL Server 2000 machine took 2 mins 25 secs. The result is that the
 5-computer Hadoop cluster is no better. Why? Can anyone give me some
 advice?

 Thanks in advance.


Why use Hadoop in preference to a database?

At the recent Hadoop User Group (UK) meeting, Andy Kemp from
http://www.forward.co.uk/ presented their experience in moving from a
MySQL database approach to Hadoop.

From my notes of his talk their system manages 120 million keywords
and is updated at a rate of 20GB/day.

They originally used a sharded MySQL database but found it couldn't
scale to handle the types of queries their users required, e.g. "Can
you cluster 17(?) million keyword phrases into thematic groups?".
Their calculations indicated that the database approach would take
more than a year to handle such a query.

Moving to a cluster of 100 Hadoop nodes on Amazon EC2 reduced this
time down to 7 hours.  The issues then became one of the costs of
storage and moving the data to and from the cluster.

They then moved to a private VM system with about 30 VMs - I assume
the processing took the same time as I didn't note this down.

From there they then moved to dedicated hardware, 5 dedicated Hadoop
nodes, and achieved better performance than the 30 VMs.

Andy's talk, "Hadoop in Context", should be available as a podcast here
http://skillsmatter.com/podcast/cloud-grid/hadoop-in-context and would
be well worth watching, but when I last looked it hadn't been
uploaded yet.

At the same event, Ian Broadhead, from http://www.playfish.com/ gave a
talk on managing the activity of over 1 million active Internet gamers
producing over 50GB of data a day.  Their original MySQL system took
up to 50 times longer to process their data load than an EC2 cluster
of Hadoop nodes.  He talked about a typical workload being reduced
from 2-3 days (using MySQL) down to 6 hours (using Hadoop).
Unfortunately I don't think Ian's talk will appear as a podcast.

However, most presentations during the evening made a point that
Hadoop didn't completely replace their databases, just provided a
convenient way to rapidly process large volumes of data, the output
from Hadoop processing typically being stored in databases to satisfy
general everyday business queries.

I think the common theme here was that all of these users had large
datasets of the order of 100s of GBs, with multiple views of that data,
that handled of the order of 10s of millions of updates a day.

I hope that helps.

Chris


RE: Question about disk space allocation in hadoop

2010-06-30 Thread Chris Smith
Some thoughts on how to restrict the temporary data, but I have only
tried (a) in anger:

a)    Partition your disks into HDFS and intermediate temp partitions
of the relevant size.  This gives a fixed separation but is
difficult/impossible to modify on a busy cluster especially as there
may be no way of unloading/recovering the data stored in HDFS if you
make a mistake resizing partitions;

b)  Implement disk quotas and set relevant hard and soft limits on
the relevant root directories for intermediate space. This gives you
the flexibility to change the limits when required but as the limits
are per user/group some thought may be required as to which user/group
the limits apply to. There may also be a performance impact?

You could combine this with setting the "dfs.datanode.du.reserved" value
in $HADOOP_HOME/conf/hdfs-site.xml to limit HDFS disk usage.

c)  Implement the intermediate data space as a loopback file; see:
http://wiki.cita.utoronto.ca/mediawiki/index.php/Fake_Fast_Local_Disk
That example implements a temporary loopback filesystem on an
iSCSI-mounted Lustre filesystem, but the principles are the same. There
are some performance benchmarks linked to in section 3. The intermediate
temp data space is limited by the size of the loopback file created.
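
A minimal sketch of (c) on a plain local disk (sizes and paths are
examples; the linked page has the full recipe):

    # create a ~50 GB sparse file, make a filesystem in it, and mount it
    dd if=/dev/zero of=/data/1/mapred_loop.img bs=1M count=1 seek=51199
    mkfs.ext3 -F /data/1/mapred_loop.img
    mkdir -p /data/1/mapred_local
    mount -o loop /data/1/mapred_loop.img /data/1/mapred_local
    # point mapred.local.dir at /data/1/mapred_local in mapred-site.xml;
    # intermediate data can then never grow beyond the 50 GB image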

Chris

-Original Message-
From: Yu Li [mailto:car...@gmail.com]
Sent: 30 June 2010 04:11
To: common-user@hadoop.apache.org
Subject: Re: Question about disk space allocation in hadoop

Hi all,

Does anybody have experience with this? Any comments/suggestions would be
highly appreciated. Thanks.

Best Regards,
Carp

2010/6/29 Yu Li car...@gmail.com:
 Hi all,

 As we all know, machines in a Hadoop cluster may be both datanode and
 tasktracker, so one machine may store both MR job intermediate data
 and HDFS data. My question is: if we have more than one disk per node,
 say 4 disks, and would like both the job intermediate data and the HDFS
 data spread across all disks to reduce the IO load on each single disk,
 can we draw a line between the space used by the local FS and by HDFS?
 For example, restrict the intermediate temp data to no more than 25% of
 the space on each disk?
 Thanks in advance.

 Best Regards,
 Carp