Re: Avro, Hadoop0.20.2, Jackson Error

2012-03-26 Thread Scott Carey
Does it still happen if you configure avro-tools to use

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-tools</artifactId>
  <version>1.6.3</version>
  <classifier>nodeps</classifier>
</dependency>


?

You have two Hadoops, two Jacksons, and even two avro:avro artifacts in
your classpath if you use the avro bundle jar with the default classifier.

The avro-tools jar is not intended for inclusion in a project, as it is a jar
with its dependencies bundled inside.
https://cwiki.apache.org/confluence/display/AVRO/Build+Documentation#BuildDocumentation-ProjectStructure
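
If you want to confirm which copy wins at runtime, a quick check along these
lines (a hedged sketch, assuming Jackson 1.x class names; not part of the
original thread) prints the jar that actually supplied JsonFactory inside the JVM:

import org.codehaus.jackson.JsonFactory;

public class WhichJackson {
  public static void main(String[] args) {
    // Prints the location (jar or directory) the JsonFactory class was loaded
    // from, which tells you whether the Hadoop lib copy or your Maven copy wins.
    System.out.println(JsonFactory.class.getProtectionDomain()
        .getCodeSource().getLocation());
  }
}

Running the same check from inside a map task (e.g. logged from your Mapper)
would show whether the task classpath differs from your client classpath.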

On 3/26/12 7:52 PM, Deepak Nettem deepaknet...@gmail.com wrote:

When I include some Avro code in my Mapper, I get this error:

Error:
org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;

Particularly, just these two lines of code:

InputStream in = getClass().getResourceAsStream("schema.avsc");
Schema schema = Schema.parse(in);

This code works perfectly when run as a standalone application outside of
Hadoop. Why do I get this error, and what's the best way to get rid of it?

I am using Hadoop 0.20.2, and writing code in the new API.

I found that the Hadoop lib directory contains jackson-core-asl-1.0.1.jar
and jackson-mapper-asl-1.0.1.jar.

I removed these, but got this error:
hadoop Exception in thread "main" java.lang.NoClassDefFoundError: org/codehaus/jackson/map/JsonMappingException

I am using Maven as a build tool, and my pom.xml has this dependency:

<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-mapper-asl</artifactId>
  <version>1.5.2</version>
  <scope>compile</scope>
</dependency>




I added the dependency:


<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-core-asl</artifactId>
  <version>1.5.2</version>
  <scope>compile</scope>
</dependency>

But that still gives me this error:

Error: org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;

-

I also tried replacing the earlier dependencies with these:

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-tools</artifactId>
  <version>1.6.3</version>
</dependency>

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.6.3</version>
</dependency>


<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-mapper-asl</artifactId>
  <version>1.8.8</version>
  <scope>compile</scope>
</dependency>

<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-core-asl</artifactId>
  <version>1.8.8</version>
  <scope>compile</scope>
</dependency>

And this is my app dependency tree:

[INFO] --- maven-dependency-plugin:2.1:tree (default-cli) @ AvroTest ---
[INFO] org.avrotest:AvroTest:jar:1.0-SNAPSHOT
[INFO] +- junit:junit:jar:3.8.1:test (scope not updated to compile)
[INFO] +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
[INFO] +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile
[INFO] +- net.sf.json-lib:json-lib:jar:jdk15:2.3:compile
[INFO] |  +- commons-beanutils:commons-beanutils:jar:1.8.0:compile
[INFO] |  +- commons-collections:commons-collections:jar:3.2.1:compile
[INFO] |  +- commons-lang:commons-lang:jar:2.4:compile
[INFO] |  +- commons-logging:commons-logging:jar:1.1.1:compile
[INFO] |  \- net.sf.ezmorph:ezmorph:jar:1.0.6:compile
[INFO] +- org.apache.avro:avro-tools:jar:1.6.3:compile
[INFO] |  \- org.slf4j:slf4j-api:jar:1.6.4:compile
[INFO] +- org.apache.avro:avro:jar:1.6.3:compile
[INFO] |  +- com.thoughtworks.paranamer:paranamer:jar:2.3:compile
[INFO] |  \- org.xerial.snappy:snappy-java:jar:1.0.4.1:compile
[INFO] \- org.apache.hadoop:hadoop-core:jar:0.20.2:compile
[INFO]    +- commons-cli:commons-cli:jar:1.2:compile
[INFO]    +- xmlenc:xmlenc:jar:0.52:compile
[INFO]    +- commons-httpclient:commons-httpclient:jar:3.0.1:compile
[INFO]    +- commons-codec:commons-codec:jar:1.3:compile
[INFO]    +- commons-net:commons-net:jar:1.4.1:compile
[INFO]    +- org.mortbay.jetty:jetty:jar:6.1.14:compile
[INFO]    +- org.mortbay.jetty:jetty-util:jar:6.1.14:compile
[INFO]    +- tomcat:jasper-runtime:jar:5.5.12:compile
[INFO]    +- tomcat:jasper-compiler:jar:5.5.12:compile
[INFO]    +- org.mortbay.jetty:jsp-api-2.1:jar:6.1.14:compile
[INFO]    +- org.mortbay.jetty:jsp-2.1:jar:6.1.14:compile
[INFO]    |  \- ant:ant:jar:1.6.5:compile
[INFO]    +- commons-el:commons-el:jar:1.0:compile
[INFO]    +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
[INFO]    +- org.mortbay.jetty:servlet-api-2.5:jar:6.1.14:compile
[INFO]    +- net.sf.kosmosfs:kfs:jar:0.3:compile
[INFO]    +- hsqldb:hsqldb:jar:1.8.0.10:compile
[INFO]    +- oro:oro:jar:2.0.8:compile
[INFO]    \- org.eclipse.jdt:core:jar:3.1.1:compile

I still get the same error.

Somebody please please help me with this. I need to resolve this asap!!

Best,
Deepak



Re: HDFS Backup nodes

2011-12-14 Thread Scott Carey


On 12/13/11 11:00 PM, M. C. Srivas mcsri...@gmail.com wrote:

Suresh,

As of today, there is no option except to use NFS.  And as you yourself
mention, the first HA prototype when it comes out will require NFS.

How will it 'require' NFS?  Won't any 'remote, high availability storage'
work?  NFS is unreliable in my experience unless:
* Its a Netapp
* Its based on Solaris
(caveat: I have only used 5 NFS solution types over the last decade, and
the issues are not data integrity, rather availability from a client
perspective)


A solution with a brief 'stall' in service while a SAN mount switches over,
or similar with drbd, should be possible and data safe. If this is being
built to truly 'require' NFS, that is no better for me than the current
situation, which we manage using OS level tools for failover that will
temporarily break clients but resume availability quickly thereafter.
Where I would like the most help from hadoop is in making the failover
transparent to clients, not in solving the reliable storage problem or
failover scenarios that Storage and OS vendors do.


(a) I wasn't aware that Bookkeeper had progressed that far. I wonder
whether it would be able to keep up with the data rates that are required
in order to hold the NN log without falling behind.

(b) I do know Karthik Ranga at FB just started a design to put the NN data
in HDFS itself, but that is in very preliminary design stages with no real
code there.

The problem is that the HA code written with NFS in mind is very different
from the HA code written with HDFS in mind, which are both quite different
from the code that is written with Bookkeeper in mind. Essentially the
three options will form three different implementations, since the failure
modes of each of the back-ends are different. Am I totally off base?

thanks,
Srivas.




On Tue, Dec 13, 2011 at 11:00 AM, Suresh Srinivas
sur...@hortonworks.com wrote:

 Srivas,

 As you may know already, NFS is just being used in the first prototype
for
 HA.

 Two options for editlog store are:
 1. Using BookKeeper. Work has already completed on trunk towards this.
This
 will replace need for NFS to  store the editlogs and is highly
available.
 This solution will also be used for HA.
 2. We have a short term goal also to enable editlogs going to HDFS
itself.
 The work is in progress.

 Regards,
 Suresh


 
  -- Forwarded message --
  From: M. C. Srivas mcsri...@gmail.com
  Date: Sun, Dec 11, 2011 at 10:47 PM
  Subject: Re: HDFS Backup nodes
  To: common-user@hadoop.apache.org
 
 
  You are out of luck if you don't want to use NFS, and yet want
redundancy
  for the NN.  Even the new NN HA work being done by the community
will
  require NFS ... and the NFS itself needs to be HA.
 
  But if you use a Netapp, then the likelihood of the Netapp crashing is
  lower than the likelihood of a garbage-collection-of-death happening
in
 the
  NN.
 
  [ disclaimer:  I don't work for Netapp, I work for MapR ]
 
 
  On Wed, Dec 7, 2011 at 4:30 PM, randy randy...@comcast.net wrote:
 
   Thanks Joey. We've had enough problems with nfs (mainly under very
high
   load) that we thought it might be riskier to use it for the NN.
  
   randy
  
  
   On 12/07/2011 06:46 PM, Joey Echeverria wrote:
  
   Hey Rand,
  
   It will mark that storage directory as failed and ignore it from
then
   on. In order to do this correctly, you need a couple of options
   enabled on the NFS mount to make sure that it doesn't retry
   infinitely. I usually run with the
tcp,soft,intr,timeo=10,retrans=10
   options set.
  
   -Joey
  
   On Wed, Dec 7, 2011 at 12:37 PM,randy...@comcast.net  wrote:
  
   What happens then if the nfs server fails or isn't reachable? Does
 hdfs
   lock up? Does it gracefully ignore the nfs copy?
  
   Thanks,
   randy
  
   - Original Message -
   From: Joey Echeverriaj...@cloudera.com
   To: common-user@hadoop.apache.org
   Sent: Wednesday, December 7, 2011 6:07:58 AM
   Subject: Re: HDFS Backup nodes
  
    You should also configure the Namenode to use an NFS mount for one of
    its storage directories. That will give the most up-to-date backup of
    the metadata in case of total node failure.
  
   -Joey
  
   On Wed, Dec 7, 2011 at 3:17 AM, praveenesh kumar
 praveen...@gmail.com
wrote:
  
    This means we are still relying on the Secondary NameNode ideology for
    the Namenode's backup.
    Is OS-mirroring of the Namenode a good alternative to keep it alive all
    the time?
  
   Thanks,
   Praveenesh
  
   On Wed, Dec 7, 2011 at 1:35 PM, Uma Maheswara Rao G
    mahesw...@huawei.com wrote:
  
 AFAIK the backup node was introduced from version 0.21 onwards.
    ____
   From: praveenesh kumar [praveen...@gmail.com]
   Sent: Wednesday, December 07, 2011 12:40 PM
   To: common-user@hadoop.apache.org
   Subject: HDFS Backup nodes
  
   Does hadoop 0.20.205 supports configuring HDFS backup nodes ?
  
   Thanks,
   Praveenesh
  
  
  
  
   --
   Joseph 

Re: More cores Vs More Nodes ?

2011-12-14 Thread Scott Carey


On 12/14/11 9:05 AM, Michael Segel michael_se...@hotmail.com wrote:



Brian,

I think you missed my point.

The moment you go and design a cluster for a specific job, you end up
getting fscked because there's another group who wants to use the shared
resource for their job which could be orthogonal to the original purpose.
It happens everyday.

This is why you have to ask if the cluster is being built for a specific
purpose. Meaning answering the question 'Which of the following best
describes your cluster:
a) PoC
b) Development
c) Pre-prod
d) Production
e) Secondary/Backup


Note that sizing the cluster is a different matter.
Meaning if you know you need a PB of storage, you're going to design the
cluster differently because once you get to a certain size, you have to
recognize that your clusters are going to have lots of disk, require
10GBe just for the storage. Number of cores would be less of an issue,
however again look at pricing. 2 socket 8 core Xeon MBs are currently at
an optimal price point.

Recently, single socket servers have been 9 to 12 months ahead of the
curve on next generation processor availability.

I found 1 socket quad core Xeon a better value because a single socket 4
core system performs at the CPU level of ~5.5 cores of a dual socket
system due to faster Ghz and newer generation processors on the single
socket system -- At least earlier this year.   Sandy Bridge is finally
moving to dual socket.   Single socket quad core Xeon at 3.4Ghz is much
more than half as capable as dual socket quad @2.66Ghz.

1 socket versus 2 is a moving target.

In our case, we had a $ budget and a low power/rack capacity.   We
compared what we could get for various designs in terms of:

aggregate CPU  (CPU core count * Ghz)
aggregate Memory bandwidth
aggregate RAM
aggregate Disk capacity
aggregate network throughput

And chose the single socket, 1U system based on our constraints and what
we could get with a variety of designs (all single socket or dual socket,
1U and 2U nodes, 4 to 12 drives / node).   We had a range of acceptable
Storage to CPU ratio, CPU to RAM ratio, and network to storage ratio.
With fewer CPU we had fewer disk and less RAM per machine, but more total
servers.  This was also influenced by availability concerns -- the more
disk per node, the faster your network per node needs to be in order to
replicate on a failure.  Smaller servers meant significantly cheaper
network since bonded 1Gb link pairs were good enough.

Given various constraints and needs different organizations will find
different sweet spots.  And given the hardware available at the time, the
sweet spot moves as well.

 

And again this goes back to the point I was trying to make.
You need to look beyond the number of cores as a determining factor.
You go too small, you're going to take a hit because of the
price/performance curve.
(Remember that you have to consider Machine Room real estate. 100 2 core
boxes take up much more space than 25 8 core boxes)

If you go to the other extreme... a 64 core giant SMP box costs $$$, while for
$ (less money) you could build out an 8 node cluster.

Beyond that, you really, really don't want to build a custom cluster for
a specific job unless you know that you're going to be running that
specific job or set of jobs (24x7X365) [And yes, I came across such a use
case...]

HTH

-Mike
 From: bbock...@cse.unl.edu
 Subject: Re: More cores Vs More Nodes ?
 Date: Wed, 14 Dec 2011 07:41:25 -0600
 To: common-user@hadoop.apache.org
 
 Actually, there are varying degrees here.
 
 If you have a successful project, you will find other groups at your
door wanting to use the cluster too.  Their jobs might be different from
the original use case.
 
 However, if you don't understand the original use case (CPU heavy or
storage heavy? is a great beginning question), your original project
won't be successful.  Then there will be no follow-up users because you
failed.
 
 So, you want to have a reasonably general-purpose cluster, but make
sure it matches well with the type of jobs.  As an example, we had one
group who required an estimated CPU-millennia per byte of data... they
needed a "general purpose" cluster for a certain value of "general
purpose".
 
 Brian
 
 On Dec 14, 2011, at 7:29 AM, Michael Segel wrote:
 
  
  Aw Tommy, 
  Actually no. You really don't want to do this.
  
  If you actually ran a cluster and worked in the real world, you would
find that if you purposely build a cluster for one job, there will be a
mandate that some other group needs to use the cluster and that their
job has different performance issues and your cluster is now suboptimal
for their jobs...
  
  Perhaps you meant that you needed to think about the purpose of the
cluster? That is do you want to minimize the nodes but maximize the disk
space per node and use the cluster as your backup cluster? (Assuming
that you are considering your DR and BCP in your design.)
  
  The problem with your answer, is that a job has a specific meaning
within 

Re: Read() block mysteriously when using big BytesPerChecksum size

2010-11-10 Thread Scott Carey

On Oct 7, 2010, at 2:35 AM, elton sky wrote:

 Hello experts,
 
 I was benchmarking sequential write throughput of HDFS.
 
 For testing the effect of bytesPerChecksum (bpc) size on write performance, I am
 using different bpc size: 2M, 256K, 32K, 4K, 512B.
 
 My cluster has 1 name node and 5 data nodes. They are xen VMs and each of
 them is configured with a 56MB/s duplex ethernet connection.
 
 I try to create a 10G file with different bpc. When bpc is 2M, the
 throughput drops dramatically compared with others:
 
 time(ms): 333008  bpc: 2M
 
 time(ms): 234180  bpc: 256K
 
 time(ms): 223737  bpc: 32K
 
 time(ms): 228842  bpc: 4K
 
 time(ms): 228238  bpc: 512
 
 After digging into the source, I found the problem happens on the data nodes.
 In org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket():
 
 private int readNextPacket() throws IOException {
 ...
 
 while (buf.remaining() < SIZE_OF_INTEGER) {

   if (buf.position() > 0) {
     shiftBufData();
   }

   readToBuf(-1); // this line takes 30ms or more for each packet before returns
 }
 ...
 
 while (toRead > 0) { // this loop also takes around 30 ms
   toRead -= readToBuf(toRead);
 }
 ...
 }
 
 private long readToBufTime(int toRead) throws IOException {
 ...
 
 int nRead = in.read(buf.array(), buf.limit(), toRead); // this is the line that actually causes the delay
 ...
 
 }
 
 The in.read() takes around 30ms waiting for data before it returns, and
 when it returns it reads only a few KB of data.  The while loop that comes
 later takes a similar time to finish, reading the rest (2MB minus the few
 KB read before).

 I couldn't understand the reason for the pause in in.read(). Why does the
 data node need to wait?  Why is the data not available then?

It is probably waiting on disk or network.
  Why this happens when using
 big bpc?
 

Linux tends to asynchronously 'read-ahead' from disks if sequential access is 
detected in a file.  The default is to read-ahead in chunks of up to 128K.  You 
can change this on a per device level with blockdev --setra (google it).
Since Hadoop fetches data in a synchronous loop, it loses the benefit of the OS 
asynchronous read-ahead past 128K unless you change that setting.

I recommend a readahead value of ~2MB for today's SATA drives if you need top 
sequential access performance from linux.  This would look something like this 
for 2MB:

# blockdev --setra 4096 /dev/sda
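
As a rough way to see the effect outside of Hadoop, a minimal sketch like the
following (not from the original mail; the file path and 2MB chunk size are
just illustrative) times synchronous sequential reads in bpc-sized chunks,
which is roughly the access pattern a large bpc produces on the datanode:

import java.io.FileInputStream;
import java.io.IOException;

public class SequentialReadTimer {
  public static void main(String[] args) throws IOException {
    String path = args.length > 0 ? args[0] : "/tmp/testfile"; // hypothetical test file
    int chunk = 2 * 1024 * 1024; // 2MB, like the large bpc case
    byte[] buf = new byte[chunk];
    long start = System.nanoTime();
    long total = 0;
    FileInputStream in = new FileInputStream(path);
    try {
      int n;
      // Read the file in one synchronous loop, as the datanode does per packet.
      while ((n = in.read(buf, 0, chunk)) > 0) {
        total += n;
      }
    } finally {
      in.close();
    }
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("read %d bytes in %.2f s (%.1f MB/s)%n",
        total, secs, total / secs / 1e6);
  }
}

Running it against a large file before and after changing the readahead value
(and dropping the page cache between runs) should show whether readahead is the
limiting factor.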


 any idea will be appreciated!



Re: Hadoop performance - xfs and ext4

2010-05-11 Thread Scott Carey
Did you try the XFS 'allocsize' mount parameter (for example, allocsize=8m)?  
This will reduce fragmentation during concurrent writes.   
It's more complicated, but using separate partitions for temp space versus HDFS
also has an effect.  XFS isn't as good with the temp space.

In short, a single test with default configurations is useful, but doesn't 
complete the picture.  Both file systems have several important tuning knobs.


On Apr 22, 2010, at 1:02 AM, stephen mulcahy wrote:

 Hi,
 
 I've been tweaking our cluster roll-out process to refine it. While 
 doing so, I decided to check if XFS gives any performance benefit over EXT4.
 
 As per a comment I read somewhere on the hbase wiki - XFS makes for 
 faster formatting of filesystems (it takes us 5.5 minutes to rebuild a 
 datanode from bare metal to a full Hadoop config on top of Debian 
 Squeeze using XFS) versus EXT4 (same bare metal restore takes 9 minutes).
 
 However, TeraSort performance on a cluster of 45 of these data-nodes 
 shows XFS is slower (same configuration settings on both installs other 
 than changed filesystem), specifically,
 
 mkfs.xfs -f -l size=64m DEV
 (mounted with noatime,nodiratime,logbufs=8)
 gives me a cluster which runs TeraSort in about 23 minutes
 
 mkfs.ext4 -T largefile4 DEV
 (mounted with noatime)
 gives me a cluster which runs TeraSort in about 18.5 minutes
 
 So I'll be rolling our cluster back to EXT4, but thought the information 
 might be useful/interesting to others.
 
 -stephen
 
 
 XFS config chosen from notes at 
 http://everything2.com/index.pl?node_id=1479435
 
 -- 
 Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
 NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
 http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com



Re: Hadoop performance - xfs and ext4

2010-05-11 Thread Scott Carey
Ah, one more thing.  With XFS there is an online defragmenter -- it runs every 
night on my cluster.  Performance on a fresh, empty system will not match a 
used one that has become fragmented.


On Apr 22, 2010, at 1:02 AM, stephen mulcahy wrote:

 Hi,
 
 I've been tweaking our cluster roll-out process to refine it. While 
 doing so, I decided to check if XFS gives any performance benefit over EXT4.
 
 As per a comment I read somewhere on the hbase wiki - XFS makes for 
 faster formatting of filesystems (it takes us 5.5 minutes to rebuild a 
 datanode from bare metal to a full Hadoop config on top of Debian 
 Squeeze using XFS) versus EXT4 (same bare metal restore takes 9 minutes).
 
 However, TeraSort performance on a cluster of 45 of these data-nodes 
 shows XFS is slower (same configuration settings on both installs other 
 than changed filesystem), specifically,
 
 mkfs.xfs -f -l size=64m DEV
 (mounted with noatime,nodiratime,logbufs=8)
 gives me a cluster which runs TeraSort in about 23 minutes
 
 mkfs.ext4 -T largefile4 DEV
 (mounted with noatime)
 gives me a cluster which runs TeraSort in about 18.5 minutes
 
 So I'll be rolling our cluster back to EXT4, but thought the information 
 might be useful/interesting to others.
 
 -stephen
 
 
 XFS config chosen from notes at 
 http://everything2.com/index.pl?node_id=1479435
 
 -- 
 Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
 NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
 http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com



Re: Extremely slow HDFS after upgrade

2010-04-17 Thread Scott Carey
All links check out as full duplex gigabit with ifconfig and ethtool.
Ifconfig reports no dropped packets or retransmits.  A tcpdump shows no 
retransmits, but shows significantly smaller packets and fewer outstanding 
packets between acks.

But iperf in UDP mode shows a consistent 1.5% lost datagram rate in one 
direction -- I can transmit 900Mbits/sec + with 1.5% loss by udp over the 
flawed links, 0% loss the other way.  So it appears that the Linux tcp flow 
control is throttling back the window size due to these losses.  Time to have 
someone replace all the network cables.

Thanks for the ideas Todd,

-Scott

On Apr 16, 2010, at 8:25 PM, Todd Lipcon wrote:

 Checked link autonegotiation with ethtool? Sometimes gige will autoneg to
 10mb half duplex if there's a bad cable, NIC, or switch port.

 -Todd

 On Fri, Apr 16, 2010 at 8:08 PM, Scott Carey sc...@richrelevance.com wrote:

 More info -- this is not a Hadoop issue.

 The network performance issue can be replicated with SSH only on the links
 where Hadoop has a problem, and only in the direction with a problem.

 HDFS is slow to transfer data in certain directions from certain machines.

  So, for example, copying from node C to D may be slow, but not the other
  direction, from D to C.  Likewise, although only 3 of 8 nodes have this
 problem, it is not universal.  For example, node C might have trouble
 copying data to 5 of the 7 other nodes, and node G might have trouble with
 all 7 other nodes.

 No idea what it is yet, but SSH exhibits the same issue -- only in those
 specific point-to-point links in one specific direction.

 -Scott

 On Apr 16, 2010, at 7:10 PM, Scott Carey wrote:

 Ok, so here is a ... fun result.

 I have dfs.replication.min set to 2, so I can't just do
  hadoop fs -Ddfs.replication=1 -put someFile someFile
 Since that will fail.

 So here are two results that are fascinating:

 $ time hadoop fs -Ddfs.replication=3 -put test.tar test.tar
 real1m53.237s
 user0m1.952s
 sys 0m0.308s

 $ time hadoop fs -Ddfs.replication=2 -put test.tar test.tar
 real0m1.689s
 user0m1.763s
 sys 0m0.315s



 The file is 77MB and so is two blocks.
 The test with replication level 3 is slow about 9 out of 10 times.  When
 it is slow it sometimes is 28 seconds, sometimes 2 minutes.  It was fast one
 time...
 The test with replication level 2 is fast in 40 out of 40 tests.

 This is a development cluster with 8 nodes.

 It looks like the replication level of 3 or more causes trouble.  Looking
 more closely at the logs, it seems that certain datanodes (but not all)
 cause large delays if they are in the middle of an HDFS write chain.  So, a
  write that goes from A > B > C is fast if B is a good node and C a bad node.
  If it's A > C > B then it's slow.

 So, I can say that some nodes but not all are doing something wrong. when
 in the middle of a write chain.  If I do a replication = 2 write on one of
 these bad nodes, its always slow.

  So the good news is I can identify the bad nodes, and decommission them.
 The bad news is this still doesn't make a lot of sense, and 40% of the
 nodes have the issue.  Worse, on a couple nodes the behavior in the
 replication = 2 case is not consistent -- sometimes the first block is fast.
  So it may be dependent on not just the source, but the source and target
  combination in the chain.


 At this point, I suspect something completely broken at the network
 level, perhaps even routing.  Why it would show up after an upgrade is yet
 to be determined, but the upgrade did include some config changes and OS
 updates.

 Thanks Todd!

 -Scott


 On Apr 16, 2010, at 5:34 PM, Todd Lipcon wrote:

 Hey Scott,

 This is indeed really strange... if you do a straight hadoop fs -put
 with
 dfs.replication set to 1 from one of the DNs, does it upload slow? That
 would cut out the network from the equation.

 -Todd

 On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey sc...@richrelevance.com
 wrote:

 I have two clusters upgraded to CDH2.   One is performing fine, and the
 other is EXTREMELY slow.

 Some jobs that formerly took 90 seconds, take 20 to 50 minutes.

 It is an HDFS issue from what I can tell.

 The simple DFS benchmark with one map task shows the problem clearly.
 I
 have looked at every difference I can find and am wondering where else
 to
 look to track this down.
 The disks on all nodes in the cluster check out -- capable of 75MB/sec
 minimum with a 'dd' write test.
 top / iostat do not show any significant CPU usage or iowait times on
 any
 machines in the cluster during the test.
 ifconfig does not report any dropped packets or other errors on any
 machine
 in the cluster.  dmesg has nothing interesting.
 The poorly performing cluster is on a slightly newer CentOS version:
 Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64
 x86_64
 x86_64 GNU/Linux  (CentOS 5.4, recent patches)
 Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64
 x86_64 GNU/Linux  (CentOS 5.3, I think

Extremely slow HDFS after upgrade

2010-04-16 Thread Scott Carey
I have two clusters upgraded to CDH2.   One is performing fine, and the other 
is EXTREMELY slow.

Some jobs that formerly took 90 seconds, take 20 to 50 minutes.

It is an HDFS issue from what I can tell.

The simple DFS benchmark with one map task shows the problem clearly.  I have 
looked at every difference I can find and am wondering where else to look to 
track this down.
The disks on all nodes in the cluster check out -- capable of 75MB/sec minimum 
with a 'dd' write test.
top / iostat do not show any significant CPU usage or iowait times on any 
machines in the cluster during the test.
ifconfig does not report any dropped packets or other errors on any machine in 
the cluster.  dmesg has nothing interesting.
The poorly performing cluster is on a slightly newer CentOS version:
Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64 
x86_64 GNU/Linux  (CentOS 5.4, recent patches)
Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 
GNU/Linux  (CentOS 5.3, I think)
The performance is always poor, not sporadically poor.  It is poor with M/R 
tasks as well as non-M/R HDFS clients (i.e. sqoop).

Poor performance cluster (no other jobs active during the test):
---
$ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write 
-nrFiles 1 -fileSize 2000
10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 100
10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000 mega 
bytes, 1 files
10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1 
files
10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
10/04/16 12:53:15 INFO mapred.FileInputFormat: Total input paths to process : 1
10/04/16 12:53:15 INFO mapred.JobClient: Running job: job_201004091928_0391
10/04/16 12:53:16 INFO mapred.JobClient:  map 0% reduce 0%
10/04/16 13:42:30 INFO mapred.JobClient:  map 100% reduce 0%
10/04/16 13:43:06 INFO mapred.JobClient:  map 100% reduce 100%
10/04/16 13:43:07 INFO mapred.JobClient: Job complete: job_201004091928_0391
[snip]
10/04/16 13:43:07 INFO mapred.FileInputFormat: - TestDFSIO - : write
10/04/16 13:43:07 INFO mapred.FileInputFormat:Date  time: Fri Apr 
16 13:43:07 PDT 2010
10/04/16 13:43:07 INFO mapred.FileInputFormat:Number of files: 1
10/04/16 13:43:07 INFO mapred.FileInputFormat: Total MBytes processed: 2000
10/04/16 13:43:07 INFO mapred.FileInputFormat:  Throughput mb/sec: 
0.678296742615553
10/04/16 13:43:07 INFO mapred.FileInputFormat: Average IO rate mb/sec: 
0.6782967448234558
10/04/16 13:43:07 INFO mapred.FileInputFormat:  IO rate std deviation: 
9.568803140552889E-5
10/04/16 13:43:07 INFO mapred.FileInputFormat: Test exec time sec: 2992.913


Good performance cluster (other jobs active during the test):
-
hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write 
-nrFiles 1 -fileSize 2000
10/04/16 12:50:52 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in 
the classpath. Usage of hadoop-site.xml is deprecated. Instead use 
core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of 
core-default.xml, mapred-default.xml and hdfs-default.xml respectively
TestFDSIO.0.0.4
10/04/16 12:50:52 INFO mapred.FileInputFormat: nrFiles = 1
10/04/16 12:50:52 INFO mapred.FileInputFormat: fileSize (MB) = 2000
10/04/16 12:50:52 INFO mapred.FileInputFormat: bufferSize = 100
10/04/16 12:50:52 INFO mapred.FileInputFormat: creating control file: 2000 mega 
bytes, 1 files
10/04/16 12:50:52 INFO mapred.FileInputFormat: created control files for: 1 
files
10/04/16 12:50:52 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
10/04/16 12:50:53 INFO mapred.FileInputFormat: Total input paths to process : 1
10/04/16 12:50:54 INFO mapred.JobClient: Running job: job_201003311607_4098
10/04/16 12:50:55 INFO mapred.JobClient:  map 0% reduce 0%
10/04/16 12:51:22 INFO mapred.JobClient:  map 100% reduce 0%
10/04/16 12:51:32 INFO mapred.JobClient:  map 100% reduce 100%
10/04/16 12:51:32 INFO mapred.JobClient: Job complete: job_201003311607_4098
[snip]
10/04/16 12:51:32 INFO mapred.FileInputFormat: - TestDFSIO - : write
10/04/16 12:51:32 INFO mapred.FileInputFormat:Date  time: Fri Apr 
16 12:51:32 PDT 2010
10/04/16 12:51:32 INFO mapred.FileInputFormat:Number of files: 1
10/04/16 12:51:32 INFO mapred.FileInputFormat: Total MBytes processed: 2000
10/04/16 12:51:32 INFO mapred.FileInputFormat:  Throughput mb/sec: 
92.47699634715865
10/04/16 12:51:32 INFO mapred.FileInputFormat: Average IO rate mb/sec: 
92.47699737548828

Re: Extremely slow HDFS after upgrade

2010-04-16 Thread Scott Carey
Ok, so here is a ... fun result.

I have dfs.replication.min set to 2, so I can't just do
hadoop fs -Ddfs.replication=1 -put someFile someFile
Since that will fail.

So here are two results that are fascinating:

$ time hadoop fs -Ddfs.replication=3 -put test.tar test.tar
real1m53.237s
user0m1.952s
sys 0m0.308s

$ time hadoop fs -Ddfs.replication=2 -put test.tar test.tar
real0m1.689s
user0m1.763s
sys 0m0.315s



The file is 77MB and so is two blocks.
The test with replication level 3 is slow about 9 out of 10 times.  When it is 
slow it sometimes is 28 seconds, sometimes 2 minutes.  It was fast one time...
The test with replication level 2 is fast in 40 out of 40 tests.

This is a development cluster with 8 nodes.

It looks like the replication level of 3 or more causes trouble.  Looking more 
closely at the logs, it seems that certain datanodes (but not all) cause large 
delays if they are in the middle of an HDFS write chain.  So, a write that goes 
from A > B > C is fast if B is a good node and C a bad node.  If it's A > C > B
then it's slow.

So, I can say that some nodes but not all are doing something wrong. when in 
the middle of a write chain.  If I do a replication = 2 write on one of these 
bad nodes, its always slow.

So the good news is I can identify the bad nodes, and decommission them.  The
bad news is this still doesn't make a lot of sense, and 40% of the nodes have 
the issue.  Worse, on a couple nodes the behavior in the replication = 2 case 
is not consistent -- sometimes the first block is fast.  So it may be dependent 
on not just the source, but the source and target combination in the chain.


At this point, I suspect something completely broken at the network level, 
perhaps even routing.  Why it would show up after an upgrade is yet to be 
determined, but the upgrade did include some config changes and OS updates.

Thanks Todd!

-Scott


On Apr 16, 2010, at 5:34 PM, Todd Lipcon wrote:

 Hey Scott,

 This is indeed really strange... if you do a straight hadoop fs -put with
 dfs.replication set to 1 from one of the DNs, does it upload slow? That
 would cut out the network from the equation.

 -Todd

 On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey sc...@richrelevance.com wrote:

 I have two clusters upgraded to CDH2.   One is performing fine, and the
 other is EXTREMELY slow.

 Some jobs that formerly took 90 seconds, take 20 to 50 minutes.

 It is an HDFS issue from what I can tell.

 The simple DFS benchmark with one map task shows the problem clearly.  I
 have looked at every difference I can find and am wondering where else to
 look to track this down.
 The disks on all nodes in the cluster check out -- capable of 75MB/sec
 minimum with a 'dd' write test.
 top / iostat do not show any significant CPU usage or iowait times on any
 machines in the cluster during the test.
 ifconfig does not report any dropped packets or other errors on any machine
 in the cluster.  dmesg has nothing interesting.
 The poorly performing cluster is on a slightly newer CentOS version:
 Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64
 x86_64 GNU/Linux  (CentOS 5.4, recent patches)
 Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64
 x86_64 GNU/Linux  (CentOS 5.3, I think)
 The performance is always poor, not sporadically poor.  It is poor with M/R
 tasks as well as non-M/R HDFS clients (i.e. sqoop).

 Poor performance cluster (no other jobs active during the test):
 ---
 $ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write
 -nrFiles 1 -fileSize 2000
 10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
 10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
 10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 100
 10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000
 mega bytes, 1 files
 10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1
 files
 10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the same.
 10/04/16 12:53:15 INFO mapred.FileInputFormat: Total input paths to process
 : 1
 10/04/16 12:53:15 INFO mapred.JobClient: Running job: job_201004091928_0391
 10/04/16 12:53:16 INFO mapred.JobClient:  map 0% reduce 0%
 10/04/16 13:42:30 INFO mapred.JobClient:  map 100% reduce 0%
 10/04/16 13:43:06 INFO mapred.JobClient:  map 100% reduce 100%
 10/04/16 13:43:07 INFO mapred.JobClient: Job complete:
 job_201004091928_0391
 [snip]
 10/04/16 13:43:07 INFO mapred.FileInputFormat: - TestDFSIO - :
 write
 10/04/16 13:43:07 INFO mapred.FileInputFormat:Date  time: Fri
 Apr 16 13:43:07 PDT 2010
 10/04/16 13:43:07 INFO mapred.FileInputFormat:Number of files: 1
 10/04/16 13:43:07 INFO mapred.FileInputFormat: Total MBytes processed: 2000
 10/04/16 13:43:07 INFO mapred.FileInputFormat:  Throughput

Re: Extremely slow HDFS after upgrade

2010-04-16 Thread Scott Carey
More info -- this is not a Hadoop issue.

The network performance issue can be replicated with SSH only on the links 
where Hadoop has a problem, and only in the direction with a problem.

HDFS is slow to transfer data in certain directions from certain machines.

So, for example, copying from node C to D may be slow, but not the other
direction, from D to C.  Likewise, although only 3 of 8 nodes have this problem,
it is not universal.  For example, node C might have trouble copying data to 5 
of the 7 other nodes, and node G might have trouble with all 7 other nodes.

No idea what it is yet, but SSH exhibits the same issue -- only in those 
specific point-to-point links in one specific direction.

-Scott

On Apr 16, 2010, at 7:10 PM, Scott Carey wrote:

 Ok, so here is a ... fun result.

 I have dfs.replication.min set to 2, so I can't just do
  hadoop fs -Ddfs.replication=1 -put someFile someFile
 Since that will fail.

 So here are two results that are fascinating:

 $ time hadoop fs -Ddfs.replication=3 -put test.tar test.tar
 real1m53.237s
 user0m1.952s
 sys 0m0.308s

 $ time hadoop fs -Ddfs.replication=2 -put test.tar test.tar
 real0m1.689s
 user0m1.763s
 sys 0m0.315s



 The file is 77MB and so is two blocks.
 The test with replication level 3 is slow about 9 out of 10 times.  When it 
 is slow it sometimes is 28 seconds, sometimes 2 minutes.  It was fast one 
 time...
 The test with replication level 2 is fast in 40 out of 40 tests.

 This is a development cluster with 8 nodes.

 It looks like the replication level of 3 or more causes trouble.  Looking 
 more closely at the logs, it seems that certain datanodes (but not all) cause 
 large delays if they are in the middle of an HDFS write chain.  So, a write 
  that goes from A > B > C is fast if B is a good node and C a bad node.  If
  it's A > C > B then it's slow.

 So, I can say that some nodes but not all are doing something wrong. when in 
 the middle of a write chain.  If I do a replication = 2 write on one of these 
 bad nodes, its always slow.

  So the good news is I can identify the bad nodes, and decommission them.  The
 bad news is this still doesn't make a lot of sense, and 40% of the nodes have 
 the issue.  Worse, on a couple nodes the behavior in the replication = 2 case 
 is not consistent -- sometimes the first block is fast.  So it may be 
  dependent on not just the source, but the source and target combination in the
 chain.


 At this point, I suspect something completely broken at the network level, 
 perhaps even routing.  Why it would show up after an upgrade is yet to be 
 determined, but the upgrade did include some config changes and OS updates.

 Thanks Todd!

 -Scott


 On Apr 16, 2010, at 5:34 PM, Todd Lipcon wrote:

 Hey Scott,

 This is indeed really strange... if you do a straight hadoop fs -put with
 dfs.replication set to 1 from one of the DNs, does it upload slow? That
 would cut out the network from the equation.

 -Todd

  On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey sc...@richrelevance.com wrote:

 I have two clusters upgraded to CDH2.   One is performing fine, and the
 other is EXTREMELY slow.

 Some jobs that formerly took 90 seconds, take 20 to 50 minutes.

 It is an HDFS issue from what I can tell.

 The simple DFS benchmark with one map task shows the problem clearly.  I
 have looked at every difference I can find and am wondering where else to
 look to track this down.
 The disks on all nodes in the cluster check out -- capable of 75MB/sec
 minimum with a 'dd' write test.
 top / iostat do not show any significant CPU usage or iowait times on any
 machines in the cluster during the test.
 ifconfig does not report any dropped packets or other errors on any machine
 in the cluster.  dmesg has nothing interesting.
 The poorly performing cluster is on a slightly newer CentOS version:
 Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64
 x86_64 GNU/Linux  (CentOS 5.4, recent patches)
 Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64
 x86_64 GNU/Linux  (CentOS 5.3, I think)
 The performance is always poor, not sporadically poor.  It is poor with M/R
 tasks as well as non-M/R HDFS clients (i.e. sqoop).

 Poor performance cluster (no other jobs active during the test):
 ---
 $ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write
 -nrFiles 1 -fileSize 2000
 10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
 10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
 10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 100
 10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000
 mega bytes, 1 files
 10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1
 files
 10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the same.
 10/04/16 12:53:15

Re: swapping on hadoop

2010-04-02 Thread Scott Carey

On Apr 1, 2010, at 5:04 PM, Vasilis Liaskovitis wrote:
 
 
 ok. Now, considering a map side space buffer and a sort buffer, do
 both account for tenured space for both map and reduce JVMs? I 'd
 think the map side buffer gets used and tenured for map tasks and the
 sort space gets used and tenured for the reduce task during sort/merge
 phase. Would both spaces really be used in both kinds of tasks?
 

It is my understanding that a JVM used for a map won't also be used for a 
reduce.  The JVM reuse runs multiple maps or reduces in one process but not 
across both.
The mapper does the majority of the sorting, the reducer mostly merges 
pre-sorted data.  Each kind of task tends to have a different memory footprint, 
dependent on the job and data.

 The maximum number of map and reduce tasks per node applies no matter how 
 many jobs are running.
 
 RIght. But depending on your job scheduler, isn't it possible that you
 may be swapping the different jobs' JVM space in and out of physical
 memory while scheduling all the parallel jobs? Especially if nodes
 don't have huge amounts of memory, this scenario sounds likely.
 

To be more precise, the max number of map and reduce tasks corresponds with the 
maximum number of active JVMs of each type at the same time.  When a job 
finishes all of its tasks, the JVMs for it are killed.  A new job gets new 
JVMs.  Running concurrent jobs means that each job has some fraction of these 
JVM slots occupied.
So, there should be no swapping of different jobs' JVMs in and out of RAM.  The
same number of active JVMs exists for one large job as it does for 4
concurrent jobs.

 
 
 Back to a single job running and assuming all heap space being used,
 what percentage of a node's memory would you leave for other functions
 esp. disk cache? I currently only have 25% of memory (~4GB) for
  non-JVM-heap data; I guess there should be a sweet-spot, probably
 dependent on the job I/O characteristics.
 

It will depend on the job, its I/O, and the OS tuning.  But 25% to 33% of memory
for system file cache has worked for me (remember, the nodes aren't just for
tasks, but also for HDFS).  A small amount of swap-out isn't bad, since the
JVMs expire and never swap back in.


 - Vasilis



Re: OutOfMemoryError: Cannot create GC thread. Out of system resources

2010-04-01 Thread Scott Carey
The default size of Java's young GC generation is 1/3 of the heap.  
(-XX:NewRatio defaults to 2)
You have told it to use 100MB for in memory file system.  There is a default 
setting of 64MB sort space.   

if -Xmx is 128M then the above sums to over 200MB and won't fit.   Turning down 
the use of any of the three above could help, or increasing -Xmx.

Additionally, when a thread can't be allocated it could potentially be due to a 
limit on the OS side for file system handles per process or user.


On Mar 31, 2010, at 11:48 AM, Edson Ramiro wrote:

 Hi all,
 
 When I run the pi Hadoop sample I get this error:
 
 10/03/31 15:46:13 WARN mapred.JobClient: Error reading task output http://h04.ctinfra.ufpr.br:50060/tasklog?plaintext=true&taskid=attempt_201003311545_0001_r_02_0&filter=stdout
 10/03/31 15:46:13 WARN mapred.JobClient: Error reading task output http://h04.ctinfra.ufpr.br:50060/tasklog?plaintext=true&taskid=attempt_201003311545_0001_r_02_0&filter=stderr
 10/03/31 15:46:20 INFO mapred.JobClient: Task Id :
 attempt_201003311545_0001_m_06_1, Status : FAILED
 java.io.IOException: Task process exit with nonzero status of 134.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
 
 Maybe it's because the datanode can't create more threads.
 
 ram...@lcpad:~/hadoop-0.20.2$ cat
 logs/userlogs/attempt_201003311457_0001_r_01_2/stdout
 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 # java.lang.OutOfMemoryError: Cannot create GC thread. Out of system
 resources.
 #
 #  Internal Error (gcTaskThread.cpp:38), pid=28840, tid=140010745776400
 #  Error: Cannot create GC thread. Out of system resources.
 #
 # JRE version: 6.0_17-b04
 # Java VM: Java HotSpot(TM) 64-Bit Server VM (14.3-b01 mixed mode
 linux-amd64 )
 # An error report file with more information is saved as:
 #
 /var-host/tmp/hadoop-ramiro/mapred/local/taskTracker/jobcache/job_201003311457_0001/attempt_201003311457_0001_r_01_2/work/hs_err_pid28840.log
 #
 # If you would like to submit a bug report, please visit:
 #   http://java.sun.com/webapps/bugreport/crash.jsp
 #
 
 I configured the limits bellow, but I'm still getting the same error.
 
  <property>
  <name>fs.inmemory.size.mb</name>
  <value>100</value>
  </property>

  <property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx128M</value>
  </property>
 
 Do you know what limit I should configure to fix it?
 
 Thanks in Advance
 
 Edson Ramiro



Re: swapping on hadoop

2010-04-01 Thread Scott Carey

On Apr 1, 2010, at 8:38 AM, Vasilis Liaskovitis wrote:

 
 In this example, what hadoop config parameters do the above 2 buffers
 refer to? io.sort.mb=250, but which parameter does the map side join
 100MB refer to? Are you referring to the split size of the input data
 handled by a single map task? Apart from that question, the example is
 clear to me and useful, thanks.
 

Map side join is just an example of one of many possible use cases where a
particular map implementation may hold on to some semi-permanent data for the 
whole task.
It could be anything that takes 100MB of heap and holds the data across 
individual calls to map().
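
For illustration only, a minimal sketch (not from the original mail; the class
name and lookup contents are made up) of a mapper that holds such
semi-permanent data on the heap across map() calls:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  // Reference data loaded once per task; it stays resident for every map()
  // call and is promoted to the tenured generation over the task's lifetime.
  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // In a real job this might be populated from the DistributedCache;
    // here a single made-up entry stands in for ~100MB of reference data.
    lookup.put("exampleKey", "exampleValue");
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String joined = lookup.get(value.toString());
    if (joined != null) {
      context.write(value, new Text(joined));
    }
  }
}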

 
 Quoting Allen: Java takes more RAM than just the heap size.
 Sometimes 2-3x as much.
 Is there a clear indication that Java memory usage extends so far
 beyond its allocated heap? E.g. would java thread stacks really
 account for such a big increase 2x to 3x? Tasks seem to be heavily
 threaded. What are the relevant config options to control number of
 threads within a task?
 

Java typically uses 5MB to 60MB for classloader data (statics, classes) and 
some space for threads, etc.  The default thread stack on most OS's is about 
1MB, and the number of threads for a task process is on the order of a dozen.
Getting 2-3x the space in a java process outside the heap would require either 
a huge thread count, a large native library loaded, or perhaps a non-java 
hadoop job using pipes.
It would be rather obvious in 'top' if you sort by memory (shift-M on linux), 
or vmstat, etc.   To get the current size of the heap of a process, you can use 
jstat or 'kill -3' to create a stack dump and heap summary.
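
From inside a JVM the same numbers can also be read programmatically; a minimal
sketch (not from the original mail) that prints heap and non-heap usage, similar
in spirit to a jstat or kill -3 heap summary:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class MemReport {
  public static void main(String[] args) {
    MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
    // Heap covers the young and tenured generations; non-heap covers the
    // classloader/permgen data and code cache, i.e. space outside -Xmx.
    System.out.println("heap:     " + mem.getHeapMemoryUsage());
    System.out.println("non-heap: " + mem.getNonHeapMemoryUsage());
  }
}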

 
 With this new setup, I don't normally get swapping for a single job
 e.g. terasort or hive job. However, the problem in general is
 exacerbated if one spawns multiple indepenendent hadoop jobs
 simultaneously. I 've noticed that JVMs are not re-used across jobs,
 in an earlier post:
 http://www.mail-archive.com/common-...@hadoop.apache.org/msg01174.html
 This implies that Java memory usage would blow up when submitting
 multiple independent jobs. So this multiple job scenario sounds more
 susceptible to swapping
 
The maximum number of map and reduce tasks per node applies no matter how many 
jobs are running.


 A relevant question is: in production environments, do people run jobs
 in parallel? Or is it that the majority of jobs is a serial pipeline /
 cascade of jobs being run back to back?
 
Jobs are absolutely run in parallel.  I recommend using the fair scheduler with 
no config parameters other than 'assignmultiple = true' as the 'baseline' 
scheduler, and adjust from there accordingly.  The Capacity Scheduler has more 
tuning knobs for dealing with memory constraints if jobs have drastically 
different memory needs.  The out-of-the-box FIFO scheduler tends to have a hard 
time keeping the cluster utilization high when there are multiple jobs to run.

 thanks,
 
 - Vasilis



Re: swapping on hadoop

2010-03-31 Thread Scott Carey
On Linux, check out the 'swappiness' OS tunable -- you can turn this down from 
the default to reduce swapping at the expense of some system file cache.  
However, you want a decent chunk of RAM left for the system to cache files -- 
if it is all allocated and used by Hadoop there will be extra I/O.

For Java GC, if your -Xmx is above 600MB or so, try either changing 
-XX:NewRatio to a smaller number (default is 2 for Sun JDK 6) or setting the 
-XX:MaxNewSize parameter to around 150MB to 250MB.

An example of Hadoop memory use scaling as -Xmx grows:

Lets say you have a Hadoop job with a 100MB map side join, and 250MB of hadoop 
sort space.

Both of these chunks of data will eventually get pushed to the tenured 
generation.   So, the actual heap required will end up close to:

(Size of young generation) + 100MB + 250MB + misc.  
The default size of the young generation is 1/3 of the heap.  So, at -Xmx750M 
this job will probably use a minimum of 600MB of java heap, plus about 50MB 
non-heap if this is a pure java job.

Now, perhaps due to some other jobs you want to set -Xmx1200M.  The above job 
will end up using about 150MB more now, because the new space has grown, 
although the footprint is the same.   A larger new space can improve 
performance, but with most typical hadoop jobs it won't.   Making sure it does 
not grow larger just because -Xmx is larger can help save a lot of memory.  
Additionally, a job that would have failed with an OOME at -Xmx1200M might pass 
at -Xmx1000M if the young generation takes 150MB instead of 400MB of the space.

If you are using a 64 bit JRE, you can also save space with the 
-XX:+UseCompressedOops option -- sometimes quite a bit of space.

On Mar 30, 2010, at 10:15 AM, Vasilis Liaskovitis wrote:

 Hi all,
 
 I 've noticed swapping for a single terasort job on a small 8-node
 cluster using hadoop-0.20.1. The swapping doesn't happen repeatably; I
 can have back to back runs of the same job from the same hdfs input
 data and get swapping only on 1 out of 4 identical runs. I 've noticed
 this swapping behaviour on both terasort jobs and hive query jobs.
 
 - Focusing on a single job config, Is there a rule of thumb about how
 much node memory should be left for use outside of Child JVMs?
 I make sure that per Node, there is free memory:
 (#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) *
 JVMHeapSize < PhysicalMemoryonNode
 The total JVM heap size per node per job from the above equation
 currently account 65%-75% of the node's memory. (I 've tried
 allocating a riskier 90% of the node's memory, with similar swapping
 observations).
 
 - Could there be an issue with HDFS data or metadata taking up memory?
 I am not cleaning output or intermediate outputs from HDFS between
 runs. Is this possible?
 
 - Do people use any specific java flags (particularly garbage
 collection flags) for production environments where one job runs (or
 possibly more jobs run simultaneously) ?
 
 - What are the memory requirements for the jobtracker,namenode and
 tasktracker,datanode JVMs?
 
 - I am setting io.sort.mb to about half of the JVM heap size (half of
 -Xmx in javaopts). Should this be set to a different ratio? (this
 setting doesn't sound like it should be causing swapping in the first
 place).
 
 - The buffer cache is cleaned before each run (flush and echo 3 >
 /proc/sys/vm/drop_caches)
 
 any empirical advice and suggestions  to solve this are appreciated.
 thanks,
 
 - Vasilis



Re: where does jobtracker get the IP and port of namenode?

2010-03-09 Thread Scott Carey

On Mar 8, 2010, at 11:38 PM, jiang licht wrote:

 I guess my confusion is this:
 
 I point fs.default.name to hdfs://A:50001 in core-site.xml (A is an IP address).
 I assume when tasktracker starts, it should use A:50001 to contact namenode. 
 But actually, tasktracker log shows that it uses B which is IP address of 
 another network interface of the  namenode box and because the tasktracker 
 box cannot reach address B, the tasktracker simply retries connection and 
 finally fails to start. I read some source code in 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize and it seems to me 
 the namenode address is passed in earlier from what is specified in 
 fs.default.name. Is it correct that the namenode address used here by the
 tasktracker comes from fs.default.name in core-site.xml, or is there
 another step in which this value is changed? Could someone elaborate on
 how the tasktracker resolves the namenode and contacts it? Thanks!
 

Hadoop is rather annoyingly strict on how dns and reverse dns are aligned.  I'm 
not sure if it applies to your specific problem, but:
Even if configured to talk to A, if A is an IP address, in some places it will 
reverse-dns that IP, then dns resolve the resolved name.

So if IP A maps by reverse dns (via dns or a hosts file or whatever) to name 
FOO, and FOO resolves to IP address B, then that is likely your problem.
datanodes/namenodes with multiple ip addresses often have problems like this.  
I wish that if you configured it to 'talk to IP address A' all it did was try 
and talk to IP address A, but that's not how it works.
I'm used to seeing this as a datanode network configuration problem, not a 
namenode problem.  But you mention that the server has more than one network 
interface, so it may be related.
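
A quick way to check whether this reverse-then-forward pattern is what bites you
is a small sketch like the following (not from the original mail; replace the
fallback address with the namenode IP from fs.default.name):

import java.net.InetAddress;

public class ResolveCheck {
  public static void main(String[] args) throws Exception {
    // Reverse-resolve the IP to a name, then forward-resolve that name again,
    // mirroring the lookup pattern described above.
    InetAddress a = InetAddress.getByName(args.length > 0 ? args[0] : "10.0.0.1");
    String name = a.getCanonicalHostName();
    InetAddress b = InetAddress.getByName(name);
    System.out.println(a.getHostAddress() + " -> " + name + " -> " + b.getHostAddress());
  }
}

If the final address comes back as B rather than A when run from the tasktracker
box, the hosts file or DNS for that interface is the thing to fix.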


 Thanks,
 
 Michael
 
 --- On Tue, 3/9/10, jiang licht licht_ji...@yahoo.com wrote:
 
 From: jiang licht licht_ji...@yahoo.com
 Subject: Re: where does jobtracker get the IP and port of namenode?
 To: common-user@hadoop.apache.org
 Date: Tuesday, March 9, 2010, 12:20 AM
 
 Sorry, that was a typo in my first post. I did use 'fs.default.name' in 
 core-site.xml.
 
 BTW, the following is the list of error message when tasktracker was started 
 and shows that tasktracker failed to connect to namenode A:50001.
 
 /
 STARTUP_MSG: Starting TaskTracker
 STARTUP_MSG:   host = HOSTNAME/127.0.0.1
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.20.1+169.56
 STARTUP_MSG:   build =  -r 8e662cb065be1c4bc61c55e6bff161e09c1d36f3; compiled 
 by 'root' on Tue Feb  9 13:40:08 EST 2010
 /
 2010-03-09 00:08:50,199 INFO org.mortbay.log: Logging to 
 org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via 
 org.mortbay.log.Slf4jLog
 2010-03-09 00:08:50,341 INFO org.apache.hadoop.http.HttpServer: Port returned 
 by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening 
 the listener on 50060
 2010-03-09 00:08:50,350 INFO org.apache.hadoop.http.HttpServer: 
 listener.getLocalPort() returned 50060 
 webServer.getConnectors()[0].getLocalPort() returned 50060
 2010-03-09 00:08:50,350 INFO org.apache.hadoop.http.HttpServer: Jetty bound 
 to port 50060
 2010-03-09 00:08:50,350 INFO org.mortbay.log: jetty-6.1.14
 2010-03-09 00:08:50,707 INFO org.mortbay.log: Started 
 selectchannelconnec...@0.0.0.0:50060
 2010-03-09 00:08:50,734 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
 Initializing JVM Metrics with processName=TaskTracker, sessionId=
 2010-03-09 00:08:50,749 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: 
 Initializing RPC Metrics with hostName=TaskTracker, port=52550
 2010-03-09 00:08:50,799 INFO org.apache.hadoop.ipc.Server: IPC Server 
 Responder: starting
 2010-03-09 00:08:50,800 INFO org.apache.hadoop.ipc.Server: IPC Server 
 listener on 52550: starting
 2010-03-09 00:08:50,800 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
 0 on 52550: starting
 2010-03-09 00:08:50,800 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
 1 on 52550: starting
 2010-03-09 00:08:50,801 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
 2 on 52550: starting
 2010-03-09 00:08:50,801 INFO org.apache.hadoop.mapred.TaskTracker: 
 TaskTracker up at: HOSTNAME/127.0.0.1:52550
 2010-03-09 00:08:50,801 INFO org.apache.hadoop.mapred.TaskTracker: Starting 
 tracker tracker_HOSTNAME:HOSTNAME/127.0.0.1:52550
 2010-03-09 00:08:50,802 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
 3 on 52550: starting
 2010-03-09 00:08:50,854 INFO org.apache.hadoop.mapred.TaskTracker:  Using 
 MemoryCalculatorPlugin : 
 org.apache.hadoop.util.linuxmemorycalculatorplu...@27b4c1d7
 2010-03-09 00:08:50,856 INFO org.apache.hadoop.mapred.TaskTracker: Starting 
 thread: Map-events fetcher for all reduce tasks on 
 tracker_HOSTNAME:HOSTNAME/127.0.0.1:52550
 2010-03-09 00:08:50,858 WARN org.apache.hadoop.mapred.TaskTracker: 
 TaskTracker's 

Re: Pipelining data from map to reduce

2010-03-04 Thread Scott Carey
Interesting article.  It claims to have the same fault tolerance but I don't 
see any explanation of how that can be.  

If a single mapper fails part-way through a task when it has transmitted 
partial results to a reducer, the whole job is corrupted.  With the current 
barrier between map and reduce, a job can recover from partially completed 
tasks and speculatively execute.

I would imagine that small low-latency tasks can benefit greatly from such an 
approach, but larger tasks need the barrier or will not be very fault tolerant. 
However, there are still a lot of optimizations to do in Hadoop for 
low-latency tasks while maintaining the barrier.


On Mar 4, 2010, at 2:18 PM, Jeff Hammerbacher wrote:

 Also see Breaking the MapReduce Stage Barrier from UIUC:
 http://www.ideals.illinois.edu/bitstream/handle/2142/14819/breaking.pdf
 
 On Thu, Mar 4, 2010 at 11:41 AM, Ashutosh Chauhan 
 ashutosh.chau...@gmail.com wrote:
 
 Bharath,
 
 This idea is  kicking around in academia.. not made into apache yet..
 https://issues.apache.org/jira/browse/MAPREDUCE-1211
 
 You can get a working prototype from:
 http://code.google.com/p/hop/
 
 Ashutosh
 
 On Thu, Mar 4, 2010 at 09:06, E. Sammer e...@lifeless.net wrote:
 On 3/4/10 12:00 PM, bharath v wrote:
 
 Hi ,
 
 Can we pipeline the map output directly into reduce phase without
 storing it in the local filesystem (avoiding disk IOs).
 If yes , how to do that ?
 
 Bharath:
 
 No, there's no way to avoid going to disk after the mappers.
 
 --
 Eric Sammer
 e...@lifeless.net
 http://esammer.blogspot.com
 
 



Re: Sun JVM 1.6.0u18

2010-03-01 Thread Scott Carey

On Mar 1, 2010, at 10:46 AM, Allen Wittenauer wrote:

 
 
 
 On 3/1/10 7:24 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
 u14 added support for the 64bit compressed memory pointers which
 seemed important due to the fact that hadoop can be memory hungry. u15
 has been stable in our deployments. Not saying you should not go
  newer, but I would not go older than u14.
 
 How are the compressed memory pointers working for you?  I've been debating
 turning them on here, so real world experience would be useful from those
  that have taken the plunge.
 

Been using it since they came out, both for Hadoop where needed and in many 
other applications.  Performance gains and memory reduction in most places -- 
sometimes rather significant (25%).  GC times significantly lower for any heap 
that is reference heavy.  Heaps are still a little larger than a 32 bit one, 
but the benefits of native 64 bit code on x86 include improved computational 
performance as well.  6u18 introduces some performance enhancements to the 
feature that we might be able to use if 6u19 fixes the other bugs.  The next 
Hotspot version will make it the default setting, whenever that gets integrated 
into and tested in the JDK6 line.  6u14 and 6u18 are the last two JDK releases 
with updated Hotspot versions.
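
For anyone who wants to try it, a sketch of how the flag can be passed to the
task JVMs via mapred.child.java.opts (the heap size here is only a
placeholder); the daemon JVMs pick it up from the usual options in
conf/hadoop-env.sh:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m -XX:+UseCompressedOops</value>
</property>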

Re: Sun JVM 1.6.0u18

2010-02-25 Thread Scott Carey
On Feb 15, 2010, at 9:54 PM, Todd Lipcon wrote:

 Hey all,
 
 Just a note that you should avoid upgrading your clusters to 1.6.0u18.
 We've seen a lot of segfaults or bus errors on the DN when running
  with this JVM - Stack found the same thing on one of his clusters as
 well.
 

Have you seen this for 32bit, 64 bit, or both?  If 64 bit, was it with 
-XX:+UseCompressedOops?

Any idea if there are Sun bugs open for the crashes?

I have found some notes that suggest that -XX:-ReduceInitialCardMarks will 
work around some known crash problems with 6u18, but that may be unrelated.  

Lastly, I assume that Java 6u17 should work the same as 6u16, since it is a 
minor patch over 6u16, whereas 6u18 includes a new version of Hotspot.  Can anyone 
confirm that?


 We've found 1.6.0u16 to be very stable.
 
 -Todd



Re: hadoop idle time on terasort

2009-12-09 Thread Scott Carey



On 12/8/09 1:24 PM, Vasilis Liaskovitis vlias...@gmail.com wrote:

 Hi Scott,
 
 thanks for the extra tips, these are very helpful.
 
 
 
 I think the slots are being highly utilized, but I seem to have
 forgotten which option in the web UI allows you to look at the slot
 allocations during runtime on each tasktracker. Is this available on
 the job details from the jobtracker's ui or somewhere else?
 

I usually just look at the jobtracker's main page, which lists the total
number of slots for the cluster and how many are in use:
Maps, Reduces, Map Task Capacity, Reduce Task Capacity.
I usually don't drill down to a per-tasktracker basis.

I also look at the running jobs on the same main page, and see the
allocation of tasks between them.  Clicking down to an individual job lets
you see the tasks of each job and view their logs.  That is usually
sufficient (combined with top, mpstat, iostat, etc) to see how the
scheduling is working and correlate tasks or phases of hadoop with certain
system behavior.

 Running the fair scheduler -- or any scheduler -- that can be configured to
 schedule more than one task per heartbeat can dramatically increase your
 slot utilization if it is low.
 
  I've been running a single job - would the scheduler benefits show
  up with multiple jobs only, by definition? I am now trying multiple
  simultaneous sort jobs with smaller disjoint datasets launched in
  parallel by the same user. Do I need to set up any fairscheduler
  parameters other than the below?

It can help with only one job as well, due to the 'assignmultiple' feature.
Often, the default scheduler will assign one task per tasktracker ping.
This is not enough to keep up, or to ramp up at the start of the job
quickly.
This is something you can see in the UI.  If jobs are completing faster than
the scheduler can assign new tasks the slot utilization will be low.

 
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>
  
  <property>
    <name>mapred.fairscheduler.assignmultiple</name>
    <value>true</value>
  </property>
  
  <property>
    <name>mapred.fairscheduler.allocation.file</name>
    <value>/home/user2/hadoop-0.20.1/conf/fair-scheduler.xml</value>
  </property>
  
  fair-scheduler.xml is:
  
  <?xml version="1.0"?>
  <allocations>
    <pool name="user2">
      <minMaps>12</minMaps>
      <minReduces>12</minReduces>
      <maxRunningJobs>16</maxRunningJobs>
      <weight>4</weight>
    </pool>
    <userMaxJobsDefault>16</userMaxJobsDefault>
  </allocations>
 
 With 4 parallel sort jobs, I am noticing that maps execute in parallel
 across all jobs.  But reduce tasks are only allocated/executed from a
 single job at a time, until that job finishes. Is that expected or am
 I missing something in my fairscheduler (or other) settings?
 

The config seems fine.

The reduce tasks all belonging to one job seems a bit odd to me; usually
reduce tasks are distributed amongst jobs just as map tasks are.   However,
there is another parameter that defines how far along the map phase of a job
has to be before it begins scheduling reduces.  That, combined with how many
total reduce slots you have on your cluster and how many each job is trying
to create, might create the behavior you are seeing.

I don't think I set the minMaps or minReduces and don't know the defaults.
Maybe that is affecting it?


 See this ticket: http://issues.apache.org/jira/browse/MAPREDUCE-318 and my
 comment from June 10 2009.
 
 
  For a single big sort job, I have ~2300 maps and 84 reduces on a 7-node
  cluster with 12-core nodes. The thread dumps for my unpatched version
  also show sleeping threads at fetchOutputs() - I don't know how often
  you've seen it in your own task dumps.  From what I understand, what
  we are looking for in the reduce logs, in terms of a shuffle idle
  bottleneck, is the time elapsed between the shuffling lines and the
  in-memory merge complete lines; does that sound right? With the one-line
  change in fetchOutputs(), I've sometimes seen the average shuffle
  time across tasks go down by ~5-10%, and so has the execution time. But
  the results are variable across runs; I need to look at the reduce
  logs and repeat the experiment.

For something like sort, which does not reduce the size of the data prior to
a reduce in a combiner or other filter, I would not expect the same massive
gains that I did in that ticket.
I built a close to worst-case example, where each reducer had to fetch 20+
map fragments from each node and the fetch time for each chunk is a few
milliseconds. In addition, the remainder of the reduce work is trivial; in
your case there is a lot of merging and spilling to do.  This will make the
logs harder to interpret -- there's more going on.

In your 7-node case, with 2300 maps, the default behavior would cause a
delay of a couple of seconds * (2300/7). Is that roughly the 10% you see?
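
Back-of-the-envelope, assuming the map outputs are spread evenly across the 7
nodes and the stock one-fetch-per-node-every-2-seconds behavior:

  2300 maps / 7 nodes ~= 330 map outputs per node
  330 outputs x ~2 s between fetches from any one node ~= 650 s (~11 minutes)

of artificial shuffle wait per reduce task in the worst case -- fetches from
different nodes overlap, but fetches from the same node are serialized.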

 
 From your  system description ticket-318 notes, I see you have
 configured your cluster to do 10 concurrent shuffle copies.  Does
 this refer to  

Re: hadoop idle time on terasort

2009-12-07 Thread Scott Carey

On 12/2/09 12:22 PM, Vasilis Liaskovitis vlias...@gmail.com wrote:

 Hi,
 
 I am using hadoop-0.20.1 to run terasort and randsort benchmarking
 tests on a small 8-node linux cluster. Most runs consist of usually
 low (50%) core utilizations in the map and reduce phase, as well as
 heavy I/O phases . There is usually a large fraction of runtime for
 which cores are idling and i/o disk traffic is not heavy.
 
 On average for the duration of a terasort run I get 20-30% cpu
 utilization, 10-30% iowait times and the rest 40-70% is idle time.
 This is data collected with mpstat for the duration of the run across
 the cores of a specific node. This utilization behaviour is true and
 symmetric for all tasktracker/data nodes (The namenode cores and I/O
 are mostly idle, so there doesn¹t seem to be a bottleneck in the
 namenode).
 

Look at individual task logs for map and reduce through the UI.  Also, look
at the cluster utilization during a run -- are most map and reduce slots
full most of the time, or is the slot utilization low?

Running the fair scheduler -- or any scheduler -- that can be configured to
schedule more than one task per heartbeat can dramatically increase your
slot utilization if it is low.

Next, if you find that your delay times correspond with the shuffle phase
(look in the reduce logs), there are fixes in 0.21 for that on the way, but
there is a quick win, one line change that cuts shuffle times down a lot on
clusters that have a large ratio of map tasks per node if the map output is
not too large.  For a pure sort test, the map outputs are medium sized (the
same size as the input), so this might not help.  But the indicators of the
issue are in the reduce task logs.
See this ticket: http://issues.apache.org/jira/browse/MAPREDUCE-318 and my
comment from June 10 2009.

My summary is that the hadoop scheduling process has so far not been designed for
servers that can run more than 6 or so tasks at once.  A server capable of
running 12 maps is especially prone to running under-utilized.  Many changes
in the 0.21 timeframe address some of this.

 
 thanks,
 
 - Vasilis
 



Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-15 Thread Scott Carey
What Hadoop version?

On a cluster this size there are two things to check right away:

1.  In the Hadoop UI, during the job, are the reduce and map slots close to
being filled up  most of the time, or are tasks completing faster than the
scheduler can keep up so that there are often many empty slots?

For 0.19.x and 0.20.x on a small cluster like this, use the Fair Scheduler
and make sure the configuration parameter that allows it to schedule more
than one task per heartbeat is on (at least one map and one reduce per
heartbeat, which is supported in 0.19.x).  This alone will cut times down if the
number of map and reduce tasks is at least 2x the number of nodes.
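
A minimal sketch of the relevant settings (hadoop-site.xml; jar placement and
pool definitions omitted):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.assignmultiple</name>
  <value>true</value>
</property>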


2. Your CPU, Disk and Network aren't saturated -- take a look at the logs of
the reduce tasks and look for long delays in the shuffle.
Utilization is throttled by a bug in the reducer shuffle phase, not fixed
until 0.21.  Simply put, a single reduce task won't fetch more than one map
output from another node every 2 seconds (though it can fetch from multiple
nodes at once).  Fix this by commenting out one line in 0.18.x, 0.19.x or
0.20.x -- see my comment here:
http://issues.apache.org/jira/browse/MAPREDUCE-318
From June 10 2009.

I saw shuffle times on small clusters with large map task count per node
ratio decrease by a factor of 30 from that one line fix.  It was the only
way to get the network to ever be close to saturation on any node.

The delays for low latency jobs on smaller clusters are predominantly
artificial due to the nature of most RPC being ping-response and most design
and testing done for large clusters of machines that only run a couple maps
or reduces per TaskTracker.


Do the above, and you won't be nearly as sensitive to the size of data per
task for low latency jobs as out-of-the-box Hadoop.  Your overall
utilization will go up quite a bit.

-Scott


On 10/14/09 7:31 AM, Chris Seline ch...@searchles.com wrote:

  No, there doesn't seem to be all that much network traffic. Most of the
  time, traffic (measured with nethogs) is about 15-30K/s on the master and
  slaves during map; sometimes it bursts up to 5-10 MB/s on a slave for maybe
  5-10 seconds on a query that takes 10 minutes, but that is still less
  than what I see in scp transfers on EC2, which is typically about 30 MB/s.
 
 thanks
 
 Chris
 
 Jason Venner wrote:
 are your network interface or the namenode/jobtracker/datanodes saturated
 
 
 On Tue, Oct 13, 2009 at 9:05 AM, Chris Seline ch...@searchles.com wrote:
 
  
  I am using the 0.3 Cloudera scripts to start a Hadoop cluster on EC2 of 11
  c1.xlarge instances (1 master, 10 slaves); that is the biggest instance
  available, with 20 compute units and 4x 400 GB disks.
 
 I wrote some scripts to test many (100's) of configurations running a
 particular Hive query to try to make it as fast as possible, but no matter
 what I don't seem to be able to get above roughly 45% cpu utilization on the
 slaves, and not more than about 1.5% wait state. I have also measured
 network traffic and there don't seem to be bottlenecks there at all.
 
 Here are some typical CPU utilization lines from top on a slave when
 running a query:
 Cpu(s): 33.9%us,  7.4%sy,  0.0%ni, 56.8%id,  0.6%wa,  0.0%hi,  0.5%si,
  0.7%st
 Cpu(s): 33.6%us,  5.9%sy,  0.0%ni, 58.7%id,  0.9%wa,  0.0%hi,  0.4%si,
  0.5%st
 Cpu(s): 33.9%us,  7.2%sy,  0.0%ni, 56.8%id,  0.5%wa,  0.0%hi,  0.6%si,
  1.0%st
 Cpu(s): 38.6%us,  8.7%sy,  0.0%ni, 50.8%id,  0.5%wa,  0.0%hi,  0.7%si,
  0.7%st
 Cpu(s): 36.8%us,  7.4%sy,  0.0%ni, 53.6%id,  0.4%wa,  0.0%hi,  0.5%si,
  1.3%st
 
 It seems like if tuned properly, I should be able to max out my cpu (or my
 disk) and get roughly twice the performance I am seeing now. None of the
 parameters I am tuning seem to be able to achieve this. Adjusting
 mapred.map.tasks and mapred.reduce.tasks does help somewhat, and setting the
 io.file.buffer.size to 4096 does better than the default, but the rest of
 the values I am testing seem to have little positive  effect.
 
 These are the parameters I am testing, and the values tried:
 
 io.sort.factor=2,3,4,5,10,15,20,25,30,50,100
 
 mapred.job.shuffle.merge.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.9
 0,0.93,0.95,0.97,0.98,0.99
 io.bytes.per.checksum=256,512,1024,2048,4192
 mapred.output.compress=true,false
 hive.exec.compress.intermediate=true,false
 
 hive.map.aggr.hash.min.reduction=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.9
 0,0.93,0.95,0.97,0.98,0.99
 mapred.map.tasks=1,2,3,4,5,6,8,10,12,15,20,25,30,40,50,60,75,100,150,200
 
 mapred.child.java.opts=-Xmx400m,-Xmx500m,-Xmx600m,-Xmx700m,-Xmx800m,-Xmx900m
 ,-Xmx1000m,-Xmx1200m,-Xmx1400m,-Xmx1600m,-Xmx2000m
 mapred.reduce.tasks=5,10,15,20,25,30,35,40,50,60,70,80,100,125,150,200
 mapred.merge.recordsBeforeProgress=5000,1,2,3
 
 mapred.job.shuffle.input.buffer.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0
 .80,0.90,0.93,0.95,0.99
 
 io.sort.spill.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95
 ,0.99
 

Re: Faster alternative to FSDataInputStream

2009-08-20 Thread Scott Carey
If it always takes a very long time to start transferring data, get a few
stack dumps (jstack or kill -3) during this period to see what it is doing
during this time.

Most likely, the client is doing nothing but waiting on the remote side.


On 8/20/09 8:02 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote:

 it's not really 1 MBps so much as that it takes 2 minutes to start doing the
 reads.
 
 Ananth T Sarathy
 
 
 On Wed, Aug 19, 2009 at 4:30 PM, Scott Carey sc...@richrelevance.comwrote:
 
 
 On 8/19/09 10:58 AM, Raghu Angadi rang...@yahoo-inc.com wrote:
 
 Edward Capriolo wrote:
 On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo
 edlinuxg...@gmail.comwrote:
 
 It would be as fast as underlying filesystem goes.
 I would not agree with that statement. There is overhead.
 
 You might be misinterpreting my comment. There is of course some over
 head (at the least the procedure calls).. depending on you underlying
 filesystem, there could be extra buffer copies and CRC overhead. But
 none of that explains transfer as slow as 1 MBps (if my interpretation
 of of results is correct).
 
 Raghu.
 
 
 Yes, there is nothing about distributing work for parallel execution that
 is
 going to make a single 20MB file transfer faster.   That is very slow, and
 should be on the order of a second or so, not multiple minutes.
  Something else is wrong.
 
 
 
 



Re: Faster alternative to FSDataInputStream

2009-08-20 Thread Scott Carey

On 8/20/09 9:48 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote:

 OK... it seems that's the case.  That seems kind of self-defeating though.
 
 Ananth T Sarathy

Then something is wrong with S3.  It may be misconfigured, or just poor
performance.  I have no experience with S3 but 20 seconds to connect
(authenticate?) and open a file seems very slow for any file system.

 
 
 On Thu, Aug 20, 2009 at 12:31 PM, Scott Carey sc...@richrelevance.comwrote:
 
 If it always takes a very long time to start transferring data, get a few
 stack dumps (jstack or kill -e) during this period to see what it is doing
 during this time.
 
 Most likely, the client is doing nothing but waiting on the remote side.
 
 
 On 8/20/09 8:02 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com
 wrote:
 
 it's not really 1 mbps so much it takes 2 minutes to start doing the
 reads.
 
 Ananth T Sarathy
 
 
 On Wed, Aug 19, 2009 at 4:30 PM, Scott Carey sc...@richrelevance.com
 wrote:
 
 
 On 8/19/09 10:58 AM, Raghu Angadi rang...@yahoo-inc.com wrote:
 
 Edward Capriolo wrote:
 On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo
 edlinuxg...@gmail.comwrote:
 
 It would be as fast as underlying filesystem goes.
 I would not agree with that statement. There is overhead.
 
 You might be misinterpreting my comment. There is of course some over
 head (at the least the procedure calls).. depending on you underlying
 filesystem, there could be extra buffer copies and CRC overhead. But
 none of that explains transfer as slow as 1 MBps (if my interpretation
 of of results is correct).
 
 Raghu.
 
 
 Yes, there is nothing about distributing work for parallel execution
 that
 is
 going to make a single 20MB file transfer faster.   That is very slow,
 and
 should be on the order of a second or so, not multiple minutes.
  Something else is wrong.
 
 
 
 
 
 
 



Re: NN memory consumption on 0.20/0.21 with compressed pointers/

2009-08-20 Thread Scott Carey


On 8/20/09 3:40 AM, Steve Loughran ste...@apache.org wrote:

 
 
 does anyone have any up to date data on the memory consumption per
 block/file on the NN on a 64-bit JVM with compressed pointers?
 
 The best documentation on consumption is
 http://issues.apache.org/jira/browse/HADOOP-1687 -I'm just wondering if
 anyone has looked at the memory footprint on the latest Hadoop releases,
 on those latest JVMs? -and which JVM the numbers from HADOOP-1687 came from?
 
 Those compressed pointers (which BEA JRockit had for a while) save RAM
 when the pointer references are within a couple of GB of the other refs,
 and which are discussed in some papers
 http://rappist.elis.ugent.be/~leeckhou/papers/cgo06.pdf
 http://www.elis.ugent.be/~kvenster/papers/VenstermansKris_ORA.pdf
 
 sun's commentary is up here
 http://wikis.sun.com/display/HotSpotInternals/CompressedOops
 
 I'm just not sure what it means for the NameNode, and as there is no
 sizeof() operator in Java, something that will take a bit of effort to
 work out. From what I read of the Sun wiki, when you go compressed,
 while your heap is 3-4GB, there is no decompress operation; once you go
 above that there is a shift and an add, which is probably faster than
 fetching another 32 bits from $L2 or main RAM. The result could be
 -could be- that your NN takes up much less space on 64 bit JVMs than it
 did before, but is no slower.

The implementation in JRE 6u14 uses a shift for all heap sizes; the
optimization to remove that for heaps less than 4GB is not in the hotspot
version there (but will be later).

The size advantage is there either way.

I have not tested an app myself that was not faster using
-XX:+UseCompressedOops on a 64 bit JVM.
The extra bit shifting is overshadowed by how much faster and less frequent
GC is with a smaller dataset.


 
 Has anyone worked out the numbers yet?
 
 -steve
 

Every Java reference is 4 bytes instead of 8, and for several types --
arrays in particular -- the object is also 4 bytes smaller.  Given that the
NN data structures have plenty of references, a 30% reduction in memory used
would not be a surprise.

Collection classes in particular are near half the size.
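
As a rough, JVM-build-dependent illustration: an object with four reference
fields needs about 16 (header) + 4 x 8 = 48 bytes with ordinary 64-bit oops,
versus about 12 + 4 x 4 = 28 bytes, padded to 32, with compressed oops -- on
the order of a third smaller, before counting the smaller arrays inside
collections.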



Re: Faster alternative to FSDataInputStream

2009-08-19 Thread Scott Carey

On 8/19/09 10:58 AM, Raghu Angadi rang...@yahoo-inc.com wrote:

 Edward Capriolo wrote:
 On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo
 edlinuxg...@gmail.comwrote:
 
 It would be as fast as underlying filesystem goes.
 I would not agree with that statement. There is overhead.
 
  You might be misinterpreting my comment. There is of course some overhead
  (at the least the procedure calls)... depending on your underlying
  filesystem, there could be extra buffer copies and CRC overhead. But
  none of that explains a transfer as slow as 1 MBps (if my interpretation
  of the results is correct).
 
 Raghu.


Yes, there is nothing about distributing work for parallel execution that is
going to make a single 20MB file transfer faster.   That is very slow, and
should be on the order of a second or so, not multiple minutes.
 Something else is wrong.




Re: What OS?

2009-08-14 Thread Scott Carey


On 8/14/09 6:27 AM, Brian Bockelman bbock...@cse.unl.edu wrote:

 
 On Aug 13, 2009, at 10:27 PM, Jason Venner wrote:
 
 Anyone have any performance numbers for Solaris or ZFS based
 datanodes.
 
 The directory and inode cache sizes are a limiting factor for linux
 for
 large and busy datanodes.
 
 I haven't run into this at all, and we have quite large and busy
 datanodes.
 
 However, I would recommend making sure you pick an OS you are
 comfortable administrating.  It doesn't do you any good to run Solaris
 due to speed (whatever the performance may be, better or worse) if it
 takes you twice as long to get basic admin tasks done.
 
 I haven't benchmarked our Solaris nodes vs Linux nodes.  However,
 anecdotally, HDFS on Solaris/ZFS consumes significantly more CPU than
 HDFS on Linux/ext3.
 
 Brian

I wonder if the extra CPU has anything to do with the ZFS checksums.
Perhaps it is lower with ZFS checksums off?  Since HDFS is already doing
checksums on the data, that should be safe.

On the other hand, with ZFS you can get transparent, very fast compression
for free.

ext3 tends to get very fragmented very fast if there are concurrent writes.
XFS avoids that but only if you set the allocsize mount parameter large
enough.  In theory, ZFS should avoid fragmentation fairly well for
write-once data like HDFS but I have no experience with that in practice.
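
For anyone experimenting along these lines, the knobs look roughly like this
(dataset, device and mount point names are made up, and whether disabling ZFS
checksums is acceptable for your data is your own call):

  # ZFS: drop the redundant checksumming and enable compression on the HDFS dataset
  zfs set checksum=off tank/hdfs
  zfs set compression=on tank/hdfs

  # XFS: mount with a large allocation size to limit fragmentation under concurrent writes
  mount -o noatime,allocsize=64m /dev/sdb1 /hadoop1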


 
 
 On Wed, Aug 12, 2009 at 7:45 AM, tim robertson timrobertson...@gmail.com
 wrote:
 
 Thanks guys.  I'll chat with sys admin and see what he thinks.
 We knew fedora would require a 6 month rebuild
 
 
 On Wed, Aug 12, 2009 at 4:36 PM, Edward Caprioloedlinuxg...@gmail.com
 
 wrote:
 On Wed, Aug 12, 2009 at 8:03 AM, Brian Bockelmanbbock...@cse.unl.edu
 
 wrote:
 Hey Tim,
 
 One consideration is how long is this OS version going to be
 receiving
 updates? or Do I do the operations team any favor by having them
 upgrade
 every 6 months?
 
 Personally, I'd avoid Fedora for a production cluster because the
 lack
 of
 long-lived releases means that you'll be spending extra effort on
 upgrading
 the OS.
 
 Brian
 
 On Aug 12, 2009, at 6:05 AM, tim robertson wrote:
 
 Hi all,
 
 Is fedora a decent choice of OS for a new hadoop cluster?  All our
 other stuff is fedora, but is there was a strong case to move to
 something else?
 
 Cheers
 
 Tim
 
 
 
  CentOS and Scientific Linux are Red Hat Enterprise Linux clones. I
  advise people to go with them. Most of this is based on the fact that
  CentOS is very compatible with RHEL. This is important because
  packaged, but not open source, software is typically targeted at RHEL.
  You can read about someone trying to install WebSphere on, say, Fedora
  Core and see the headaches. As mentioned above, support life is an
  issue. RHEL/CENT 5 will be supported until 2014.
 
 http://www.redhat.com/security/updates/errata/
 
  The Fedora line typically has a support life of a few months. So your
  package support dries up fast and then you have to get good with
  RPM-build fast :)
 
 
 
 
 
 -- 
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.amazon.com/dp/1430219424?tag=jewlerymall
 www.prohadoopbook.com a community for Hadoop Professionals
 



Re: Map performance with custom binary format

2009-07-31 Thread Scott Carey



On 7/30/09 2:32 PM, william kinney william.kin...@gmail.com wrote:

 Hrm I think I found an issue. In my RecordReader, I do an
 Arrays.copyOfRange() to get the protobuf binary section for
 RecordReader.next(key,value). In the profile dumps from child map
 processes, this one call takes up ~90% of the CPU Samples time.
 So, I wrapped the line w/ a System.nanoTime(), and got:

 Local process:
   total(ms): 902.138069, avg: 112.145 ns

 Hadoop Child processes:
   1) total(ms): 6953.47106, avg: 726.906 ns
   2) total(ms): 6503.962176, avg: 802.270 ns
   3) total(ms): 5482.494256, avg: 677.589 ns
   4) total(ms): 5291.664592, avg: 661.764 ns
   5) total(ms): 5568.289465, avg: 697.353 ns
   6) total(ms): 5638.778551, avg: 702.290 ns
...etc

 So for some reason, that call is taking over 6 times longer in hadoop...

 The buffer size for it is 65536 for both processes.

 Any ideas?


That is a very interesting result.  Try counting the number of times that
the above is called to ensure that it is the same for both -- if the average
size of the copy is much smaller it will be slower.

Other ideas -- is one using a native ByteBuffer underneath the covers
somewhere and the other not?   Is there some other difference in buffering
on either side of that copy?




 --
 You might want to try -Xmx486m as an experiment on the local test to see if
 it affects the behavior.  If you are doing a lot of garbage creation it may.
   -Tried it, no changes.

 Hmm, that was a random guess, because it would obviously affect CPU use.
 Another thing to try -- make sure your writer that is writing into HDFS is
 wrapped with a buffer (try 32k or 64k).  That's another random guess for
 something that might not show up well in stack traces without a profiler --
 but also might already be done
    - So you're saying when writing the file into HDFS, I should
  make sure it ends in a 64k chunk (i.e., zero out until I reach such a
  point)? So all file sizes are a multiple of 64 KB?

No, just that it's using something like a BufferedOutputStream when writing
from your custom format out (HDFS does this itself so it shouldn't be
necessary) and a BufferedInputStream for reading.
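
A minimal sketch of what I mean on the read side (the path and buffer size are
just examples, and FSDataInputStream already buffers according to
io.file.buffer.size, so treat this as belt-and-braces rather than a known fix):

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BufferedHdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Wrap the HDFS stream so that many small reads hit the in-memory buffer
    // instead of going back to the underlying stream each time.
    DataInputStream in = new DataInputStream(
        new BufferedInputStream(fs.open(new Path("/data/records.bin")), 64 * 1024));
    byte[] record = new byte[4096];
    in.readFully(record);   // read one fixed-size record as an example
    in.close();
  }
}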


 There is definitely a mystery here.  I expect the task scheduling delays and
 some startup inefficiency but the overall difference is odd.  What about a
 local test on a single, larger file versus a hadoop job on that same single,
 larger file (which would have just one map job)?  This test may be very
 enlightening.
   - Total job time was 20 seconds for the 506MB file. Task took 19
 seconds. Local process on the same file took ~ 3 seconds.

Ok, so drilling down here is where we need to look  (and what the results
above are).  Scheduling may be a few seconds of that.

 Hmm that difference seems a bit troubling.  For one, you are running two
 tasks at once per node -- is there any way to do your local, non MR test
 with two concurrent processes or threads?
   - Does the above test answer this? Only one task was executed on the
 node that took 19 seconds.

Yeah, it looks like we have ruled that out.

 ext3 filesystem.

 Make sure your OS readahead on the device is set to a good value (at least
 2048 blocks, preferably 8192 ish):
    - For RA it's showing 256, BSZ is 2048. Should RA be 8192? Should
  BSZ then be larger? What about SSZ?

I'm referring to /sbin/blockdev --getra <device>,
which reports just RA.  SSZ is the sector size -- that can't change -- and I think
BSZ is the block size, which is also static.

Use
/sbin/blockdev --setra <value> <device> to set the readahead.  This will
increase sequential throughput somewhat at the device level, but more so if
there are two or more concurrent reads.   It doesn't affect random I/O
performance.  Basically, if the block layer detects a sequence of I/O's that
are sequential, it starts reading ahead of the last I/O and keeps
increasing the size of this readahead as long as the sequential access
continues, up to a max size.
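
For example (the device name is hypothetical), to check and then raise the
readahead to 8192 sectors:

  /sbin/blockdev --getra /dev/sdb
  /sbin/blockdev --setra 8192 /dev/sdb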


 Since the performance is high for the local process, would that then
 mean my disk i/o is sufficient, as you suggested? Do I still need to
 change any of these settings?

You seem CPU bound, especially considering your evidence above.  I/O tuning
might help somewhere, but not this use case.


 If even a single task on a single large file is slower in MB/sec than your
 test program, then I suspect read/write buffer issues or misuse somewhere.
   - Do you know of an instance where I'd have buffer issues with the
 Child process, and not local? The only difference I can think of is of
 course how the buffer is filled, FileInputStream vs FSDataInputStream.
 But once it is filled, why would reading portions of that buffer (ie,
 Arrays.copyOfRange()) take long in one instance but not another?

I am not familiar enough with that part of Hadoop to know.  In general, that
buffer may be too small, or may be backed by a native ByteBuffer, which will be
slow for small reads into Java memory.
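
If you want to experiment with the Hadoop-side read buffer, the knob is
io.file.buffer.size (a sketch only; 131072 is just an example value, and
whether it helps this particular copy path is unverified):

<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
</property>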


 Would it be helpful to get a histogram of the 

Re: Map performance with custom binary format

2009-07-28 Thread Scott Carey
Well, the first thing to do in any performance bottleneck investigation is
to look at the machine hardware resource usage.

During your test, what is the CPU use and disk usage?  What about network
utilization?
Top, vmstat, iostat, and some network usage monitoring would be useful.  It
could be many things causing your lack of scalability, but without actually
monitoring your machines to see if there is an obvious bottleneck it's just
random guessing and hunches.
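
For example, something as simple as the following, run on a few slaves while
the job is executing, usually makes the bottleneck obvious (the intervals are
arbitrary):

  vmstat 5          # CPU, run queue, swap
  iostat -x 5       # per-disk utilization and await
  top               # per-process CPU and memory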



On 7/28/09 8:18 AM, william kinney william.kin...@gmail.com wrote:

 Hi,
 
 Thanks in advance for the help!
 
 I have a performance question relating to how fast I can expect Hadoop
 to scale. Running Cloudera's 0.18.3-10.
 
 I have custom binary format, which is just Google Protocol Buffer
 (protobuf) serialized data:
 
   669 files, ~30GB total size (ranging 10MB to 100MB each).
   128MB block size.
   10 Hadoop Nodes.
 
 I tested my InputFormat and RecordReader for my input format, and it
 showed about 56MB/s performance (single thread, no hadoop, passed in the
 test file via FileInputStream instead of FSDataInputStream) on
 hardware similar to what I have in my cluster.
 I also then tested some simple Map logic along w/ the above, and got
 around 54MB/s. I believe that difference can be accounted for by parsing
 the protobuf data into Java objects.
 
 Anyways, when I put this logic into a job that has
   - no reduce (.setNumReduceTasks(0);)
   - no emit
   - just protobuf parsing calls (like above)
 
 I get a finish time of 10mins, 25sec, which is about 106.24 MB/s.
 
 So my question: why is the rate only 2x what I see on a single-thread,
 non-hadoop test? Would it not be:
   54MB/s x 10 (Num Nodes) - small hadoop overhead?
 
 Is there any area of my configuration I should look into for tuning?
 
 Anyway I could get more accurate performance monitoring of my job?
 
 On a side note, I tried the same job after combining the files into
 about 11 files (still 30GB in size), and actually saw a decrease in
 performance (~90MB/s).
 
 Any help is appreciated. Thanks!
 
 Will
 
 some hadoop-site.xml values:
 dfs.replication  3
 io.file.buffer.size   65536
 dfs.datanode.handler.count  3
 mapred.tasktracker.map.tasks.maximum  6
 dfs.namenode.handler.count  5
 



Re: Disk configuration.

2009-07-13 Thread Scott Carey
For both the DN and TT you can provide a comma separated list of directories.

So, drive 1 could be /hadoop1
And drive 2 /hadoop2

Then in each of those there could be a dfs directory and another for task temp 
storage.

Hadoop will round-robin writes to these automatically.

dfs.data.dir might look something like this:
<property>
  <name>dfs.data.dir</name>
  <value>/hadoop1/dfs/data,/hadoop2/dfs/data</value>
  <description>Determines where on the local filesystem an DFS data node
  should store its blocks.  If this is a comma-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices.
  Directories that do not exist are ignored.
  </description>
</property>

And the local mapreduce dir might look like:
<property>
  <name>mapred.local.dir</name>
  <value>/hadoop1/tmp,/hadoop2/tmp</value>
  <description>The local directory where MapReduce stores intermediate
  data files.  May be a comma-separated list of
  directories on different devices in order to spread disk i/o.
  Directories that do not exist are ignored.
  </description>
</property>


On 7/13/09 11:50 AM, Dmitry Pushkarev u...@stanford.edu wrote:

Hi.



We're running a small 30-node cluster and in a few days will reinstall all the
software, so I want to change the HDD configuration, which was done a long
time ago and seems to be inefficient - each node has 2x 1TB drives that are
LVMed into one single volume.



How do people usually set up drives? For example, would it be better to mount
them as two separate folders and feed those folders to both the tasktracker and
the datanode? Or to set up LVM with RAID 0 to maximize bandwidth?



What I want is for the 2TB of drive space per node to be equally accessible to
both the tasktracker and the datanode, and I'm not sure that mounting two drives as
separate folders achieves that.  (For example, if a reducer fills one drive,
will it start writing the rest of the data to the second drive?)



Thanks.