Fw: problem in assigning an array

2010-04-13 Thread pinky priya


--- On Tue, 4/13/10, Dharani Selvaraj  wrote:

From: Dharani Selvaraj 
Subject: problem in assigning an array
To: k_gokulapr...@yahoo.com
Date: Tuesday, April 13, 2010, 4:16 PM

Hello,


   While we are trying to assign values to a 3-D array in the map function,
we have run into a problem: the values are not being assigned properly. We
have attached the map code and the input file. Can you tell us how to solve
this problem?





   


  

Re: Network problems Hadoop 0.20.2 and Terasort on Debian 2.6.32 kernel

2010-04-13 Thread stephen mulcahy

Todd Lipcon wrote:

Most likely a kernel bug. In previous versions of Debian there was a buggy
forcedeth driver, for example, that caused it to drop off the network in
high load. Who knows what new bug is in 2.6.32 which is brand spanking new.


Yes, it looks like it is a kernel bug alright (see thread on kernel 
netdev at http://marc.info/?t=12709428891&r=1&w=2 if interested). To 
be fair, I don't think these bugs are confined to Debian - I did some 
initial testing with Scientific Linux and also ran into problems with 
forcedeth.



The overwhelming majority of production clusters run on RHEL 5.3 or RHEL 5.4
in my experience (I'm lumping CentOS 5.3/5.4 in with RHEL here). I know one
or two production clusters running Debian Lenny, but none running something
as new as what you're talking about. 


This is useful info - much appreciated. I guess if we don't manage to 
stabilise the current config we'll look at moving to one of those.



Hadoop doesn't exercise the new
features in very recent kernels, so there's no sense accepting instability -
just go with something old that works!


Sure, but I figured I'd go with a distro now that can be largely left 
untouched for the next 2-3 years and Debian lenny felt that bit old for 
that. I know RHEL/CentOS would fit that requirement also, will see. I'm 
also interested in using DRBD in some of our nodes for redundancy, 
again, running with a newer distro should reduce the pain of configuring 
that.


Finally, I figured burning in our cluster was a good opportunity to give 
back to the community and do some testing on their behalf.


With regard to our TeraSort benchmark time of ~23 minutes - is that in 
the right ballpark for a cluster of 45 data nodes and a nn and 2nn?


Thanks,

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie http://webstar.deri.ie http://sindice.com


How do I use MapFile Reader and Writer

2010-04-13 Thread Placebo

I have a large text file, approximately 500MB, containing key/value pairs on
each line. I would like to use Hadoop's MapFile so that I can access any
key/value pair fairly quickly. To construct either the Reader or the Writer,
MapFile requires a Configuration object and a FileSystem object. I am
confused as to how to create either object, and why they are necessary.
Would someone be so kind as to demonstrate a trivial example of how I can
accomplish this?

Thanks in advance.
-- 
View this message in context: 
http://old.nabble.com/How-do-I-use-MapFile-Reader-and-Writer-tp28230683p28230683.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Optimal setup for a test problem

2010-04-13 Thread Andrew Nguyen
Correction, they are 100Mbps NIC's...

iperf shows that we're getting about 95 Mbits/sec from one node to another.

On Apr 12, 2010, at 1:05 PM, Andrew Nguyen wrote:

> @Todd:
> 
> I do need the sorting behavior, eventually.  However, I'll try it with zero 
> reduce jobs to see.
> 
> @Alex:
> 
> Yes, I was planning on incrementally building my mapper and reducer functions 
> so currently, the mapper takes the value and multiplies by the gain and adds 
> the offset and outputs a new key/value pair.
> 
> Started to run the tests but didn't know about how long it should take with 
> the parameters you listed below.  However, it seemed like there was no 
> progress being made.  Ran it with a increasing parameter values and results 
> are included below:
> 
> Here is a run with nrFiles 1 and fileSize 10
> 
> had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar 
> TestDFSIO -write -nrFiles 1 -fileSize 10
> TestFDSIO.0.0.4
> 10/04/12 11:57:18 INFO mapred.FileInputFormat: nrFiles = 1
> 10/04/12 11:57:18 INFO mapred.FileInputFormat: fileSize (MB) = 10
> 10/04/12 11:57:18 INFO mapred.FileInputFormat: bufferSize = 100
> 10/04/12 11:57:18 INFO mapred.FileInputFormat: creating control file: 10 mega 
> bytes, 1 files
> 10/04/12 11:57:19 INFO mapred.FileInputFormat: created control files for: 1 
> files
> 10/04/12 11:57:19 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
> the arguments. Applications should implement Tool for the same.
> 10/04/12 11:57:19 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 10/04/12 11:57:19 INFO mapred.JobClient: Running job: job_20100407_0017
> 10/04/12 11:57:20 INFO mapred.JobClient:  map 0% reduce 0%
> 10/04/12 11:57:27 INFO mapred.JobClient:  map 100% reduce 0%
> 10/04/12 11:57:39 INFO mapred.JobClient:  map 100% reduce 100%
> 10/04/12 11:57:41 INFO mapred.JobClient: Job complete: job_20100407_0017
> 10/04/12 11:57:41 INFO mapred.JobClient: Counters: 18
> 10/04/12 11:57:41 INFO mapred.JobClient:   Job Counters 
> 10/04/12 11:57:41 INFO mapred.JobClient: Launched reduce tasks=1
> 10/04/12 11:57:41 INFO mapred.JobClient: Launched map tasks=1
> 10/04/12 11:57:41 INFO mapred.JobClient: Data-local map tasks=1
> 10/04/12 11:57:41 INFO mapred.JobClient:   FileSystemCounters
> 10/04/12 11:57:41 INFO mapred.JobClient: FILE_BYTES_READ=98
> 10/04/12 11:57:41 INFO mapred.JobClient: HDFS_BYTES_READ=113
> 10/04/12 11:57:41 INFO mapred.JobClient: FILE_BYTES_WRITTEN=228
> 10/04/12 11:57:41 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=10485832
> 10/04/12 11:57:41 INFO mapred.JobClient:   Map-Reduce Framework
> 10/04/12 11:57:41 INFO mapred.JobClient: Reduce input groups=5
> 10/04/12 11:57:41 INFO mapred.JobClient: Combine output records=0
> 10/04/12 11:57:41 INFO mapred.JobClient: Map input records=1
> 10/04/12 11:57:41 INFO mapred.JobClient: Reduce shuffle bytes=0
> 10/04/12 11:57:41 INFO mapred.JobClient: Reduce output records=5
> 10/04/12 11:57:41 INFO mapred.JobClient: Spilled Records=10
> 10/04/12 11:57:41 INFO mapred.JobClient: Map output bytes=82
> 10/04/12 11:57:41 INFO mapred.JobClient: Map input bytes=27
> 10/04/12 11:57:41 INFO mapred.JobClient: Combine input records=0
> 10/04/12 11:57:41 INFO mapred.JobClient: Map output records=5
> 10/04/12 11:57:41 INFO mapred.JobClient: Reduce input records=5
> 10/04/12 11:57:41 INFO mapred.FileInputFormat: - TestDFSIO - : write
> 10/04/12 11:57:41 INFO mapred.FileInputFormat:Date & time: Mon 
> Apr 12 11:57:41 PST 2010
> 10/04/12 11:57:41 INFO mapred.FileInputFormat:Number of files: 1
> 10/04/12 11:57:41 INFO mapred.FileInputFormat: Total MBytes processed: 10
> 10/04/12 11:57:41 INFO mapred.FileInputFormat:  Throughput mb/sec: 
> 8.710801393728223
> 10/04/12 11:57:41 INFO mapred.FileInputFormat: Average IO rate mb/sec: 
> 8.710801124572754
> 10/04/12 11:57:41 INFO mapred.FileInputFormat:  IO rate std deviation: 
> 0.0017763302275007867
> 10/04/12 11:57:41 INFO mapred.FileInputFormat: Test exec time sec: 22.757
> 10/04/12 11:57:41 INFO mapred.FileInputFormat: 
> 
> Here is a run with nrFiles 10 and fileSize 100:
> 
> had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar 
> TestDFSIO -write -nrFiles 10 -fileSize 100
> TestFDSIO.0.0.4
> 10/04/12 11:58:54 INFO mapred.FileInputFormat: nrFiles = 10
> 10/04/12 11:58:54 INFO mapred.FileInputFormat: fileSize (MB) = 100
> 10/04/12 11:58:54 INFO mapred.FileInputFormat: bufferSize = 100
> 10/04/12 11:58:54 INFO mapred.FileInputFormat: creating control file: 100 
> mega bytes, 10 files
> 10/04/12 11:58:55 INFO mapred.FileInputFormat: created control files for: 10 
> files
> 10/04/12 11:58:55 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
> the arguments. Applications should implement Tool for the same.
> 10/04/12 11:58:55 INFO mapred.FileInputFormat: Total input paths to process : 
> 10
> 10/04/12 11:58:55 INFO mapred.

Announcement: Hadoop Training - new courses (Hive and HBase), new locations, and discounts

2010-04-13 Thread Christophe Bisciglia
Hadoop Fans, we wanted to share some news with the Hadoop community
about new upcoming courses, new locations, and a substantial discount
on next week's session in the Bay Area.

We're excited to offer an extended sysadmin course and new courses on
Hive and HBase at this year's Hadoop Summit. You can see full details
here: http://www.cloudera.com/hadoop-training/hadoop-summit-2010/ (and
if you join us, we'll pick up your Hadoop Summit registration fee!)

In order to allow more time for the team to prepare, we're going to
cancel our public developer session in the Bay Area this May. We
realize this can cause some inconvenience for those planning to
attend, so we're offering a 30% discount if you'd like to join us next
week instead. For those of you that can't wait until June, you can
register using this link:
http://www.eventbrite.com/event/603632481?discount=apache_discount

We still have courses scheduled in NYC (developers:
http://www.eventbrite.com/event/596084906 sysadmins:
http://www.eventbrite.com/event/596093933) and Chicago (developers:
http://www.eventbrite.com/event/630785697) this May for those of you
that can't make it to California. Also, for our friends across the
pond, we recently opened courses in London (developers:
http://www.eventbrite.com/event/635703406) and Berlin (developers:
http://www.eventbrite.com/event/635728481) in June for early
registration.

Hope to see you soon!

Cheers,
Christophe and the Cloudera Team

-- 
get hadoop: cloudera.com/hadoop
online training: cloudera.com/hadoop-training
blog: cloudera.com/blog
twitter: twitter.com/cloudera


Re: Optimal setup for a test problem

2010-04-13 Thread alex kamil
Andrew,

here are some tips for hadoop runtime config:
http://cloudepr.blogspot.com/2009/09/cluster-facilities-hardware-and.html
also

here are some results from my cluster (using 1GE NICs, fiber): Dell 5500,
24GB, 8-core (16 hypervised), JBOD. I saw slightly better numbers on a
different 4-node cluster with HP G5s.


- TestDFSIO - : write
   Date & time: Wed Mar 31 02:28:59 EDT 2010
   Number of files: 10
Total MBytes processed: 1
 Throughput mb/sec: 5.615639781416837
Average IO rate mb/sec: 5.631219863891602
 IO rate std deviation: 0.2928237500022612
Test exec time sec: 219.095

- TestDFSIO - : read
   Date & time: Wed Mar 31 02:32:21 EDT 2010
   Number of files: 10
Total MBytes processed: 1
 Throughput mb/sec: 10.662958800459787
Average IO rate mb/sec: 13.391314506530762
 IO rate std deviation: 8.181072283752508
Test exec time sec: 157.752

thanks
Alex

On Mon, Apr 12, 2010 at 4:19 PM, Andrew Nguyen  wrote:

> Correction, they are 100Mbps NIC's...
>
> iperf shows that we're getting about 95 Mbits/sec from one node to another.
>
> On Apr 12, 2010, at 1:05 PM, Andrew Nguyen wrote:
>
> > @Todd:
> >
> > I do need the sorting behavior, eventually.  However, I'll try it with
> zero reduce jobs to see.
> >
> > @Alex:
> >
> > Yes, I was planning on incrementally building my mapper and reducer
> functions so currently, the mapper takes the value and multiplies by the
> gain and adds the offset and outputs a new key/value pair.
> >
> > Started to run the tests but didn't know about how long it should take
> with the parameters you listed below.  However, it seemed like there was no
> progress being made.  Ran it with a increasing parameter values and results
> are included below:
> >
> > Here is a run with nrFiles 1 and fileSize 10
> >
> > had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar
> TestDFSIO -write -nrFiles 1 -fileSize 10
> > TestFDSIO.0.0.4
> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: nrFiles = 1
> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: fileSize (MB) = 10
> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: bufferSize = 100
> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: creating control file: 10
> mega bytes, 1 files
> > 10/04/12 11:57:19 INFO mapred.FileInputFormat: created control files for:
> 1 files
> > 10/04/12 11:57:19 WARN mapred.JobClient: Use GenericOptionsParser for
> parsing the arguments. Applications should implement Tool for the same.
> > 10/04/12 11:57:19 INFO mapred.FileInputFormat: Total input paths to
> process : 1
> > 10/04/12 11:57:19 INFO mapred.JobClient: Running job:
> job_20100407_0017
> > 10/04/12 11:57:20 INFO mapred.JobClient:  map 0% reduce 0%
> > 10/04/12 11:57:27 INFO mapred.JobClient:  map 100% reduce 0%
> > 10/04/12 11:57:39 INFO mapred.JobClient:  map 100% reduce 100%
> > 10/04/12 11:57:41 INFO mapred.JobClient: Job complete:
> job_20100407_0017
> > 10/04/12 11:57:41 INFO mapred.JobClient: Counters: 18
> > 10/04/12 11:57:41 INFO mapred.JobClient:   Job Counters
> > 10/04/12 11:57:41 INFO mapred.JobClient: Launched reduce tasks=1
> > 10/04/12 11:57:41 INFO mapred.JobClient: Launched map tasks=1
> > 10/04/12 11:57:41 INFO mapred.JobClient: Data-local map tasks=1
> > 10/04/12 11:57:41 INFO mapred.JobClient:   FileSystemCounters
> > 10/04/12 11:57:41 INFO mapred.JobClient: FILE_BYTES_READ=98
> > 10/04/12 11:57:41 INFO mapred.JobClient: HDFS_BYTES_READ=113
> > 10/04/12 11:57:41 INFO mapred.JobClient: FILE_BYTES_WRITTEN=228
> > 10/04/12 11:57:41 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=10485832
> > 10/04/12 11:57:41 INFO mapred.JobClient:   Map-Reduce Framework
> > 10/04/12 11:57:41 INFO mapred.JobClient: Reduce input groups=5
> > 10/04/12 11:57:41 INFO mapred.JobClient: Combine output records=0
> > 10/04/12 11:57:41 INFO mapred.JobClient: Map input records=1
> > 10/04/12 11:57:41 INFO mapred.JobClient: Reduce shuffle bytes=0
> > 10/04/12 11:57:41 INFO mapred.JobClient: Reduce output records=5
> > 10/04/12 11:57:41 INFO mapred.JobClient: Spilled Records=10
> > 10/04/12 11:57:41 INFO mapred.JobClient: Map output bytes=82
> > 10/04/12 11:57:41 INFO mapred.JobClient: Map input bytes=27
> > 10/04/12 11:57:41 INFO mapred.JobClient: Combine input records=0
> > 10/04/12 11:57:41 INFO mapred.JobClient: Map output records=5
> > 10/04/12 11:57:41 INFO mapred.JobClient: Reduce input records=5
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat: - TestDFSIO - :
> write
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:Date & time:
> Mon Apr 12 11:57:41 PST 2010
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:Number of files: 1
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat: Total MBytes processed: 10
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:  Throughput mb/sec:
> 8.710801393728223
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat: Average IO rate mb/sec:
> 8.7

Re: Optimal setup for a test problem

2010-04-13 Thread alex kamil
also http://www.slideshare.net/cloudera/hw09-optimizing-hadoop-deployments

On Tue, Apr 13, 2010 at 12:58 PM, alex kamil  wrote:

> Andrew,
>
> here are some tips for hadoop runtime config:
> http://cloudepr.blogspot.com/2009/09/cluster-facilities-hardware-and.html
> also
>
> here are some results from my cluster (using 1GE NICs, Fiber), Dell 5500,
> 24GB, 8-core (16 hypervised), JBOD, i saw slightly better numbers on a
> different 4-nodes cluster with HP G5s
>
>
> - TestDFSIO - : write
>Date & time: Wed Mar 31 02:28:59 EDT 2010
>Number of files: 10
> Total MBytes processed: 1
>  Throughput mb/sec: 5.615639781416837
> Average IO rate mb/sec: 5.631219863891602
>  IO rate std deviation: 0.2928237500022612
> Test exec time sec: 219.095
>
> - TestDFSIO - : read
>Date & time: Wed Mar 31 02:32:21 EDT 2010
>Number of files: 10
> Total MBytes processed: 1
>  Throughput mb/sec: 10.662958800459787
> Average IO rate mb/sec: 13.391314506530762
>  IO rate std deviation: 8.181072283752508
> Test exec time sec: 157.752
>
> thanks
> Alex
>
> On Mon, Apr 12, 2010 at 4:19 PM, Andrew Nguyen  wrote:
>
>> Correction, they are 100Mbps NIC's...
>>
>> iperf shows that we're getting about 95 Mbits/sec from one node to
>> another.
>>
>> On Apr 12, 2010, at 1:05 PM, Andrew Nguyen wrote:
>>
>> > @Todd:
>> >
>> > I do need the sorting behavior, eventually.  However, I'll try it with
>> zero reduce jobs to see.
>> >
>> > @Alex:
>> >
>> > Yes, I was planning on incrementally building my mapper and reducer
>> functions so currently, the mapper takes the value and multiplies by the
>> gain and adds the offset and outputs a new key/value pair.
>> >
>> > Started to run the tests but didn't know about how long it should take
>> with the parameters you listed below.  However, it seemed like there was no
>> progress being made.  Ran it with a increasing parameter values and results
>> are included below:
>> >
>> > Here is a run with nrFiles 1 and fileSize 10
>> >
>> > had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar
>> TestDFSIO -write -nrFiles 1 -fileSize 10
>> > TestFDSIO.0.0.4
>> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: nrFiles = 1
>> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: fileSize (MB) = 10
>> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: bufferSize = 100
>> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: creating control file: 10
>> mega bytes, 1 files
>> > 10/04/12 11:57:19 INFO mapred.FileInputFormat: created control files
>> for: 1 files
>> > 10/04/12 11:57:19 WARN mapred.JobClient: Use GenericOptionsParser for
>> parsing the arguments. Applications should implement Tool for the same.
>> > 10/04/12 11:57:19 INFO mapred.FileInputFormat: Total input paths to
>> process : 1
>> > 10/04/12 11:57:19 INFO mapred.JobClient: Running job:
>> job_20100407_0017
>> > 10/04/12 11:57:20 INFO mapred.JobClient:  map 0% reduce 0%
>> > 10/04/12 11:57:27 INFO mapred.JobClient:  map 100% reduce 0%
>> > 10/04/12 11:57:39 INFO mapred.JobClient:  map 100% reduce 100%
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Job complete:
>> job_20100407_0017
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Counters: 18
>> > 10/04/12 11:57:41 INFO mapred.JobClient:   Job Counters
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Launched reduce tasks=1
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Launched map tasks=1
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Data-local map tasks=1
>> > 10/04/12 11:57:41 INFO mapred.JobClient:   FileSystemCounters
>> > 10/04/12 11:57:41 INFO mapred.JobClient: FILE_BYTES_READ=98
>> > 10/04/12 11:57:41 INFO mapred.JobClient: HDFS_BYTES_READ=113
>> > 10/04/12 11:57:41 INFO mapred.JobClient: FILE_BYTES_WRITTEN=228
>> > 10/04/12 11:57:41 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=10485832
>> > 10/04/12 11:57:41 INFO mapred.JobClient:   Map-Reduce Framework
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Reduce input groups=5
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Combine output records=0
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Map input records=1
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Reduce shuffle bytes=0
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Reduce output records=5
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Spilled Records=10
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Map output bytes=82
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Map input bytes=27
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Combine input records=0
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Map output records=5
>> > 10/04/12 11:57:41 INFO mapred.JobClient: Reduce input records=5
>> > 10/04/12 11:57:41 INFO mapred.FileInputFormat: - TestDFSIO - :
>> write
>> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:Date & time:
>> Mon Apr 12 11:57:41 PST 2010
>> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:Number 

Re: Optimal setup for a test problem

2010-04-13 Thread Todd Lipcon
On Mon, Apr 12, 2010 at 1:45 PM, Andrew Nguyen <
andrew-lists-had...@ucsfcti.org> wrote:

> I don't think you can :-).  Sorry, they are 100Mbps NIC's...  I get
> 95Mbit/sec from one node to another with iperf.
>
> Should I still be expecting such dismal performance with just 100Mbps?
>

Yes - in my experience on gigabit, when lots of transfers are going between
the nodes, TCP performance actually drops to around half the network
capacity. In the case of 100Mbps, this is probably going to be around
5MB/sec.

So when you're writing output at 3x replication, it's going to be very very
slow on this network.
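
As rough arithmetic: 100Mbps is about 12.5MB/sec raw, roughly half of that
under concurrent transfers is ~6MB/sec, and with 3x replication every block
a task writes also has to be streamed on to two more nodes, so usable write
throughput easily ends up in the low single-digit MB/sec range (consistent
with the ~3.4MB/sec TestDFSIO figure reported later in this thread).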

-Todd


>
> On Apr 12, 2010, at 1:31 PM, Todd Lipcon wrote:
>
> > On Mon, Apr 12, 2010 at 1:05 PM, Andrew Nguyen <
> > andrew-lists-had...@ucsfcti.org> wrote:
> >
> >> 5 identically spec'ed nodes, each has:
> >>
> >> 2 GB RAM
> >> Pentium 4 3.0G with HT
> >> 250GB HDD on PATA
> >> 10Mbps NIC
> >>
> >
> > This is probably your issue - 10mbps nic? I didn't know you could even
> get
> > those anymore!
> >
> > Hadoop runs on commodity hardware, but you're not likely to get
> reasonable
> > performance with hardware like that.
> >
> > -Todd
> >
> >
> >> On Apr 12, 2010, at 11:58 AM, alex kamil wrote:
> >>
> >>> Andrew,
> >>>
> >>> I would also suggest to run DFSIO benchmark to isolate io related
> issues
> >>>
> >>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10
> -fileSize
> >> 1000
> >>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize
> >> 1000
> >>>
> >>> there are additional tests specific for mapreduce -  run  "hadoop jar
> >> hadoop-0.20.2-test.jar" for the complete list
> >>>
> >>> 45 min for mapping 6GB on 5 nodes is way too high assuming your
> >> gain/offset conversion is a simple algebraic manipulation
> >>>
> >>> it takes less than 5 min  to run a simple mapper (using streaming) on a
> >> 4 nodes cluster on something like 10GB, the mapper i used was an awk
> command
> >> extracting  pair from a log (no reducer)
> >>>
> >>> Thanks
> >>> Alex
> >>>
> >>>
> >>>
> >>>
> >>> On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon 
> wrote:
> >>> Hi Andrew,
> >>>
> >>> Do you need the sorting behavior that having an identity reducer gives
> >> you?
> >>> If not, set the number of reduce tasks to 0 and you'll end up with a
> map
> >>> only job, which should be significantly faster.
> >>>
> >>> -Todd
> >>>
> >>> On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen <
> >>> andrew-lists-had...@ucsfcti.org> wrote:
> >>>
>  Hello,
> 
>  I recently setup a 5 node cluster (1 master, 4 slaves) and am looking
> >> to
>  use it to process high volumes of patient physiologic data.  As an
> >> initial
>  exercise to gain a better understanding, I have attempted to run the
>  following problem (which isn't the type of problem that Hadoop was
> >> really
>  designed for, as is my understanding).
> 
>  I have a 6G data file, that contains key/value of  >> sample
>  value>.  I'd like to convert the values based on a gain/offset to
> their
>  physical units.  I've setup a MapReduce job using streaming where the
> >> mapper
>  does the conversion, and the reducer is just an identity reducer.
> >> Based on
>  other threads on the mailing list, my initial results are consistent
> in
> >> the
>  fact that it takes considerably more time to process this in Hadoop
> >> then it
>  is on my Macbook pro (45 minutes vs. 13 minutes).  The input is a
> >> single 6G
>  file and it looks like the file is being split into 101 map tasks.
> >> This is
>  consistent with the 64M block sizes.
> 
>  So my questions are:
> 
>  * Would it help to increase the block size to 128M?  Or, decrease the
> >> block
>  size?  What are some key factors to think about with this question?
>  * Are there any other optimizations that I could employ?  I have
> looked
>  into LzoCompression but I'd like to still work without compression
> >> since the
>  single thread job that I'm comparing to doesn't use any sort of
> >> compression.
>  I know I'm comparing apples to pears a little here so please feel free
> >> to
>  correct this assumption.
>  * Is Hadoop really only good for jobs where the data doesn't fit on a
>  single node?  At some level, I assume that it can still speedup jobs
> >> that do
>  fit on one node, if only because you are performing tasks in parallel.
> 
>  Thanks!
> 
>  --Andrew
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Todd Lipcon
> >>> Software Engineer, Cloudera
> >>>
> >>
> >>
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Network problems Hadoop 0.20.2 and Terasort on Debian 2.6.32 kernel

2010-04-13 Thread Todd Lipcon
On Tue, Apr 13, 2010 at 4:13 AM, stephen mulcahy
wrote:

> Todd Lipcon wrote:
>
>> Most likely a kernel bug. In previous versions of Debian there was a buggy
>> forcedeth driver, for example, that caused it to drop off the network in
>> high load. Who knows what new bug is in 2.6.32 which is brand spanking
>> new.
>>
>
> Yes, it looks like it is a kernel bug alright (see thread on kernel netdev
> at http://marc.info/?t=12709428891&r=1&w=2 if interested). To be fair,
> I don't think these bugs are confined to Debian - I did some initial testing
> with Scientific Linux and also ran into problems with forcedeth.


Interesting, good find. I try to avoid forcedeth now and have heard the same
from ops people at various large linux deployments. Not sure why, but it's
traditionally had a lot of bugs/regressions.


> Sure, but I figured I'd go with a distro now that can be largely left
> untouched for the next 2-3 years and Debian lenny felt that bit old for
> that. I know RHEL/CentOS would fit that requirement also, will see. I'm also
> interested in using DRBD in some of our nodes for redundancy, again, running
> with a newer distro should reduce the pain of configuring that.
>
> Finally, I figured burning in our cluster was a good opportunity to give
> back to the community and do some testing on their behalf.
>

Very admirable of you :) It is good to have some people running new kernels
to suss these issues out before the rest of us check out modern technology
;-)


>
> With regard to our TeraSort benchmark time of ~23 minutes - is that in the
> right ballpark for a cluster of 45 data nodes and a nn and 2nn?
>
>
Yep, sounds about the right ballpark.

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Optimal setup for a test problem

2010-04-13 Thread Andrew Nguyen
Good to know...  The problem is that I'm in an academic environment that
needs a lot of convincing regarding new computational technologies.  I need
to show proven benefit before getting the funds to actually implement
anything.  These servers were the best I could come up with for this
proof-of-concept.

I changed some settings on the nodes and have been experimenting - and I'm
seeing about 3.4 mb/sec with TestDFSIO which is pretty consistent with your
observations below.

Given that, would increasing the block size help my performance?  It should
result in fewer map tasks and keep the computation local for longer...?  I
just need to show that the numbers are better than a single machine's, even
if that means sacrificing redundancy (or other factors) in the current
setup.

@alex:

Thanks for the links, it gives me another bit of evidence to convince
those controlling the money flow...

--Andrew

On Tue, 13 Apr 2010 10:29:06 -0700, Todd Lipcon  wrote:
> On Mon, Apr 12, 2010 at 1:45 PM, Andrew Nguyen <
> andrew-lists-had...@ucsfcti.org> wrote:
> 
>> I don't think you can :-).  Sorry, they are 100Mbps NIC's...  I get
>> 95Mbit/sec from one node to another with iperf.
>>
>> Should I still be expecting such dismal performance with just 100Mbps?
>>
> 
> Yes - in my experience on gigabit, when lots of transfers are going
between
> the nodes, TCP performance actually drops to around half the network
> capacity. In the case of 100Mbps, this is probably going to be around
> 5MB/sec
> 
> So when you're writing output at 3x replication, it's going to be very
very
> slow on this network.
> 
> -Todd
> 
> 
>>
>> On Apr 12, 2010, at 1:31 PM, Todd Lipcon wrote:
>>
>> > On Mon, Apr 12, 2010 at 1:05 PM, Andrew Nguyen <
>> > andrew-lists-had...@ucsfcti.org> wrote:
>> >
>> >> 5 identically spec'ed nodes, each has:
>> >>
>> >> 2 GB RAM
>> >> Pentium 4 3.0G with HT
>> >> 250GB HDD on PATA
>> >> 10Mbps NIC
>> >>
>> >
>> > This is probably your issue - 10mbps nic? I didn't know you could
even
>> get
>> > those anymore!
>> >
>> > Hadoop runs on commodity hardware, but you're not likely to get
>> reasonable
>> > performance with hardware like that.
>> >
>> > -Todd
>> >
>> >
>> >> On Apr 12, 2010, at 11:58 AM, alex kamil wrote:
>> >>
>> >>> Andrew,
>> >>>
>> >>> I would also suggest to run DFSIO benchmark to isolate io related
>> issues
>> >>>
>> >>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10
>> -fileSize
>> >> 1000
>> >>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10
>> >>> -fileSize
>> >> 1000
>> >>>
>> >>> there are additional tests specific for mapreduce -  run  "hadoop
jar
>> >> hadoop-0.20.2-test.jar" for the complete list
>> >>>
>> >>> 45 min for mapping 6GB on 5 nodes is way too high assuming your
>> >> gain/offset conversion is a simple algebraic manipulation
>> >>>
>> >>> it takes less than 5 min  to run a simple mapper (using streaming)
>> >>> on a
>> >> 4 nodes cluster on something like 10GB, the mapper i used was an awk
>> command
>> >> extracting  pair from a log (no reducer)
>> >>>
>> >>> Thanks
>> >>> Alex
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon 
>> wrote:
>> >>> Hi Andrew,
>> >>>
>> >>> Do you need the sorting behavior that having an identity reducer
>> >>> gives
>> >> you?
>> >>> If not, set the number of reduce tasks to 0 and you'll end up with
a
>> map
>> >>> only job, which should be significantly faster.
>> >>>
>> >>> -Todd
>> >>>
>> >>> On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen <
>> >>> andrew-lists-had...@ucsfcti.org> wrote:
>> >>>
>>  Hello,
>> 
>>  I recently setup a 5 node cluster (1 master, 4 slaves) and am
>>  looking
>> >> to
>>  use it to process high volumes of patient physiologic data.  As an
>> >> initial
>>  exercise to gain a better understanding, I have attempted to run
the
>>  following problem (which isn't the type of problem that Hadoop was
>> >> really
>>  designed for, as is my understanding).
>> 
>>  I have a 6G data file, that contains key/value of > >> sample
>>  value>.  I'd like to convert the values based on a gain/offset to
>> their
>>  physical units.  I've setup a MapReduce job using streaming where
>>  the
>> >> mapper
>>  does the conversion, and the reducer is just an identity reducer.
>> >> Based on
>>  other threads on the mailing list, my initial results are
consistent
>> in
>> >> the
>>  fact that it takes considerably more time to process this in
Hadoop
>> >> then it
>>  is on my Macbook pro (45 minutes vs. 13 minutes).  The input is a
>> >> single 6G
>>  file and it looks like the file is being split into 101 map tasks.
>> >> This is
>>  consistent with the 64M block sizes.
>> 
>>  So my questions are:
>> 
>>  * Would it help to increase the block size to 128M?  Or, decrease
>>  the
>> >> block
>>  size?  What are some key factors to think about with this
question?
>>  * Are there any other optimizati

Re: Optimal setup for a test problem

2010-04-13 Thread Todd Lipcon
On Tue, Apr 13, 2010 at 11:40 AM, Andrew Nguyen <
andrew-lists-had...@ucsfcti.org> wrote:

> Good to know...  The problem is that I'm in an academic environment that
> needs a lot of convincing regarding new computational technologies.  I need
> to show proven benefit before getting the funds to actually implement
> anything.  These servers were the best I could come up with for this
> proof-of-concept.
>
> I changed some settings on the nodes and have been experimenting - and I'm
> seeing about 3.4 mb/sec with TestDFSIO which is pretty consistent with your
> observations below.
>
> Given that, would increasing the block sizes help my performance?  This
> should result in fewer map jobs and keeping the computation locally,
> longer...?  I just need to show that the numbers are better than a single
> machine, even if sacrificing redundancy (or other factors) in the current
> setup.
>
>
If that's your goal, set dfs.replication to 1 in your job - this will make
the output unreplicated, which means it won't go over the network. Of
course, you'll also lose data if a node goes down, but if your goal is to
cheat, it's an effective way of doing so!

You'll also get some benefit by using LZO compression to reduce the amount
of network transfer.
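
For a streaming job, passing -D dfs.replication=1 ahead of the
streaming-specific options should do it; from a Java driver, a minimal
sketch looks like the following (ConvertJob is a hypothetical driver class,
and the LZO codec line assumes the hadoop-lzo libraries are installed):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ConvertJob {                                   // hypothetical driver class
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ConvertJob.class);
    conf.set("dfs.replication", "1");                       // write job output once, no replica traffic
    conf.setCompressMapOutput(true);                        // compress intermediate map output
    conf.set("mapred.map.output.compression.codec",
             "com.hadoop.compression.lzo.LzoCodec");        // assumes hadoop-lzo is installed
    // ... set input/output paths, mapper and reducer classes as usual ...
    JobClient.runJob(conf);
  }
}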

-Todd


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Optimal setup for a test problem

2010-04-13 Thread Brian Bockelman
Hey Andrew,

I can name 3 California universities (San Diego, Caltech, Santa Barbara) that 
use Hadoop at a small (~20TB raw) or medium scale (~800TB raw).  Why not go 
talk to those guys?

Otherwise, you might just be able to confirm old hardware is old  (there's good 
money that you might be hard-drive limited, not network limited anyway.  
3.4MB/s triple replicated = 10MB/s on PATA, which might approach the hardware 
capability).  Alternately, you can always try running on Amazon, which allows 
you to test scaling at a very, very marginal cost.

Brian

On Apr 13, 2010, at 1:40 PM, Andrew Nguyen wrote:

> Good to know...  The problem is that I'm in an academic environment that
> needs a lot of convincing regarding new computational technologies.  I need
> to show proven benefit before getting the funds to actually implement
> anything.  These servers were the best I could come up with for this
> proof-of-concept.
> 
> I changed some settings on the nodes and have been experimenting - and I'm
> seeing about 3.4 mb/sec with TestDFSIO which is pretty consistent with your
> observations below.
> 
> Given that, would increasing the block sizes help my performance?  This
> should result in fewer map jobs and keeping the computation locally,
> longer...?  I just need to show that the numbers are better than a single
> machine, even if sacrificing redundancy (or other factors) in the current
> setup.
> 
> @alex:
> 
> Thanks for the links, it gives me another bit of evidence to convince
> those controlling the money flow...
> 
> --Andrew
> 
> On Tue, 13 Apr 2010 10:29:06 -0700, Todd Lipcon  wrote:
>> On Mon, Apr 12, 2010 at 1:45 PM, Andrew Nguyen <
>> andrew-lists-had...@ucsfcti.org> wrote:
>> 
>>> I don't think you can :-).  Sorry, they are 100Mbps NIC's...  I get
>>> 95Mbit/sec from one node to another with iperf.
>>> 
>>> Should I still be expecting such dismal performance with just 100Mbps?
>>> 
>> 
>> Yes - in my experience on gigabit, when lots of transfers are going
> between
>> the nodes, TCP performance actually drops to around half the network
>> capacity. In the case of 100Mbps, this is probably going to be around
>> 5MB/sec
>> 
>> So when you're writing output at 3x replication, it's going to be very
> very
>> slow on this network.
>> 
>> -Todd
>> 
>> 
>>> 
>>> On Apr 12, 2010, at 1:31 PM, Todd Lipcon wrote:
>>> 
 On Mon, Apr 12, 2010 at 1:05 PM, Andrew Nguyen <
 andrew-lists-had...@ucsfcti.org> wrote:
 
> 5 identically spec'ed nodes, each has:
> 
> 2 GB RAM
> Pentium 4 3.0G with HT
> 250GB HDD on PATA
> 10Mbps NIC
> 
 
 This is probably your issue - 10mbps nic? I didn't know you could
> even
>>> get
 those anymore!
 
 Hadoop runs on commodity hardware, but you're not likely to get
>>> reasonable
 performance with hardware like that.
 
 -Todd
 
 
> On Apr 12, 2010, at 11:58 AM, alex kamil wrote:
> 
>> Andrew,
>> 
>> I would also suggest to run DFSIO benchmark to isolate io related
>>> issues
>> 
>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10
>>> -fileSize
> 1000
>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10
>> -fileSize
> 1000
>> 
>> there are additional tests specific for mapreduce -  run  "hadoop
> jar
> hadoop-0.20.2-test.jar" for the complete list
>> 
>> 45 min for mapping 6GB on 5 nodes is way too high assuming your
> gain/offset conversion is a simple algebraic manipulation
>> 
>> it takes less than 5 min  to run a simple mapper (using streaming)
>> on a
> 4 nodes cluster on something like 10GB, the mapper i used was an awk
>>> command
> extracting  pair from a log (no reducer)
>> 
>> Thanks
>> Alex
>> 
>> 
>> 
>> 
>> On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon 
>>> wrote:
>> Hi Andrew,
>> 
>> Do you need the sorting behavior that having an identity reducer
>> gives
> you?
>> If not, set the number of reduce tasks to 0 and you'll end up with
> a
>>> map
>> only job, which should be significantly faster.
>> 
>> -Todd
>> 
>> On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen <
>> andrew-lists-had...@ucsfcti.org> wrote:
>> 
>>> Hello,
>>> 
>>> I recently setup a 5 node cluster (1 master, 4 slaves) and am
>>> looking
> to
>>> use it to process high volumes of patient physiologic data.  As an
> initial
>>> exercise to gain a better understanding, I have attempted to run
> the
>>> following problem (which isn't the type of problem that Hadoop was
> really
>>> designed for, as is my understanding).
>>> 
>>> I have a 6G data file, that contains key/value of  sample
>>> value>.  I'd like to convert the values based on a gain/offset to
>>> their
>>> physical units.  I've setup a MapReduce job using streaming where
>>> the
> mapper
>>> does the conve

Per-file block size

2010-04-13 Thread Andrew Nguyen
I thought I saw a way to specify the block size for individual files using the 
command-line using "hadoop dfs -put/copyFromLocal..."  However, I can't seem to 
find the reference anywhere.

I see that I can do it via the API, but I can't find any references to a
command-line mechanism.  Am I just remembering something that doesn't exist?
Or, can someone point me in the right direction?

Thanks!

--Andrew

Re: Per-file block size

2010-04-13 Thread Amogh Vasekar
Hi,
Pass the -D property on the command line, e.g.:
hadoop fs -Ddfs.block.size=<size in bytes> -put <localsrc> <dst>
You can check whether it actually took effect with: hadoop fs -stat %o <path>
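
For completeness, the API route mentioned in the original mail looks roughly
like this; a minimal sketch with a hypothetical destination path (the
five-argument create() takes the per-file block size as its last parameter):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutWithBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);                // the default (HDFS) filesystem
    Path dst = new Path("/user/andrew/data.bin");        // hypothetical destination path
    long blockSize = 128L * 1024 * 1024;                 // 128MB block size for this file only
    // create(path, overwrite, bufferSize, replication, blockSize)
    FSDataOutputStream out = fs.create(dst, true, 4096, (short) 3, blockSize);
    out.writeBytes("example payload");                   // stand-in for copying the real data
    out.close();
  }
}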


HTH,
Amogh


On 4/14/10 9:01 AM, "Andrew Nguyen"  wrote:

I thought I saw a way to specify the block size for individual files using the 
command-line using "hadoop dfs -put/copyFromLocal..."  However, I can't seem to 
find the reference anywhere.

I see that I can do it via the API but no references to a command-line 
mechanism.  Am I just remembering something that doesn't exist?  Or, can some 
point me in the right direction.

Thanks!

--Andrew



Re: How do I use MapFile Reader and Writer

2010-04-13 Thread Amogh Vasekar
Hi,
The file system object will contain the scheme, authority, etc. for the
given URI or path. The conf object acts as a reference (for lack of a better
term) to this info.
Looking at MapFileOutputFormat should help provide a better understanding of
how writers and readers are initialized.
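
A minimal sketch, assuming a hypothetical MapFile directory
/user/placebo/mymap and Text keys/values (note that keys must be appended in
sorted order):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // reads core-site.xml etc. from the classpath
    FileSystem fs = FileSystem.get(conf);            // filesystem the MapFile lives on (HDFS here)
    String dir = "/user/placebo/mymap";              // hypothetical MapFile directory

    // Write: keys must be appended in sorted order.
    MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
    writer.append(new Text("key1"), new Text("value1"));
    writer.append(new Text("key2"), new Text("value2"));
    writer.close();

    // Read: binary-searches the in-memory index, then seeks into the data file.
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    Text value = new Text();
    if (reader.get(new Text("key2"), value) != null) { // fills 'value' if the key exists
      System.out.println(value);
    }
    reader.close();
  }
}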

Hope this helps,
Amogh


On 4/13/10 7:33 PM, "Placebo"  wrote:



I have a large text file, approximately 500mb containing key value pairs on
each line. I would like to implement Hadoop MapFile so that I can access any
key,value pair fairly quickly. To construct either the Reader or Writer the
MapFile requires a Configurations object and a File System object. I am
confused as to how to create either object, and why they are necessary.
Would someone be so kind to demonstrate to me a trivial example as to how I
can accomplish this.

Thanks in advance.
--
View this message in context: 
http://old.nabble.com/How-do-I-use-MapFile-Reader-and-Writer-tp28230683p28230683.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.




stop scripts not working properly

2010-04-13 Thread abhishek sharma
Hi all,

I am using the Cloudera Hadoop distribution version 0.20.2+228.

I have a small 9 node cluster and when I try to stop the Hadoop DFS
and Mapred using
the stop-mapred.sh and stop-dfs.sh scripts, they fail to shut down some of
the TaskTrackers and DataNodes. I get a message saying there is no tasktracker
or datanode to stop, but when I log into the machines, I can see the
TaskTracker and DataNode processes still running (e.g., using jps).

I did not notice anything unusual in the log files. I am not sure what
might be the problem, but when I use Hadoop version 0.20.0, the scripts
work fine.

Any idea what might be happening?

Thanks,
Abhishek


Re: stop scripts not working properly

2010-04-13 Thread Todd Lipcon
Hi Abhishek,

Are you using the tarball or the RPMs/debs? The issue is most likely that
your pid files are ending up in /tmp and thus getting cleaned out
periodically.
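
If so, one workaround (this is an assumption about your setup, since I can't
see your configs) is to point the daemons at a directory that doesn't get
cleaned, in conf/hadoop-env.sh on every node, and then restart them:

export HADOOP_PID_DIR=/var/run/hadoop   # any persistent directory writable by the user running the daemons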

-Todd

On Tue, Apr 13, 2010 at 11:07 PM, abhishek sharma wrote:

> Hi all,
>
> I am using the Cloudera Hadoop distribution version 0.20.2+228.
>
> I have a small 9 node cluster and when I try to stop the Hadoop DFS
> and Mapred using
> the stop-mapred.sh and stop-dfs.sh scripts, it downs shutdown some of
> the TaskTrackers and DataNodes. I get a message saying no tasktracker
> or datanode to stop, but when I log into the machines, I can see the
> TaskTracker and DataNode processes running (for e.g. using jps).
>
> I did not notice anything unusal in the log files. I am not sure what
> might be the problem but when I use Hadoop version 0.20.0, the scripts
> work fine.
>
> Any idea what time be happening?
>
> Thanks,
> Abhishek
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: stop scripts not working properly

2010-04-13 Thread abhishek sharma
Hi Todd,

I am using the tarball.

Let me try configuring the pid files to be stored somewhere else.

Thanks for the tip,
Abhishek

On Tue, Apr 13, 2010 at 11:10 PM, Todd Lipcon  wrote:
> Hi Abhishek,
>
> Are you using the tarball or the RPMs/debs? The issue is most likely that
> your pid files are ending up in /tmp and thus getting cleaned out
> periodically.
>
> -Todd
>
> On Tue, Apr 13, 2010 at 11:07 PM, abhishek sharma wrote:
>
>> Hi all,
>>
>> I am using the Cloudera Hadoop distribution version 0.20.2+228.
>>
>> I have a small 9 node cluster and when I try to stop the Hadoop DFS
>> and Mapred using
>> the stop-mapred.sh and stop-dfs.sh scripts, it downs shutdown some of
>> the TaskTrackers and DataNodes. I get a message saying no tasktracker
>> or datanode to stop, but when I log into the machines, I can see the
>> TaskTracker and DataNode processes running (for e.g. using jps).
>>
>> I did not notice anything unusal in the log files. I am not sure what
>> might be the problem but when I use Hadoop version 0.20.0, the scripts
>> work fine.
>>
>> Any idea what time be happening?
>>
>> Thanks,
>> Abhishek
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>


Re: cluster under-utilization with Hadoop Fair Scheduler

2010-04-13 Thread abhishek sharma
Hi Ted,

Were you referring to the Hadoop 0.20.2 distribution or the CDH version?

I just looked at the FairScheduler assignTasks function in the Hadoop
0.20.2 distribution and it is the same as in version 0.20.0: it will assign
only one map and one reduce task to a tasktracker per heartbeat (as far as I
can tell from reading the code and from my experiments).
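
A toy calculation makes the effect concrete (this is not the scheduler code
itself, just a sketch under the assumption that every assigned task finishes
before the next heartbeat):

public class HeartbeatAssignmentSim {
  public static void main(String[] args) {
    int trackers = 100;          // tasktrackers in the cluster
    int slotsPerTracker = 2;     // map slots per tasktracker
    int heartbeats = 1000;       // heartbeats to simulate
    for (int assignPerHeartbeat : new int[] {1, 2}) {
      long busySlots = 0;
      for (int hb = 0; hb < heartbeats; hb++) {
        // Assumption: every task assigned at the previous heartbeat has already
        // finished, so all slots are free again and each tracker receives at
        // most assignPerHeartbeat new map tasks.
        busySlots += (long) trackers * Math.min(assignPerHeartbeat, slotsPerTracker);
      }
      double util = 100.0 * busySlots / ((long) trackers * slotsPerTracker * heartbeats);
      System.out.printf("%d task(s) per heartbeat -> ~%.0f%% map slot utilization%n",
          assignPerHeartbeat, util);
    }
  }
}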

Abhishek



On Sun, Apr 11, 2010 at 2:51 PM, Ted Yu  wrote:
> Reading assignTasks() in 0.20.2 reveals that the number of map tasks
> assigned is not limited to 1 per heartbeat.
>
> Cheers
>
> On Sun, Apr 11, 2010 at 12:30 PM, Todd Lipcon  wrote:
>
>> Hi Abhishek,
>>
>> This behavior is improved by MAPREDUCE-706 I believe (not certain that
>> that's the JIRA, but I know it's fixed in trunk fairscheduler). These
>> patches are included in CDH3 (currently in beta)
>> http://archive.cloudera.com/cdh/3/
>>
>> In general, though, map tasks that are so short are not going to be very
>> efficient - even with fast assignment there is some constant overhead per
>> task.
>>
>> Thanks
>> -Todd
>>
>> On Sun, Apr 11, 2010 at 11:42 AM, abhishek sharma 
>> wrote:
>>
>> > Hi all,
>> >
>> > I have been using the Hadoop Fair Scheduler for some experiments on a
>> > 100 node cluster with 2 map slots per node (hence, a total of 200 map
>> > slots).
>> >
>> > In one of my experiments, all the map tasks finish within a heartbeat
>> > interval of 3 seconds. I noticed that the maximum number of
>> > concurrently
>> > active map slots on my cluster never exceeds 100, and hence, the
>> > cluster utilization during my experiments never exceeds 50% even when
>> > large jobs with more than a 1000 maps are being executed.
>> >
>> > A look at the Fair Scheduler code (in particular, the assignTasks
>> > function) revealed the reason.
>> > As per my understanding, with the implementation in Hadoop 0.20.0, a
>> > TaskTracker is not assigned more than 1 map and 1 reduce task per
>> > heart beat.
>> >
>> > In my experiments, in every heart beat, each TT has 2 free map slots
>> > but is assigned only 1 map task, and hence, the utilization never goes
>> > beyond 50%.
>> >
>> > Of course, this (degenerate) case does not arise when map tasks take
>> > more than one 1 heart beat interval to finish. For example, I repeated
>> > the experiments with maps tasks taking close to 15 s to finish and
>> > noticed close to 100 % utilization when large jobs were executing.
>> >
>> > Why does the Fair Scheduler not assign more than one map task to a TT
>> > per heart beat? Is this done to spread the load uniformly across the
>> > cluster?
>> > I looked at assignTasks function in the default Hadoop scheduler
>> > (JobQueueTaskScheduler.java), and it does assign more than 1 map task
>> > per heart beat to a TT.
>> >
>> > It will be easy to change the Fair Scheduler to assign more than 1 map
>> > task to a TT per heart beat (I did that and achieved 100% utilization
>> > even with small map tasks). But I am wondering, if doing so will
>> > violate some fairness properties.
>> >
>> > Thanks,
>> > Abhishek
>> >
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>