Re: Optimal setup for a test problem

2010-04-13 Thread Andrew Nguyen
Correction, they are 100Mbps NICs...

iperf shows that we're getting about 95 Mbits/sec from one node to another.
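
For reference, that number comes from the standard iperf server/client pair run
between two of the nodes (hostnames here are just placeholders):

   cluster-2$ iperf -s
   cluster-1$ iperf -c cluster-2 -t 30

~95 Mbits/sec is about what a healthy, otherwise idle 100Mbps link should report.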

On Apr 12, 2010, at 1:05 PM, Andrew Nguyen wrote:

 @Todd:
 
 I do need the sorting behavior, eventually.  However, I'll try it with zero 
 reduce jobs to see.
 
 @Alex:
 
 Yes, I was planning on incrementally building my mapper and reducer functions.
 Currently, the mapper takes the value, multiplies it by the gain, adds the
 offset, and outputs a new key/value pair.
 
 Started to run the tests but didn't know how long they should take with the
 parameters you listed below.  However, it seemed like there was no progress
 being made.  Ran it with increasing parameter values and the results are
 included below:
 
 Here is a run with nrFiles 1 and fileSize 10
 
 had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar 
 TestDFSIO -write -nrFiles 1 -fileSize 10
 TestFDSIO.0.0.4
 10/04/12 11:57:18 INFO mapred.FileInputFormat: nrFiles = 1
 10/04/12 11:57:18 INFO mapred.FileInputFormat: fileSize (MB) = 10
 10/04/12 11:57:18 INFO mapred.FileInputFormat: bufferSize = 100
 10/04/12 11:57:18 INFO mapred.FileInputFormat: creating control file: 10 mega 
 bytes, 1 files
 10/04/12 11:57:19 INFO mapred.FileInputFormat: created control files for: 1 
 files
 10/04/12 11:57:19 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
 the arguments. Applications should implement Tool for the same.
 10/04/12 11:57:19 INFO mapred.FileInputFormat: Total input paths to process : 
 1
 10/04/12 11:57:19 INFO mapred.JobClient: Running job: job_20100407_0017
 10/04/12 11:57:20 INFO mapred.JobClient:  map 0% reduce 0%
 10/04/12 11:57:27 INFO mapred.JobClient:  map 100% reduce 0%
 10/04/12 11:57:39 INFO mapred.JobClient:  map 100% reduce 100%
 10/04/12 11:57:41 INFO mapred.JobClient: Job complete: job_20100407_0017
 10/04/12 11:57:41 INFO mapred.JobClient: Counters: 18
 10/04/12 11:57:41 INFO mapred.JobClient:   Job Counters 
 10/04/12 11:57:41 INFO mapred.JobClient: Launched reduce tasks=1
 10/04/12 11:57:41 INFO mapred.JobClient: Launched map tasks=1
 10/04/12 11:57:41 INFO mapred.JobClient: Data-local map tasks=1
 10/04/12 11:57:41 INFO mapred.JobClient:   FileSystemCounters
 10/04/12 11:57:41 INFO mapred.JobClient: FILE_BYTES_READ=98
 10/04/12 11:57:41 INFO mapred.JobClient: HDFS_BYTES_READ=113
 10/04/12 11:57:41 INFO mapred.JobClient: FILE_BYTES_WRITTEN=228
 10/04/12 11:57:41 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=10485832
 10/04/12 11:57:41 INFO mapred.JobClient:   Map-Reduce Framework
 10/04/12 11:57:41 INFO mapred.JobClient: Reduce input groups=5
 10/04/12 11:57:41 INFO mapred.JobClient: Combine output records=0
 10/04/12 11:57:41 INFO mapred.JobClient: Map input records=1
 10/04/12 11:57:41 INFO mapred.JobClient: Reduce shuffle bytes=0
 10/04/12 11:57:41 INFO mapred.JobClient: Reduce output records=5
 10/04/12 11:57:41 INFO mapred.JobClient: Spilled Records=10
 10/04/12 11:57:41 INFO mapred.JobClient: Map output bytes=82
 10/04/12 11:57:41 INFO mapred.JobClient: Map input bytes=27
 10/04/12 11:57:41 INFO mapred.JobClient: Combine input records=0
 10/04/12 11:57:41 INFO mapred.JobClient: Map output records=5
 10/04/12 11:57:41 INFO mapred.JobClient: Reduce input records=5
 10/04/12 11:57:41 INFO mapred.FileInputFormat: - TestDFSIO - : write
 10/04/12 11:57:41 INFO mapred.FileInputFormat:Date  time: Mon 
 Apr 12 11:57:41 PST 2010
 10/04/12 11:57:41 INFO mapred.FileInputFormat:Number of files: 1
 10/04/12 11:57:41 INFO mapred.FileInputFormat: Total MBytes processed: 10
 10/04/12 11:57:41 INFO mapred.FileInputFormat:  Throughput mb/sec: 
 8.710801393728223
 10/04/12 11:57:41 INFO mapred.FileInputFormat: Average IO rate mb/sec: 
 8.710801124572754
 10/04/12 11:57:41 INFO mapred.FileInputFormat:  IO rate std deviation: 
 0.0017763302275007867
 10/04/12 11:57:41 INFO mapred.FileInputFormat: Test exec time sec: 22.757
 10/04/12 11:57:41 INFO mapred.FileInputFormat: 
 
 Here is a run with nrFiles 10 and fileSize 100:
 
 had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar 
 TestDFSIO -write -nrFiles 10 -fileSize 100
 TestFDSIO.0.0.4
 10/04/12 11:58:54 INFO mapred.FileInputFormat: nrFiles = 10
 10/04/12 11:58:54 INFO mapred.FileInputFormat: fileSize (MB) = 100
 10/04/12 11:58:54 INFO mapred.FileInputFormat: bufferSize = 100
 10/04/12 11:58:54 INFO mapred.FileInputFormat: creating control file: 100 
 mega bytes, 10 files
 10/04/12 11:58:55 INFO mapred.FileInputFormat: created control files for: 10 
 files
 10/04/12 11:58:55 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
 the arguments. Applications should implement Tool for the same.
 10/04/12 11:58:55 INFO mapred.FileInputFormat: Total input paths to process : 
 10
 10/04/12 11:58:55 INFO mapred.JobClient: Running job: job_20100407_0018
 10/04/12 11:58:56 INFO mapred.JobClient:  

Re: Optimal setup for a test problem

2010-04-13 Thread alex kamil
Andrew,

here are some tips for hadoop runtime config:
http://cloudepr.blogspot.com/2009/09/cluster-facilities-hardware-and.html
Also, here are some results from my cluster (1GE NICs, fiber): Dell 5500, 24GB,
8-core (16 hypervised), JBOD. I saw slightly better numbers on a different
4-node cluster with HP G5s.


- TestDFSIO - : write
   Date  time: Wed Mar 31 02:28:59 EDT 2010
   Number of files: 10
Total MBytes processed: 1
 Throughput mb/sec: 5.615639781416837
Average IO rate mb/sec: 5.631219863891602
 IO rate std deviation: 0.2928237500022612
Test exec time sec: 219.095

- TestDFSIO - : read
   Date  time: Wed Mar 31 02:32:21 EDT 2010
   Number of files: 10
Total MBytes processed: 1
 Throughput mb/sec: 10.662958800459787
Average IO rate mb/sec: 13.391314506530762
 IO rate std deviation: 8.181072283752508
Test exec time sec: 157.752
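
For anyone wanting to reproduce numbers like these: they come from the same
TestDFSIO invocations suggested earlier in the thread (the exact test jar name
will vary by distribution), followed by a -clean pass to remove the benchmark
files afterwards:

hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
hadoop jar hadoop-0.20.2-test.jar TestDFSIO -clean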

thanks
Alex


Re: Optimal setup for a test problem

2010-04-13 Thread Andrew Nguyen
Good to know...  The problem is that I'm in an academic environment that
needs a lot of convincing regarding new computational technologies.  I need
to show proven benefit before getting the funds to actually implement
anything.  These servers were the best I could come up with for this
proof-of-concept.

I changed some settings on the nodes and have been experimenting, and I'm
seeing about 3.4 MB/sec with TestDFSIO, which is pretty consistent with your
observations below.
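
As a rough sanity check on that number: 100 Mbps is about 12.5 MB/sec of raw
capacity, and Todd's rule of thumb of roughly half that under concurrent
transfers puts the ceiling near 5-6 MB/sec; since each datanode in a replicated
write pipeline is receiving and forwarding replica traffic over that same
100Mbps link, a client-visible write rate of 3-4 MB/sec seems plausible.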

Given that, would increasing the block size help my performance?  That should
result in fewer map tasks and keep the computation local for longer, right?  I
just need to show that the numbers are better than a single machine, even if it
means sacrificing redundancy (or other factors) in the current setup.
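
If you try that, keep in mind the HDFS block size is fixed when a file is
written, so the 6G input would have to be re-uploaded with the new setting.
On a 0.20 cluster something along these lines should work (file and path names
are just placeholders); setting replication to 1 at the same time also takes the
replication pipeline out of the picture for the test:

hadoop fs -D dfs.block.size=134217728 -D dfs.replication=1 -put samples.txt /data/samples.txt

With 128M blocks the same ~6G input would split into roughly 50 map tasks
instead of 101.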

@alex:

Thanks for the links; they give me another bit of evidence to convince
those controlling the money flow...

--Andrew

On Tue, 13 Apr 2010 10:29:06 -0700, Todd Lipcon t...@cloudera.com wrote:
 On Mon, Apr 12, 2010 at 1:45 PM, Andrew Nguyen 
 andrew-lists-had...@ucsfcti.org wrote:
 
 I don't think you can :-).  Sorry, they are 100Mbps NICs...  I get
 95Mbit/sec from one node to another with iperf.

 Should I still be expecting such dismal performance with just 100Mbps?

 
 Yes - in my experience on gigabit, when lots of transfers are going between
 the nodes, TCP performance actually drops to around half the network capacity.
 In the case of 100Mbps, this is probably going to be around 5MB/sec
 
 So when you're writing output at 3x replication, it's going to be very very
 slow on this network.
 
 -Todd
 
 


Optimal setup for a test problem

2010-04-12 Thread Andrew Nguyen
Hello,

I recently set up a 5-node cluster (1 master, 4 slaves) and am looking to use it 
to process high volumes of patient physiologic data.  As an initial exercise to 
gain a better understanding, I have attempted to run the following problem 
(which isn't the type of problem that Hadoop was really designed for, as is my 
understanding).

I have a 6G data file that contains key/value pairs of sample number and sample 
value.  I'd like to convert the values based on a gain/offset to their 
physical units.  I've set up a MapReduce job using streaming where the mapper 
does the conversion, and the reducer is just an identity reducer.  Based on 
other threads on the mailing list, my initial results are consistent in the 
fact that it takes considerably more time to process this in Hadoop than it 
does on my MacBook Pro (45 minutes vs. 13 minutes).  The input is a single 6G 
file and it looks like the file is being split into 101 map tasks.  This is 
consistent with the 64M block size.
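
For what it's worth, a mapper for this kind of conversion only needs a few
lines of Python; the gain/offset values below are made up, and the input is
assumed to be tab-separated sample-number/sample-value pairs:

#!/usr/bin/env python
# mapper.py - apply a (placeholder) gain/offset calibration to each sample
import sys

GAIN = 0.0625     # hypothetical gain
OFFSET = -512.0   # hypothetical offset

for line in sys.stdin:
    line = line.rstrip('\n')
    if not line:
        continue
    parts = line.split('\t', 1)
    if len(parts) != 2:
        continue
    key, value = parts
    sys.stdout.write('%s\t%s\n' % (key, float(value) * GAIN + OFFSET))

A streaming run along these lines matches the setup described above (jar and
HDFS paths will differ on your install); swapping the -reducer line for
-numReduceTasks 0 turns it into a map-only job with no sort:

hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
    -input /data/samples -output /data/samples-converted \
    -file mapper.py -mapper mapper.py \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer

The same script also gives the single-machine baseline:
python mapper.py < samples.txt > converted.txt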

So my questions are:

* Would it help to increase the block size to 128M?  Or, decrease the block 
size?  What are some key factors to think about with this question?
* Are there any other optimizations that I could employ?  I have looked into 
LzoCompression but I'd like to still work without compression since the 
single-threaded job that I'm comparing to doesn't use any sort of compression.  
I know I'm comparing apples to pears a little here so please feel free to 
correct this assumption.
* Is Hadoop really only good for jobs where the data doesn't fit on a single 
node?  At some level, I assume that it can still speed up jobs that do fit on 
one node, if only because you are performing tasks in parallel.

Thanks!

--Andrew

Re: Optimal setup for a test problem

2010-04-12 Thread alex kamil
Andrew,

I would also suggest running the DFSIO benchmark to isolate IO-related issues:

hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

There are additional tests specific to MapReduce - run hadoop jar
hadoop-0.20.2-test.jar for the complete list.

45 min for mapping 6GB on 5 nodes is way too high, assuming your gain/offset
conversion is a simple algebraic manipulation.

It takes less than 5 min to run a simple mapper (using streaming) on a 4-node
cluster on something like 10GB; the mapper I used was an awk command
extracting key:value pairs from a log (no reducer).

Thanks
Alex




On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon t...@cloudera.com wrote:

 Hi Andrew,

 Do you need the sorting behavior that having an identity reducer gives you?
 If not, set the number of reduce tasks to 0 and you'll end up with a map
 only job, which should be significantly faster.

 -Todd





 --
 Todd Lipcon
 Software Engineer, Cloudera



Re: Optimal setup for a test problem

2010-04-12 Thread Andrew Nguyen
I don't think you can :-).  Sorry, they are 100Mbps NICs...  I get 95Mbit/sec 
from one node to another with iperf.

Should I still be expecting such dismal performance with just 100Mbps?

On Apr 12, 2010, at 1:31 PM, Todd Lipcon wrote:

 On Mon, Apr 12, 2010 at 1:05 PM, Andrew Nguyen 
 andrew-lists-had...@ucsfcti.org wrote:
 
 5 identically spec'ed nodes, each has:
 
 2 GB RAM
 Pentium 4 3.0G with HT
 250GB HDD on PATA
 10Mbps NIC
 
 
 This is probably your issue - 10mbps nic? I didn't know you could even get
 those anymore!
 
 Hadoop runs on commodity hardware, but you're not likely to get reasonable
 performance with hardware like that.
 
 -Todd
 
 
 
 
 
 
 
 -- 
 Todd Lipcon
 Software Engineer, Cloudera



Re: Optimal setup for a test problem

2010-04-12 Thread Andrew Nguyen
I guess my question below can be rephrased as: what are the absolute minimum 
hardware requirements for me to still see 'better-than-a-single-machine' 
performance?

Thanks!

On Apr 12, 2010, at 1:45 PM, Andrew Nguyen wrote:

 I don't think you can :-).  Sorry, they are 100Mbps NICs...  I get 
 95Mbit/sec from one node to another with iperf.
 
 Should I still be expecting such dismal performance with just 100Mbps?
 