RE: Hadoop/HDFS for scientific simulation output data analysis

2009-04-03 Thread Tu, Tiankai
Thanks for the comments, Matei.

The machines I ran the experiments on have 16 GB of memory each. I don't
see how a 64 MB buffer could be huge or bad for memory consumption. In fact,
I set it to a much larger value after initial rounds of tests showed abysmal
results using the default 64 KB buffer. Also, why is it necessary to compute
a checksum for every 512 bytes when only an end-to-end (whole-file) checksum
makes sense? I set it to a much larger value to avoid the overhead.

I didn't quite understand what you meant by bad for cache locality. The
jobs were IO bound in the first place; any cache effect was a second-order
concern, at least an order of magnitude smaller. Can you clarify which
particular computation (maybe within Hadoop) was made slow by a large io
buffer size?

What bothered you was exactly what bothered me, and it is what prompted me
to ask why the job tracker reported many more bytes read by the map tasks. I
can confirm that the experiments were set up correctly. In fact, the numbers
of map tasks were correctly reported by the job tracker: 1600 for the 1 GB
file dataset, 6400 for the 256 MB file dataset, and so forth.
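(As a quick sanity check on the expected volume: the 64 MB dataset is 1.6 TB
in 25,600 files, so 25,600 maps x 64 MB should come to about 1.6 TB of HDFS
reads, not the 28 TB reported---roughly 17x more than expected.)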

Tiankai

 

-Original Message-
From: Matei Zaharia [mailto:ma...@cloudera.com] 
Sent: Friday, April 03, 2009 11:21 AM
To: core-user@hadoop.apache.org
Subject: Re: Hadoop/HDFS for scientific simulation output data analysis

Hi Tiankai,

The one strange thing I see in your configuration as described is IO buffer
size and IO bytes per checksum set to 64 MB. This is much higher than the
recommended defaults, which are about 64 KB for buffer size and 1 KB or 512
bytes for checksum. (Actually I haven't seen anyone change checksum from its
default of 512 bytes.) Having huge buffers is bad for memory consumption and
cache locality.

The other thing that bothers me is that on your 64 MB data set, you have 28
TB of HDFS bytes read. This is off from number of map tasks * bytes per map
by an order of magnitude. Are you sure that you've generated the data set
correctly and that it's the only input path given to your job? Does
bin/hadoop dfs -dus <path to dataset> come out as 1.6 TB?

Matei

On Sat, Mar 28, 2009 at 4:10 PM, Tu, Tiankai
tiankai...@deshawresearch.com wrote:

 Hi,

 I have been exploring the feasibility of using Hadoop/HDFS to analyze
 terabyte-scale scientific simulation output datasets. After a set of
 initial experiments, I have a number of questions regarding (1) the
 configuration settings and (2) the IO read performance.

 Unlike typical Hadoop applications, post-simulation analysis usually
 processes one file at a time. So I wrote a
 WholeFileInputFormat/WholeFileRecordReader that reads an entire file
 without parsing the content, as suggested by the Hadoop wiki FAQ (a sketch
 of such an input format appears after the list below).

 Specifying WholeFileInputFormat as the input file format
 (conf.setInputFormat(FrameInputFormat.class)), I constructed a simple
 MapReduce program with the sole purpose of measuring how fast Hadoop/HDFS
 can read data. Here is the gist of the test program:

 - The map method does nothing; it returns immediately when called
 - No reduce task (conf.setNumReduceTasks(0))
 - JVM reused (conf.setNumTasksToExecutePerJvm(-1))
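
 For concreteness, here is a minimal, untested sketch of what such a
 whole-file input format can look like with the 0.19 mapred API (class and
 method names are illustrative, not my actual code):

// Untested sketch of a whole-file input format along the lines of the
// Hadoop wiki FAQ recipe; not the actual FrameInputFormat used in the tests.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;                       // never split: one map task per file
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, conf);
  }

  // Reads the entire file into a single BytesWritable record.
  static class WholeFileRecordReader
      implements RecordReader<NullWritable, BytesWritable> {
    private final FileSplit split;
    private final Configuration conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, Configuration conf) {
      this.split = split;
      this.conf = conf;
    }

    public boolean next(NullWritable key, BytesWritable value)
        throws IOException {
      if (processed) return false;      // exactly one record per file
      byte[] contents = new byte[(int) split.getLength()];  // files < 2 GB
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = fs.open(file);
      try {
        IOUtils.readFully(in, contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      value.set(contents, 0, contents.length);
      processed = true;
      return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return processed ? split.getLength() : 0; }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
    public void close() { }
  }
}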

 The detailed hardware/software configurations are listed below:

 Hardware:
 - 103 nodes, each with two 2.33 GHz quad-core processors and 16 GB memory
 - 1 GigE connection out of each node, connecting to a 1 GigE switch in the
 rack (3 racks in total)
 - Each rack switch has 4 10-GigE connections to a backbone full-bandwidth
 10-GigE switch (second-tier switch)
 - Software (md) RAID0 on 4 SATA disks, with a capacity of 500 GB per node
 - Raw RAID0 bulk data transfer rate around 200 MB/s (dd a 4 GB file after
 dropping the Linux VFS cache)

 Software:
 - 2.6.26-10smp kernel
 - Hadoop 0.19.1
 - Three nodes as namenode, secondary name node, and job tracker,
 respectively
 - Remaining 100 nodes as slaves, each running as both datanode and
 tasktracker

 Relevant hadoop-site.xml setting:
 - dfs.namenode.handler.count = 100
 - io.file.buffer.size = 67108864
 - io.bytes.per.checksum = 67108864
 - mapred.task.timeout = 120
 - mapred.map.max.attempts = 8
 - mapred.tasktracker.map.tasks.maximum = 8
 - dfs.replication = 3
 - topology.script.file.name set properly to a correct Python script

 Dataset characteristics:

 - Four datasets consisting of files of 1 GB, 256 MB, 64 MB, and 2 MB,
 respectively
 - Each dataset holds 1.6 terabytes of data (that is, 1600 1 GB files, 6400
 256 MB files, etc.)
 - Datasets populated into HDFS via a parallel C MPI program (linked with
 libhdfs.so) running on the 100 slave nodes
 - dfs block size set to be the file size, since otherwise accessing a
 single file may require network data transfer (a rough sketch of this
 per-file block-size setup follows below)
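
 Here is a rough, untested Java illustration of that last point (my actual
 population code is the C MPI program mentioned above; the path and sizes
 below are placeholders). Each file is created with a per-file block size
 equal to the file size, so the whole file stays in a single block:

// Untested illustration only: write one file whose HDFS block size equals
// its total size, so the file occupies a single block on one datanode.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PopulateOneFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long fileSize = 64L * 1024 * 1024;         // e.g. the 64 MB dataset
    byte[] buf = new byte[4 * 1024 * 1024];    // 4 MB write buffer
    Path out = new Path("/datasets/64mb/file-00000");  // placeholder path
    // bufferSize, replication, and blockSize are per-file create()
    // arguments; blockSize == fileSize keeps each file in one block.
    FSDataOutputStream os =
        fs.create(out, true, 4 * 1024 * 1024, (short) 3, fileSize);
    try {
      for (long written = 0; written < fileSize; written += buf.length) {
        os.write(buf, 0, (int) Math.min(buf.length, fileSize - written));
      }
    } finally {
      os.close();
    }
  }
}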

 I launched the test MapReduce job one after another (so

Re: Hadoop/HDFS for scientific simulation output data analysis

2009-04-03 Thread Matei Zaharia
Hadoop does checksums for each small chunk of the file (512 bytes by
default) and stores a list of checksums for each chunk with the file, rather
than storing just one checksum at the end. This makes it easier to read only
a part of a file and know that it's not corrupt without having to read and
checksum the whole file. It also lets you use smaller / simpler checksums
for each chunk, making them more efficient to compute than the larger
checksum that would be needed to provide the same level of safety for the
whole file.

The default buffer size is confusingly not 64 KB, it's 4 KB. It really is
bad for performance as you saw. But I'd recommend trying 64 or 128 KB before
jumping to 64 MB. 128K is also the setting Yahoo used in its 2000-node
performance tests (see http://wiki.apache.org/hadoop/FAQ).
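
If you want to experiment with that, something along these lines (an
untested fragment; the job class name is just a placeholder) sets a 128 KB
buffer in the job configuration while leaving the checksum interval alone:

import org.apache.hadoop.mapred.JobConf;

public class BufferSettings {
  // Returns a JobConf with a 128 KB IO buffer; io.bytes.per.checksum is
  // deliberately left at its 512-byte default.
  public static JobConf configure(Class<?> jobClass) {  // placeholder class
    JobConf conf = new JobConf(jobClass);
    conf.setInt("io.file.buffer.size", 128 * 1024);     // 128 KB
    return conf;
  }
}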

The reason big buffers may impair cache locality is that CPU caches are
typically a few MB. If you set your checksum size and buffer size to 64 MB,
then whenever you read a block, the CPU first has to checksum 64 MB worth of
data then start again at the beginning to read it and pass it through your
application. During the checksumming process, the first pages of data fell
out of CPU cache as you checksummed the later ones. Therefore, you have to
read them from memory again during the second scan. If you just had a 64 KB
block, it would stay in cache since the first time you read it. The same
issue happens if instead of checksumming you were copying from one buffer to
another (which does happen at a few places in Hadoop, and they tend to use
io.file.buffer.size). So while I haven't tried measuring performance with 64
MB vs 128 KB, I wouldn't be surprised if it leads to bad behavior, because
it's much higher than what anyone runs in production.

Finally, if you just want to sequentially process a file on each node and
you only want one logical input record per map, it might be better not to
use an input format that reads the record into memory at all. Instead, you
can have the map directly open the file, and have your InputFormat just
locate the map on the right node. This avoids copying the whole file into
memory before streaming it through your mapper. If, on the other hand, your
algorithm requires random access throughout the file, you do need to read it
all in. I think the WholeFileRecordReader in the FAQ is aimed at files
smaller than 256 MB / 1 GB.
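
As an untested sketch of that idea (the class name is made up), the input
format below produces one split per file, tags each split with the hosts
that store the file so the JobTracker can schedule the map locally, and
hands the map nothing but the file's path as a Text key; the map can then
open that path itself and stream through it:

// Illustrative input format, not part of the Hadoop distribution: one
// split per file, located on the nodes that hold the file's data, with
// the file path as the only record handed to the map.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class FileNameInputFormat extends FileInputFormat<Text, NullWritable> {

  @Override
  public InputSplit[] getSplits(JobConf conf, int numSplits) throws IOException {
    // numSplits hint ignored: exactly one split per input file.
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus status : listStatus(conf)) {
      FileSystem fs = status.getPath().getFileSystem(conf);
      // Hosts holding the file's first (and, with per-file block sizes,
      // only) block, so the split gets scheduled on a node with the data.
      BlockLocation[] blocks =
          fs.getFileBlockLocations(status, 0, status.getLen());
      String[] hosts = blocks.length > 0 ? blocks[0].getHosts() : new String[0];
      splits.add(new FileSplit(status.getPath(), 0, status.getLen(), hosts));
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }

  @Override
  public RecordReader<Text, NullWritable> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter) {
    final FileSplit fileSplit = (FileSplit) split;
    return new RecordReader<Text, NullWritable>() {
      private boolean done = false;
      public boolean next(Text key, NullWritable value) {
        if (done) return false;
        key.set(fileSplit.getPath().toString());  // the record is just the path
        done = true;
        return true;
      }
      public Text createKey() { return new Text(); }
      public NullWritable createValue() { return NullWritable.get(); }
      public long getPos() { return 0; }
      public float getProgress() { return done ? 1.0f : 0.0f; }
      public void close() { }
    };
  }
}

Inside the map, something like FileSystem.get(conf).open(new
Path(key.toString())) then gives you a stream you can read sequentially
without ever buffering the whole file in memory.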

RE: Hadoop/HDFS for scientific simulation output data analysis

2009-04-03 Thread Tu, Tiankai
Thanks for the update and suggestion, Matei. 

I can certainly construct an input text file listing all the files of a
dataset
(http://hadoop.apache.org/core/docs/r0.19.1/streaming.html#How+do+I+process+files%2C+one+per+map%3F),
then let the jobtracker dispatch the file names to the maps, and open the
files directly from within the map method. But the jobtracker merely treats
the file names as text input and makes no effort to assign a file (name) to
the nodes that store that file. As a result, a node opening a file is almost
certain to request data from a different data node---which destroys IO
locality (the very strength of Hadoop) and results in worse performance. (I
had verified this behavior earlier using Hadoop streaming.)

By the way, what is the largest size---in terms of total bytes, number
of files, and number of nodes---in your applications? Thanks.


Re: Hadoop/HDFS for scientific simulation output data analysis

2009-04-03 Thread Owen O'Malley

On Apr 3, 2009, at 1:41 PM, Tu, Tiankai wrote:


By the way, what is the largest size---in terms of total bytes, number
of files, and number of nodes---in your applications? Thanks.


The largest Hadoop application that has been documented is the Yahoo
Webmap:


10,000 cores
500 TB shuffle
300 TB compressed final output

http://developer.yahoo.net/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html

-- Owen