Andy's points are reasonable, but there are a few omissions:

- Modern file systems are pretty good at writing large files into contiguous blocks if they have a reasonable amount of space available.
- The seeks in question are likely to have more to do with checking directories for block locations than with seeking to the starts of small-ish files, because modern file systems tend to group together files that are written at about the same time.

- It is quite possible to build an HDFS-like file system that uses very small blocks.

There really are three considerations here that, when conflated, make the design more difficult than necessary. These three concepts are:

the primitive unit of disk allocation

This is the size of a disk allocation. For HDFS this is variable in size, since blocks can be smaller than the maximum size. The key problem with a large size here is that it is relatively difficult to allow quick reading of a file while it is still being written. With a smaller block size, a block can be committed in a way that lets a reader see it much sooner. Extremely large block sizes also make read/write file systems and snapshots more difficult, for basically the same reason. There is no strong reason that this has to be conflated with the striping chunk size. Putting HDFS on top of ext3 or ext4 kind of does this, but because HDFS knows nothing about the blocks in the underlying file system, you don't get the benefit.

the unit of node striping

This is the size of the data that is sent to each node, and it is intended to achieve read parallelism in map-reduce programs. It should be large enough that a map task takes a reasonable amount of time to process, which makes task scheduling easier. A few hundred megabytes is commonly a good size, but different applications may prefer sizes as small as a megabyte or as large as a few gigabytes.

the unit of scaling

It is typical that something somewhere needs to remember what got put where in the cluster. Currently the name node does this with blocks. Blocks are a bad choice here because they come and go quite often, which means the name node has to handle lots of changes, and because this makes caching the name node's data or persisting it to disk much harder. Blocks also tend to limit scaling because you have to have so many of them in a large system.

A counter-example to the design of HDFS is the MapR architecture. There, the disk blocks are 8 KB, chunks are a few hundred megabytes (but flexible within a single cluster), and the scaling unit is tens of gigabytes. Separating these concepts allows disk contiguity, efficient node striping, and simple HA of the file system.
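To make the unit-of-scaling point concrete, here is a rough back-of-envelope sketch of how many objects a central metadata service would have to track for 1 PB of raw data at a few different unit sizes (the particular sizes are illustrative assumptions, not figures from any specific system):

    // Rough arithmetic only: how many objects must be tracked per petabyte
    // at different allocation/scaling unit sizes.
    public class TrackingUnitSketch {
        public static void main(String[] args) {
            long petabyte = 1L << 50; // 1 PB of raw data, before replication

            String[] labels   = { "64 KB blocks", "128 MB blocks", "30 GB containers" };
            long[]   unitSize = { 64L << 10,      128L << 20,      30L << 30 };

            for (int i = 0; i < labels.length; i++) {
                long tracked = petabyte / unitSize[i];
                System.out.printf("%-17s -> %,d objects to track per PB%n", labels[i], tracked);
            }
        }
    }

The difference between tens of billions of tracked objects and tens of thousands is what makes caching, persisting, and replicating that metadata either hard or easy.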
On Fri, Nov 16, 2012 at 11:53 AM, Andy Isaacson <a...@cloudera.com> wrote:
> On Fri, Nov 16, 2012 at 10:55 AM, Pankaj Gupta <pan...@brightroll.com> wrote:
> > The Hadoop Definitive Guide provides a comparison with regular file
> > systems and indicates the advantage being a lower number of seeks (as far
> > as I understood it; maybe I read it incorrectly, and if so I apologize).
> > But, as I understand it, the data node stores data on a regular file
> > system. If this is so, then how does having a bigger HDFS block size
> > provide better seek performance, when the data will ultimately be read
> > from a regular file system which has a much smaller block size?
>
> Suppose that HDFS stored data in smaller blocks (64 KB for example).
> Then ext4 would have no reason to put those small files close together
> on disk, and reading from an HDFS file would mean reading from very
> many ext4 files, and probably would mean many seeks.
>
> The large block size design of HDFS avoids that problem by giving ext4
> the information it needs to optimize for our desired use case.
>
> > I see other advantages of a bigger block size though:
> >
> > Fewer entries on the NameNode to keep track of.
>
> That's another benefit.
>
> > Less switching from datanode to datanode for the HDFS client when
> > fetching the file. If the block size were small, just this switching
> > would reduce performance a lot. Perhaps this is the seek that the
> > definitive guide refers to.
>
> If one were building HDFS with a smaller block size, you'd probably
> have to overlap block fetches from many data nodes in order to get
> decent performance. So yes, this "switching", as you term it, would be a
> performance bottleneck.
>
> > Less overhead cost of setting up map tasks. The way MR usually works is
> > that one map task is created per block. Smaller blocks would mean less
> > computation per map task, and thus the overhead of setting up the map
> > task would become significant.
>
> An MR designed for a small-block HDFS would probably have to do
> something different rather than one mapper per block.
>
> > I want to make sure I understand the advantages of having a larger block
> > size. I specifically want to know whether there is any advantage in
> > terms of disk seeks; that one thing has got me very confused.
>
> Seems like you have a pretty good understanding of the issues, and I
> hope I clarified the seek issue above.
>
> -andy
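To put rough numbers on the seek argument in the exchange above, here is a quick sketch. It pessimistically assumes every small read unit costs a full seek, and the 10 ms seek time and 100 MB/s transfer rate are assumed ballpark figures for spinning disks of that era, not measurements:

    // Worst-case seek overhead for reading a 1 GB file, if every read unit
    // required its own disk seek.
    public class SeekCostSketch {
        public static void main(String[] args) {
            double seekMs       = 10.0;   // assumed average seek time
            double transferMBps = 100.0;  // assumed sequential transfer rate
            long   fileMB       = 1024;   // a 1 GB file

            long[] unitKB = { 64, 128 * 1024 }; // 64 KB units vs 128 MB units
            for (long kb : unitKB) {
                long units = fileMB * 1024 / kb;
                double seekSec     = units * seekMs / 1000.0;
                double transferSec = fileMB / transferMBps;
                System.out.printf("unit = %,d KB: %,d units, worst-case seek %.1f s, transfer %.1f s%n",
                        kb, units, seekSec, transferSec);
            }
        }
    }

With 64 KB units scattered across the disk, worst-case seek time (~164 s) would swamp the ~10 s of actual transfer; with 128 MB units the seek overhead is negligible, which is the point Andy makes about giving ext4 large files to lay out contiguously.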