The following very simple program tells the VM to drop the pages being
cached for a file. I tend to run it in a for loop when making large tar
files, or otherwise working with large files, and system performance
really smooths out.
Since it uses open(path), it will still churn through the inode cache and
directory entries for the files it touches.
Something like this might actually significantly speed up HDFS on busy
clusters if run over the blocks on the datanodes (a rough sketch of that
follows the program).


#define _XOPEN_SOURCE 600
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

/** Simple program to dump buffered data for specific files from the buffer
cache. Copyright Jason Venner 2009, License GPL*/

int main( int argc, char** argv )
{
  int failCount = 0;
  int i;
  for( i = 1; i < argc; i++ ) {
    char* file = argv[i];
    int fd = open( file, O_RDONLY|O_LARGEFILE );
    int rc;
    if (fd == -1) {
      perror( file );
      failCount++;
      continue;
    }
    /* posix_fadvise returns the error number directly; it does not set errno. */
    rc = posix_fadvise( fd, 0, 0, POSIX_FADV_DONTNEED );
    if (rc != 0) {
      fprintf( stderr, "Failed to flush cache for %s: %s\n", file, strerror( rc ) );
      failCount++;
    }
    close(fd);
  }
  exit( failCount );
}
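
For the datanode idea above, here is a rough, untested sketch that points the
same POSIX_FADV_DONTNEED call at every regular file directly under a directory
instead of at individual arguments. Nothing in it is HDFS-specific; you would
pass whatever directory your dfs.data.dir points at, and it does not recurse
into subdirectories.

#define _XOPEN_SOURCE 600
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

/* Sketch: drop cached pages for every regular file directly under a
   directory (e.g. a datanode block directory). Non-recursive. */

int main( int argc, char** argv )
{
  DIR* dir;
  struct dirent* entry;
  if (argc != 2) {
    fprintf( stderr, "usage: %s <directory>\n", argv[0] );
    return 1;
  }
  dir = opendir( argv[1] );
  if (dir == NULL) {
    perror( argv[1] );
    return 1;
  }
  while ((entry = readdir( dir )) != NULL) {
    char path[4096];
    struct stat st;
    int fd;
    snprintf( path, sizeof(path), "%s/%s", argv[1], entry->d_name );
    if (stat( path, &st ) != 0 || !S_ISREG( st.st_mode ))
      continue; /* skip ".", "..", subdirectories, anything not a plain file */
    fd = open( path, O_RDONLY );
    if (fd == -1) {
      perror( path );
      continue;
    }
    /* Ask the VM to drop the cached pages for this file. */
    if (posix_fadvise( fd, 0, 0, POSIX_FADV_DONTNEED ) != 0)
      fprintf( stderr, "Failed to flush cache for %s\n", path );
    close( fd );
  }
  closedir( dir );
  return 0;
}

Whether this actually helps a busy datanode would need measuring, of course.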


On Mon, Apr 13, 2009 at 4:01 PM, Scott Carey <sc...@richrelevance.com> wrote:

>
> On 4/12/09 9:41 PM, "Brian Bockelman" <bbock...@cse.unl.edu> wrote:
>
> > Ok, here's something perhaps even more strange.  I removed the "seek"
> > part out of my timings, so I was only timing the "read" instead of the
> > "seek + read" as in the first case.  I also turned the read-ahead down
> > to 1-byte (aka, off).
> >
> > The jump *always* occurs at 128KB, exactly.
>
> Some random ideas:
>
> I have no idea how FUSE interops with the Linux block layer, but 128K
> happens to be the default 'readahead' value for block devices, which may
> just be a coincidence.
>
> For a disk 'sda', you check and set the value (in 512 byte blocks) with:
>
> /sbin/blockdev --getra /dev/sda
> /sbin/blockdev --setra [num blocks] /dev/sda
>
>
> I know from my file system tests that the OS readahead is not activated
> until a series of sequential reads goes through the block device, so truly
> random access is not affected by this.  I've set it to 128MB and random
> iops does not change on an ext3 or xfs file system.  If this applies to
> FUSE too, there may be reasons that this behavior differs.
> Furthermore, even if it did, one would not expect randomly reading 4k to
> be slower than randomly reading up to the readahead size itself.
>
> I also have no idea how much of the OS device queue and block device
> scheduler is involved with FUSE.  If those are involved, then there's a
> bunch of stuff to tinker with there as well.
>
> Lastly, an FYI if you don't already know the following.  If the OS is
> caching pages, there is a way to flush these in Linux to evict the cache.
> See /proc/sys/vm/drop_caches .
>
>
>
> >
> > I'm a bit befuddled.  I know we say that HDFS is optimized for large,
> > sequential reads, not random reads - but it seems that it's one bug-fix
> > away from being a good general-purpose system.  Heck if I can find
> > what's causing the issues though...
> >
> > Brian
> >
> >
>
>
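
On the /proc/sys/vm/drop_caches note at the end of Scott's mail: that knob is
system-wide rather than per-file, and it only drops clean pages, so a sync
beforehand matters. Just as an illustration, a minimal C sketch (must run as
root):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
  FILE* f;
  /* Only clean pages are dropped, so flush dirty data to disk first. */
  sync();
  f = fopen( "/proc/sys/vm/drop_caches", "w" );
  if (f == NULL) {
    perror( "/proc/sys/vm/drop_caches" );
    return 1;
  }
  /* 1 = page cache, 2 = dentries and inodes, 3 = both. */
  fputs( "3\n", f );
  fclose( f );
  return 0;
}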


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
