Hey Jason,
Thanks, I'll keep this on hand as I do more tests. I now have a C,
Java, and Python version of my testing program ;)
However, I particularly *like* the fact that there's caching going on
- it'll help out our application immensely, I think. I'll be looking
at the performance both with and without the cache.
Brian
On Apr 14, 2009, at 12:01 AM, jason hadoop wrote:
The following very simple program will tell the VM to drop the pages being cached for a file. I tend to spin this in a for loop when making large tar files, or otherwise working with large files, and the system performance really smooths out.

Since it uses open(path), it will churn through the inode cache and directories.

Something like this might actually significantly speed up HDFS on busy clusters, by running it over the blocks on the datanodes.
#define _XOPEN_SOURCE 600
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

/** Simple program to dump buffered data for specific files from the
 * buffer cache. Copyright Jason Venner 2009, License GPL */
int main( int argc, char** argv )
{
    int failCount = 0;
    int i;
    for( i = 1; i < argc; i++ ) {
        char* file = argv[i];
        int fd = open( file, O_RDONLY|O_LARGEFILE );
        if (fd == -1) {
            perror( file );
            failCount++;
            continue;
        }
        /* posix_fadvise returns the error number directly (it does not
         * set errno), so capture the return value for strerror(). */
        int rc = posix_fadvise( fd, 0, 0, POSIX_FADV_DONTNEED );
        if (rc != 0) {
            fprintf( stderr, "Failed to flush cache for %s: %s\n",
                     file, strerror( rc ) );
            failCount++;
        }
        close( fd );
    }
    exit( failCount );
}
On Mon, Apr 13, 2009 at 4:01 PM, Scott Carey <sc...@richrelevance.com> wrote:
On 4/12/09 9:41 PM, "Brian Bockelman" <bbock...@cse.unl.edu> wrote:
Ok, here's something perhaps even more strange. I removed the "seek" part out of my timings, so I was only timing the "read" instead of the "seek + read" as in the first case. I also turned the read-ahead down to 1 byte (aka, off).
The jump *always* occurs at 128KB, exactly.
Some random ideas:
I have no idea how FUSE interops with the Linux block layer, but 128K happens to be the default 'readahead' value for block devices, which may just be a coincidence.

For a disk 'sda', you check and set the value (in 512-byte blocks) with:

/sbin/blockdev --getra /dev/sda
/sbin/blockdev --setra [num blocks] /dev/sda
I know from my file system tests that the OS readahead is not activated until a series of sequential reads go through the block device, so truly random access is not affected by this. I've set it to 128MB and random iops does not change on an ext3 or xfs file system. If this applies to FUSE too, there may be reasons that this behavior differs.

Furthermore, even if readahead were involved, one would not expect randomly reading 4k to be slower than randomly reading up to the readahead size itself.

I also have no idea how much of the OS device queue and block device scheduler is involved with FUSE. If those are involved, then there's a bunch of stuff to tinker with there as well.
Lastly, an FYI if you don't already know the following. If the OS is caching pages, there is a way to flush these in Linux to evict the cache. See /proc/sys/vm/drop_caches.
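As a sketch, using that interface to drop everything looks like this (it requires root; writing 1 drops the page cache, 2 drops dentries and inodes, 3 drops both):

```shell
# Flush dirty pages to disk first -- drop_caches only evicts clean pages.
sync
# Drop pagecache, dentries, and inodes (must be run as root).
echo 3 > /proc/sys/vm/drop_caches
```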
I'm a bit befuddled. I know we say that HDFS is optimized for large, sequential reads, not random reads - but it seems that it's one bug-fix away from being a good general-purpose system. Heck if I can find what's causing the issues though...
Brian
--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422