In practice (at LinkedIn), how long do you see calls blocked during
fsyncs?

On Thu, May 24, 2012 at 1:40 PM, Jay Kreps <jay.kr...@gmail.com> wrote:

> One issue with using the filesystem for persistence is that the
> synchronization in the filesystem is not great. In particular the fsync and
> fdatasync system calls block appends to the file, apparently for the entire
> duration of the fsync (which can be quite long). This is documented in some
> detail here:
>  http://antirez.com/post/fsync-different-thread-useless.html
>
> This is a problem in 0.7 because our definition of a committed message is
> one written prior to calling fsync(). This is the only way to guarantee the
> message is on disk. We do not hand out any messages to consumers until an
> fsync call occurs. The problem is that regardless of whether the fsync is
> in a background thread or not it will block any produce requests to the
> file. This is buffered a bit in the client since our produce request is
> effectively async in 0.7, but it can lead to weird latency spikes
> nonetheless as this buffer fills up.
>
> In 0.8 with replication the definition of a committed message changes to
> one that is replicated to multiple machines, not necessarily committed to
> disk. This is a different kind of guarantee with different strengths and
> weaknesses (pro: data can survive destruction of the file system on one
> machine, con: you will lose a few messages if you haven't sync'd and the
> power goes out). We will likely retain the flush interval and time settings
> for those who want fine-grained control over flushing, but they are less
> relevant.
>
> Unfortunately, *any* call to fsync will block appends, even from a background
> thread, so how can we give control over physical disk persistence without
> introducing high latency for the producer? The answer is that the Linux
> pdflush daemon actually does something very similar to our flush parameters.
> pdflush is a daemon running on every Linux machine that controls the
> writing of buffered/cached data back to disk. It lets you bound how much
> dirty data sits in memory by setting a percentage of memory, a timeout
> after which any dirty page is written, or a fixed number of dirty bytes.
>
> The question is, does pdflush block appends? The answer seems to be mostly
> no. It locks the page being flushed but not the whole file. The time to
> flush one page is usually pretty quick (plus I think it may not be
> flushing just-written pages anyway). To test this, I modified the code from
> the link above. Here are the results from my desktop (CentOS, Linux 2.6.32).
>
> We run the test writing 1024 bytes every 100 us and flushing every 500 us:
>
> $ ./pdflush-test 1024 100 500
> 21
> 4
> 3
> 3
> 9
> 6
> Sync in 20277 us (0), sleeping for 500 us
> 19819
> 7
> 7
> 8
> 38
> Sync in 19470 us (0), sleeping for 500 us
> 19048
> 7
> 4
> 3
> 8
> 4
> Sync in 19405 us (0), sleeping for 500 us
> 19017
> 6
> 6
> 10
> 6
> Sync in 19410 us (0), sleeping for 500 us
> 19025
> 7
> 7
> 11
> 6
>
> $ cat /proc/sys/vm/dirty_writeback_centisecs
> 100
> $ cat /proc/sys/vm/dirty_expire_centisecs
> 500
>
> Now run the test with the background fsync effectively disabled (the sleep
> interval is so long that it rarely runs):
> $ ./pdflush-test 1024 100 5000000000000 > times.txt
>
> I ran this for 298,028 writes. The 99.9th percentile write latency for this
> test is 17 us and the maximum was 2043 us (about 2 ms).
>
> Here is the test code:
>
> #include <stdio.h>
> #include <unistd.h>
> #include <string.h>
> #include <sys/types.h>
> #include <pthread.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/time.h>
> #include <stdlib.h>
>
> static long long microseconds(void) {
>    struct timeval tv;
>    long long mst;
>
>    gettimeofday(&tv, NULL);
>    mst = ((long long)tv.tv_sec)*1000000;
>    mst += tv.tv_usec;
>    return mst;
> }
>
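> /* Background thread: periodically fsync the file and report how long it took. */
> void *IOThreadEntryPoint(void *arg) {
>    int fd, retval;
>    long long start;
>    long sleep_us = (long) arg;  /* renamed so it does not shadow sleep() */
>
>    while(1) {
>        usleep(sleep_us);
>        start = microseconds();
>        fd = open("/tmp/foo.txt",O_RDONLY);
>        if (fd == -1) {
>            perror("open");
>            continue;
>        }
>        retval = fsync(fd);
>        close(fd);
>        printf("Sync in %lld us (%d), sleeping for %ld us\n",
>               microseconds()-start, retval, sleep_us);
>    }
>    return NULL;
> }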
>
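> int main(int argc, char* argv[]) {
>    if(argc != 4) {
>      printf("USAGE: %s size write_sleep fsync_sleep\n", argv[0]);
>      exit(1);
>    }
>
>    pthread_t thread;
>    int fd = open("/tmp/foo.txt",O_WRONLY|O_CREAT,0644);
>    if (fd == -1) {
>        perror("open");
>        exit(1);
>    }
>    long long start;
>    long long elapsed;
>    int size = atoi(argv[1]);
>    long write_sleep = atol(argv[2]);
>    long fsync_sleep = atol(argv[3]);
>    if (size <= 0) {
>        fprintf(stderr, "size must be a positive number of bytes\n");
>        exit(1);
>    }
>    char buff[size];
>    memset(buff, 'x', size);  /* do not write uninitialized stack bytes */
>
>    /* Start the fsync thread; its sleep interval is passed in the pointer. */
>    if (pthread_create(&thread,NULL,IOThreadEntryPoint,
>                       (void*) fsync_sleep) != 0) {
>        fprintf(stderr, "pthread_create failed\n");
>        exit(1);
>    }
>
>    /* Time each write() and print the latency in microseconds. */
>    while(1) {
>        start = microseconds();
>        if (write(fd,buff,size) == -1) {
>            perror("write");
>            exit(1);
>        }
>        elapsed = microseconds()-start;
>        printf("%lld\n", elapsed);
>        usleep(write_sleep);
>    }
>    /* Not reached: the loop above runs until the process is killed. */
>    close(fd);
>    exit(0);
> }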
>
> Cheers,
>
> -Jay
>
