In practice (at LinkedIn), how long do you see the calls blocked for during fsyncs?
On Thu, May 24, 2012 at 1:40 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
> One issue with using the filesystem for persistence is that the
> synchronization in the filesystem is not great. In particular the fsync and
> fdatasync system calls block appends to the file, apparently for the entire
> duration of the fsync (which can be quite long). This is documented in some
> detail here:
> http://antirez.com/post/fsync-different-thread-useless.html
>
> This is a problem in 0.7 because our definition of a committed message is
> one written prior to calling fsync(). This is the only way to guarantee the
> message is on disk. We do not hand out any messages to consumers until an
> fsync call occurs. The problem is that regardless of whether the fsync is
> in a background thread or not it will block any produce requests to the
> file. This is buffered a bit in the client since our produce request is
> effectively async in 0.7, but it can lead to weird latency spikes
> nonetheless as this buffering gets filled.
>
> In 0.8 with replication the definition of a committed message changes to
> one that is replicated to multiple machines, not necessarily committed to
> disk. This is a different kind of guarantee with different strengths and
> weaknesses (pro: data can survive destruction of the file system on one
> machine, con: you will lose a few messages if you haven't sync'd and the
> power goes out). We will likely retain the flush interval and time settings
> for those who want fine-grained control over flushing, but it is less
> relevant.
>
> Unfortunately *any* call to fsync will block appends even in a background
> thread so how can we give control over physical disk persistence without
> introducing high latency for the producer? The answer is that the Linux
> pdflush daemon actually does a very similar thing to our flush parameters.
> pdflush is a daemon running on every Linux machine that controls the
> writing of buffered/cached data back to disk.
> It allows you to limit the amount of memory filled with dirty pages by
> giving it either a percentage of memory, a timeout after which any dirty
> page must be written, or a fixed number of dirty bytes.
>
> The question is, does pdflush block appends? The answer seems to be mostly
> no. It locks the page being flushed but not the whole file. The time to
> flush one page is actually usually pretty quick (plus I think it may not be
> flushing just-written pages anyway).
>
> I wrote some test code for this by modifying the code from the link above.
> Here are the results from my desktop (CentOS Linux 2.6.32).
>
> We run the test writing 1024 bytes every 100 us and calling fsync every
> 500 us:
>
> $ ./pdflush-test 1024 100 500
> 21
> 4
> 3
> 3
> 9
> 6
> Sync in 20277 us (0), sleeping for 500 us
> 19819
> 7
> 7
> 8
> 38
> Sync in 19470 us (0), sleeping for 500 us
> 19048
> 7
> 4
> 3
> 8
> 4
> Sync in 19405 us (0), sleeping for 500 us
> 19017
> 6
> 6
> 10
> 6
> Sync in 19410 us (0), sleeping for 500 us
> 19025
> 7
> 7
> 11
> 6
>
> $ cat /proc/sys/vm/dirty_writeback_centisecs
> 100
> $ cat /proc/sys/vm/dirty_expire_centisecs
> 500
>
> Now run the test with the background fsync thread effectively disabled
> (its sleep is so long that it rarely runs):
> $ ./pdflush-test 1024 100 5000000000000 > times.txt
>
> I ran this for 298,028 writes. The 99.9th percentile for this test is 17 us
> and the max time was 2043 us (2ms).
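
For anyone who wants to reproduce percentile figures like the ones above from times.txt, here is a minimal sketch of one way to do it. This is not the tooling used for the quoted numbers: percentile_us and load_samples are hypothetical helper names, and the nearest-rank definition of a percentile is an assumption about how the 99.9th percentile was computed. The percentile is passed in per-mille (999 means 99.9%) to keep the arithmetic in integers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* qsort comparator for long long latency samples (ascending). */
static int cmp_ll(const void *a, const void *b) {
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile (assumed definition): the smallest sample such
 * that at least per_mille/1000 of all samples are <= it.
 * Sorts the array in place. Returns -1 if n == 0 or per_mille is not in
 * the range 1..1000.
 */
long long percentile_us(long long *samples, size_t n, unsigned per_mille) {
    if (n == 0 || per_mille == 0 || per_mille > 1000)
        return -1;
    assert(samples != NULL);
    qsort(samples, n, sizeof *samples, cmp_ll);
    /* rank = ceil(n * per_mille / 1000), computed in integers */
    size_t rank = (size_t)(((unsigned long long)n * per_mille + 999) / 1000);
    return samples[rank - 1];
}

/*
 * Reads one integer per line from fp into a malloc'd array stored in *out.
 * Lines that do not start with a number (e.g. "Sync in ...") are skipped.
 * Returns the number of samples read; the caller frees *out.
 */
size_t load_samples(FILE *fp, long long **out) {
    char line[256];
    size_t n = 0, cap = 0;
    long long *buf = NULL;

    while (fgets(line, sizeof line, fp) != NULL) {
        char *end;
        long long v = strtoll(line, &end, 10);
        if (end == line)
            continue;                 /* no digits parsed on this line */
        if (n == cap) {
            size_t new_cap = cap ? cap * 2 : 1024;
            long long *tmp = realloc(buf, new_cap * sizeof *tmp);
            if (tmp == NULL)
                break;                /* out of memory: keep what we have */
            buf = tmp;
            cap = new_cap;
        }
        buf[n++] = v;
    }
    *out = buf;
    return n;
}
```

With the samples loaded from an opened times.txt, percentile_us(v, n, 999) would give the 99.9th percentile and percentile_us(v, n, 1000) the maximum.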
>
> Here is the test code:
>
> #include <stdio.h>
> #include <unistd.h>
> #include <string.h>
> #include <sys/types.h>
> #include <pthread.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/time.h>
> #include <stdlib.h>
>
> /* Current wall-clock time in microseconds. */
> static long long microseconds(void) {
>     struct timeval tv;
>     long long mst;
>
>     gettimeofday(&tv, NULL);
>     mst = ((long long)tv.tv_sec)*1000000;
>     mst += tv.tv_usec;
>     return mst;
> }
>
> /* Background thread: sleep, fsync the file, report how long it took. */
> void *IOThreadEntryPoint(void *arg) {
>     int fd, retval;
>     long long start;
>     long sleep_us = (long) arg;
>
>     while(1) {
>         usleep(sleep_us);
>         start = microseconds();
>         fd = open("/tmp/foo.txt",O_RDONLY);
>         if (fd == -1) {
>             perror("open");
>             exit(1);
>         }
>         retval = fsync(fd);
>         close(fd);
>         printf("Sync in %lld us (%d), sleeping for %ld us\n",
>                microseconds()-start, retval, sleep_us);
>     }
>     return NULL;
> }
>
> int main(int argc, char* argv[]) {
>     if(argc != 4) {
>         printf("USAGE: %s size write_sleep fsync_sleep\n", argv[0]);
>         exit(1);
>     }
>
>     pthread_t thread;
>     int fd = open("/tmp/foo.txt",O_WRONLY|O_CREAT,0644);
>     long long start;
>     long long elapsed;
>     int size = atoi(argv[1]);
>     long write_sleep = atol(argv[2]);
>     long fsync_sleep = atol(argv[3]);
>
>     if (fd == -1) {
>         perror("open");
>         exit(1);
>     }
>     if (size <= 0) {
>         fprintf(stderr, "size must be positive\n");
>         exit(1);
>     }
>     char buff[size];
>     memset(buff, 'x', size);   /* don't write uninitialized bytes */
>
>     pthread_create(&thread,NULL,IOThreadEntryPoint, (void*) fsync_sleep);
>
>     /* Main thread: time each write() and print the latency in us. */
>     while(1) {
>         start = microseconds();
>         if (write(fd,buff,size) == -1) {
>             perror("write");
>             exit(1);
>         }
>         elapsed = microseconds()-start;
>         printf("%lld\n", elapsed);
>         usleep(write_sleep);
>     }
>     close(fd);   /* not reached: the loop above never exits */
>     exit(0);
> }
>
> Cheers,
>
> -Jay
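
Since the results depend on the kernel writeback settings, it may help to record them alongside each run. Below is a minimal sketch that reads the relevant tunables, assuming they are exposed under /proc/sys/vm as in the `cat` commands quoted above. read_vm_tunable, centisecs_to_ms and print_writeback_settings are hypothetical helper names, not part of the test program; the directory is a parameter only so the helper can be pointed elsewhere.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: reads a single integer from the file dir/name
 * (e.g. dir = "/proc/sys/vm", name = "dirty_expire_centisecs").
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_vm_tunable(const char *dir, const char *name, long *value) {
    char path[256];

    assert(dir != NULL && name != NULL && value != NULL);
    int len = snprintf(path, sizeof path, "%s/%s", dir, name);
    if (len < 0 || (size_t)len >= sizeof path)
        return -1;                    /* path did not fit in the buffer */

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    int ok = (fscanf(fp, "%ld", value) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}

/* The *_centisecs tunables are in hundredths of a second. */
long centisecs_to_ms(long cs) {
    return cs * 10;
}

/* Prints the settings that bound dirty data by percentage, bytes, or age. */
void print_writeback_settings(const char *dir) {
    static const char *names[] = {
        "dirty_background_ratio",     /* percent of memory */
        "dirty_background_bytes",     /* fixed number of bytes */
        "dirty_expire_centisecs",     /* max age of a dirty page */
        "dirty_writeback_centisecs",  /* how often the flusher wakes up */
    };

    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        long v;
        if (read_vm_tunable(dir, names[i], &v) == 0)
            printf("%s = %ld\n", names[i], v);
        else
            printf("%s: not readable\n", names[i]);
    }
}
```

Calling print_writeback_settings("/proc/sys/vm") before a run would log the values; with the settings quoted above, centisecs_to_ms would turn the 500 centisecond expiry into 5000 ms.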