Hi everyone, Currently, appending data to a file opened with APR_FOPEN_APPEND that has been locked by apr_file_lock() will cause a deadlock on Windows.
[See PR50058, https://bz.apache.org/bugzilla/show_bug.cgi?id=50058] This issue happens because atomic O_APPEND-style appends on Windows are implemented using file locks. An append happens while holding the file lock acquired with LockFile(), which is required to avoid a race condition between seeking to the end of file and writing data. The race is possible when multiple threads or processes are appending data to the same file. This approach causes a deadlock if the file has been previously locked with apr_file_lock(). (Note that it's perfectly legit to lock the file or its portion and perform the append after that.) Apart from this, using file locks for file appends impacts their speed and robustness. There is an overhead associated with locking the file, especially if the file is not local. The robustness is affected, because other writes to the same file may fail due to it being locked. Also, if a process is terminated in the middle of the append operation, it might take some time for the OS to release the file lock. During this time, the file would be inaccessible to other processes. This may affect applications such as httpd (with a multi-process MPM) that use APR_FOPEN_APPEND files for logging. This patch series fixes the issue by switching to the documented way to atomically append data with a single WriteFile() call. It requires passing special OVERLAPPED.Offset and OffsetHigh values (0xFFFFFFFF). On the ZwWriteFile() layer, this maps to a FILE_WRITE_TO_END_OF_FILE constant that instructs the OS (the corresponding file system driver) to write data to the file's end. Note that this approach is only used for files opened for synchronous I/O because in this case the I/O Manager maintains the current file position. Otherwise, the file offset returned or changed by the SetFilePointer() API is not guaranteed to be valid and that could, for instance, break apr_file_seek() calls after appending data. Sadly, if a file is opened for asynchronous I/O, this call to WriteFile() doesn't update the OVERLAPPED.Offset member to reflect the actual offset used when appending the data (which we could then use to make seeking and other operations involving filePtr work). Therefore, when appending to files opened for asynchronous I/O, we still use the old LockFile + SetFilePointer + WriteFile approach. Additional details on this can be found in: https://msdn.microsoft.com/en-us/library/windows/desktop/aa365747 https://msdn.microsoft.com/en-us/library/windows/hardware/ff567121 The implementation is split into four dependent patches. The first three patches are minor refactorings to the apr_file_write() function and lay the necessary groundwork for the core change. The fourth patch is the core change that fixes the issue in the described way and also includes the tests. The log messages are included in the beginning of each patch file. Thanks, Evgeny Kotkov
Minor refactoring of the Win32 file write code. * file_io/win32/readwrite.c (apr_file_write): Reuse the existing 'rv' variable to store the status code. Reduce the scope of the 'offset' variable. Patch by: Evgeny Kotkov <evgeny.kotkov {at} visualsvn.com> Index: file_io/win32/readwrite.c =================================================================== --- file_io/win32/readwrite.c (revision 1806361) +++ file_io/win32/readwrite.c (working copy) @@ -372,9 +372,9 @@ APR_DECLARE(apr_status_t) apr_file_write(apr_file_ return rv; } else { if (!thefile->pipe) { - apr_off_t offset = 0; - apr_status_t rc; if (thefile->append) { + apr_off_t offset = 0; + /* apr_file_lock will mutex the file across processes. * The call to apr_thread_mutex_lock is added to avoid * a race condition between LockFile and WriteFile @@ -381,15 +381,15 @@ APR_DECLARE(apr_status_t) apr_file_write(apr_file_ * that occasionally leads to deadlocked threads. */ apr_thread_mutex_lock(thefile->mutex); - rc = apr_file_lock(thefile, APR_FLOCK_EXCLUSIVE); - if (rc != APR_SUCCESS) { + rv = apr_file_lock(thefile, APR_FLOCK_EXCLUSIVE); + if (rv != APR_SUCCESS) { apr_thread_mutex_unlock(thefile->mutex); - return rc; + return rv; } - rc = apr_file_seek(thefile, APR_END, &offset); - if (rc != APR_SUCCESS) { + rv = apr_file_seek(thefile, APR_END, &offset); + if (rv != APR_SUCCESS) { apr_thread_mutex_unlock(thefile->mutex); - return rc; + return rv; } } if (thefile->pOverlapped) {
Minor refactoring of the Win32 file write code. * file_io/win32/readwrite.c (apr_file_write): Update filePtr in the asynchronous I/O mode after the WriteFile() call. This allows simplifying the condition by not checking for thefile->pipe property. Patch by: Evgeny Kotkov <evgeny.kotkov {at} visualsvn.com> Index: file_io/win32/readwrite.c =================================================================== --- file_io/win32/readwrite.c (revision 1806361) +++ file_io/win32/readwrite.c (working copy) @@ -398,6 +398,9 @@ APR_DECLARE(apr_status_t) apr_file_write(apr_file_ } rv = WriteFile(thefile->filehand, buf, (DWORD)*nbytes, &bwrote, thefile->pOverlapped); + if (rv == APR_SUCCESS && thefile->pOverlapped) { + thefile->filePtr += *nbytes; + } if (thefile->append) { apr_file_unlock(thefile); apr_thread_mutex_unlock(thefile->mutex); @@ -453,9 +456,6 @@ APR_DECLARE(apr_status_t) apr_file_write(apr_file_ } } } - if (rv == APR_SUCCESS && thefile->pOverlapped && !thefile->pipe) { - thefile->filePtr += *nbytes; - } } return rv; }
Minor refactoring of the Win32 file write code. * file_io/win32/readwrite.c (apr_file_write): Invert the if-else clauses to test against the positive condition. Patch by: Evgeny Kotkov <evgeny.kotkov {at} visualsvn.com> Index: file_io/win32/readwrite.c =================================================================== --- file_io/win32/readwrite.c (revision 1806361) +++ file_io/win32/readwrite.c (working copy) @@ -371,7 +371,11 @@ APR_DECLARE(apr_status_t) apr_file_write(apr_file_ } return rv; } else { - if (!thefile->pipe) { + if (thefile->pipe) { + rv = WriteFile(thefile->filehand, buf, (DWORD)*nbytes, &bwrote, + thefile->pOverlapped); + } + else { if (thefile->append) { apr_off_t offset = 0; @@ -406,10 +410,6 @@ APR_DECLARE(apr_status_t) apr_file_write(apr_file_ apr_thread_mutex_unlock(thefile->mutex); } } - else { - rv = WriteFile(thefile->filehand, buf, (DWORD)*nbytes, &bwrote, - thefile->pOverlapped); - } if (rv) { *nbytes = bwrote; rv = APR_SUCCESS;
Win32: Fix a deadlock when appending to locked files (PR50058). Currently, appending data to a file opened with APR_FOPEN_APPEND that has been locked by apr_file_lock() will cause a deadlock on Windows. [See PR50058, https://bz.apache.org/bugzilla/show_bug.cgi?id=50058] This issue happens because atomic O_APPEND-style appends on Windows are implemented using file locks. An append happens while holding the file lock acquired with LockFile(), which is required to avoid a race condition between seeking to the end of file and writing data. The race is possible when multiple threads or processes are appending data to the same file. This approach causes a deadlock if the file has been previously locked with apr_file_lock(). (Note that it's perfectly legit to lock the file or its portion and perform the append after that.) Apart from this, using file locks for file appends impacts their speed and robustness. There is an overhead associated with locking the file, especially if the file is not local. The robustness is affected, because other writes to the same file may fail due to it being locked. Also, if a process is terminated in the middle of the append operation, it might take some time for the OS to release the file lock. During this time, the file would be inaccessible to other processes. This may affect applications such as httpd (with a multi-process MPM) that use APR_FOPEN_APPEND files for logging. This patch fixes the issue by switching to the documented way to atomically append data with a single WriteFile() call. It requires passing special OVERLAPPED.Offset and OffsetHigh values (0xFFFFFFFF). On the ZwWriteFile() layer, this maps to a FILE_WRITE_TO_END_OF_FILE constant that instructs the OS (the corresponding file system driver) to write data to the file's end. Note that this approach is only used for files opened for synchronous I/O because in this case the I/O Manager maintains the current file position. Otherwise, the file offset returned or changed by the SetFilePointer() API is not guaranteed to be valid and that could, for instance, break apr_file_seek() calls after appending data. Sadly, if a file is opened for asynchronous I/O, this call to WriteFile() doesn't update the OVERLAPPED.Offset member to reflect the actual offset used when appending the data (which we could then use to make seeking and other operations involving filePtr work). Therefore, when appending to files opened for asynchronous I/O, we still use the old LockFile + SetFilePointer + WriteFile approach. Additional details on this can be found in: https://msdn.microsoft.com/en-us/library/windows/desktop/aa365747 https://msdn.microsoft.com/en-us/library/windows/hardware/ff567121 * file_io/win32/readwrite.c (apr_file_write): For files opened for synchronous I/O, use the documented way to perform an atomic append with a single WriteFile(). * test/testfile.c (): Include apr_thread_proc.h and apr_strings_.h (struct thread_file_append_ctx_t, thread_file_append_func): New test helpers. (test_atomic_append, test_append_locked): New tests. (testfile): Run the new tests. Patch by: Evgeny Kotkov <evgeny.kotkov {at} visualsvn.com> Index: file_io/win32/readwrite.c =================================================================== --- file_io/win32/readwrite.c (revision 1806361) +++ file_io/win32/readwrite.c (working copy) @@ -375,6 +375,39 @@ APR_DECLARE(apr_status_t) apr_file_write(apr_file_ rv = WriteFile(thefile->filehand, buf, (DWORD)*nbytes, &bwrote, thefile->pOverlapped); } + else if (thefile->append && !thefile->pOverlapped) { + OVERLAPPED ov = {0}; + + /* If the file is opened for synchronous I/O, take advantage of the + * documented way to atomically append data by calling WriteFile() + * with both the OVERLAPPED.Offset and OffsetHigh members set to + * 0xFFFFFFFF. This avoids calling LockFile() that is otherwise + * required to avoid a race condition between seeking to the end + * and writing data. Not locking the file improves robustness of + * such appends and avoids a deadlock when appending to an already + * locked file, as described in PR50058. + * + * We use this approach only for files opened for synchronous I/O + * because in this case the I/O Manager maintains the current file + * position. Otherwise, the file offset returned or changed by + * the SetFilePointer() API is not guaranteed to be valid and that + * could, for instance, break apr_file_seek() calls after appending + * data. Sadly, if a file is opened for asynchronous I/O, this + * call doesn't update the OVERLAPPED.Offset member to reflect the + * actual offset used when appending the data (which we could then + * use to make seeking and other operations involving filePtr work). + * Therefore, when appending to files opened for asynchronous I/O, + * we still use the LockFile + SetFilePointer + WriteFile approach. + * + * References: + * https://bz.apache.org/bugzilla/show_bug.cgi?id=50058 + * https://msdn.microsoft.com/en-us/library/windows/desktop/aa365747 + * https://msdn.microsoft.com/en-us/library/windows/hardware/ff567121 + */ + ov.Offset = MAXDWORD; + ov.OffsetHigh = MAXDWORD; + rv = WriteFile(thefile->filehand, buf, (DWORD)*nbytes, &bwrote, &ov); + } else { if (thefile->append) { apr_off_t offset = 0; Index: test/testfile.c =================================================================== --- test/testfile.c (revision 1806361) +++ test/testfile.c (working copy) @@ -21,6 +21,8 @@ #include "apr_general.h" #include "apr_poll.h" #include "apr_lib.h" +#include "apr_strings.h" +#include "apr_thread_proc.h" #include "testutil.h" #define DIRNAME "data" @@ -1628,6 +1630,169 @@ static void test_write_buffered_spanning_over_bufs apr_file_remove(fname, p); } +typedef struct thread_file_append_ctx_t { + apr_pool_t *pool; + const char *fname; + apr_size_t chunksize; + char val; + int num_writes; + char *errmsg; +} thread_file_append_ctx_t; + +static void * APR_THREAD_FUNC thread_file_append_func(apr_thread_t *thd, void *data) +{ + thread_file_append_ctx_t *ctx = data; + apr_status_t rv; + apr_file_t *f; + int i; + char *writebuf; + char *readbuf; + + rv = apr_file_open(&f, ctx->fname, + APR_FOPEN_READ | APR_FOPEN_WRITE | APR_FOPEN_APPEND, + APR_FPROT_OS_DEFAULT, ctx->pool); + if (rv) { + apr_thread_exit(thd, rv); + return NULL; + } + + writebuf = apr_palloc(ctx->pool, ctx->chunksize); + memset(writebuf, ctx->val, ctx->chunksize); + readbuf = apr_palloc(ctx->pool, ctx->chunksize); + + for (i = 0; i < ctx->num_writes; i++) { + apr_size_t bytes_written; + apr_size_t bytes_read; + apr_off_t offset; + + rv = apr_file_write_full(f, writebuf, ctx->chunksize, &bytes_written); + if (rv) { + apr_thread_exit(thd, rv); + return NULL; + } + /* After writing the data, seek back from the current offset and + * verify what we just wrote. */ + offset = -((apr_off_t)ctx->chunksize); + rv = apr_file_seek(f, APR_CUR, &offset); + if (rv) { + apr_thread_exit(thd, rv); + return NULL; + } + rv = apr_file_read_full(f, readbuf, ctx->chunksize, &bytes_read); + if (rv) { + apr_thread_exit(thd, rv); + return NULL; + } + if (memcmp(readbuf, writebuf, ctx->chunksize) != 0) { + ctx->errmsg = apr_psprintf( + ctx->pool, + "Unexpected data at file offset %" APR_OFF_T_FMT, + offset); + apr_thread_exit(thd, APR_SUCCESS); + return NULL; + } + } + + apr_file_close(f); + apr_thread_exit(thd, APR_SUCCESS); + + return NULL; +} + +static void test_atomic_append(abts_case *tc, void *data) +{ + apr_status_t rv; + apr_status_t thread_rv; + apr_file_t *f; + const char *fname = "data/testatomic_append.dat"; + unsigned int seed; + thread_file_append_ctx_t ctx1 = {0}; + thread_file_append_ctx_t ctx2 = {0}; + apr_thread_t *t1; + apr_thread_t *t2; + + apr_file_remove(fname, p); + + rv = apr_file_open(&f, fname, APR_FOPEN_WRITE | APR_FOPEN_CREATE, + APR_FPROT_OS_DEFAULT, p); + APR_ASSERT_SUCCESS(tc, "create file", rv); + apr_file_close(f); + + seed = (unsigned int)apr_time_now(); + abts_log_message("Random seed for test_atomic_append() is %u", seed); + srand(seed); + + /* Create two threads appending data to the same file. */ + apr_pool_create(&ctx1.pool, p); + ctx1.fname = fname; + ctx1.chunksize = 1 + rand() % 8192; + ctx1.val = 'A'; + ctx1.num_writes = 1000; + rv = apr_thread_create(&t1, NULL, thread_file_append_func, &ctx1, p); + APR_ASSERT_SUCCESS(tc, "create thread", rv); + + apr_pool_create(&ctx2.pool, p); + ctx2.fname = fname; + ctx2.chunksize = 1 + rand() % 8192; + ctx2.val = 'B'; + ctx2.num_writes = 1000; + rv = apr_thread_create(&t2, NULL, thread_file_append_func, &ctx2, p); + APR_ASSERT_SUCCESS(tc, "create thread", rv); + + rv = apr_thread_join(&thread_rv, t1); + APR_ASSERT_SUCCESS(tc, "join thread", rv); + APR_ASSERT_SUCCESS(tc, "no thread errors", thread_rv); + if (ctx1.errmsg) { + ABTS_FAIL(tc, ctx1.errmsg); + } + rv = apr_thread_join(&thread_rv, t2); + APR_ASSERT_SUCCESS(tc, "join thread", rv); + APR_ASSERT_SUCCESS(tc, "no thread errors", thread_rv); + if (ctx2.errmsg) { + ABTS_FAIL(tc, ctx2.errmsg); + } + + apr_file_remove(fname, p); +} + +static void test_append_locked(abts_case *tc, void *data) +{ + apr_status_t rv; + apr_file_t *f; + const char *fname = "data/testappend_locked.dat"; + apr_size_t bytes_written; + apr_size_t bytes_read; + char buf[64] = {0}; + + apr_file_remove(fname, p); + + rv = apr_file_open(&f, fname, + APR_FOPEN_WRITE | APR_FOPEN_CREATE | APR_FOPEN_APPEND, + APR_FPROT_OS_DEFAULT, p); + APR_ASSERT_SUCCESS(tc, "create file", rv); + + rv = apr_file_lock(f, APR_FLOCK_EXCLUSIVE); + APR_ASSERT_SUCCESS(tc, "lock file", rv); + + /* PR50058: Appending to a locked file should not deadlock. */ + rv = apr_file_write_full(f, "abc", 3, &bytes_written); + APR_ASSERT_SUCCESS(tc, "write to file", rv); + + apr_file_unlock(f); + apr_file_close(f); + + rv = apr_file_open(&f, fname, APR_FOPEN_READ, APR_FPROT_OS_DEFAULT, p); + APR_ASSERT_SUCCESS(tc, "open file", rv); + + rv = apr_file_read_full(f, buf, sizeof(buf), &bytes_read); + ABTS_INT_EQUAL(tc, APR_EOF, rv); + ABTS_INT_EQUAL(tc, 3, (int)bytes_read); + ABTS_STR_EQUAL(tc, "abc", buf); + + apr_file_close(f); + apr_file_remove(fname, p); +} + abts_suite *testfile(abts_suite *suite) { suite = ADD_SUITE(suite) @@ -1681,6 +1847,8 @@ abts_suite *testfile(abts_suite *suite) abts_run_test(suite, test_two_large_writes_buffered, NULL); abts_run_test(suite, test_small_and_large_writes_buffered, NULL); abts_run_test(suite, test_write_buffered_spanning_over_bufsize, NULL); + abts_run_test(suite, test_atomic_append, NULL); + abts_run_test(suite, test_append_locked, NULL); return suite; }