Re: httpd 2.4.25, mpm_event, ssl: segfaults
All writes to Linux sockets mean the kernel copies the data into 2kiB buffers used by SKBs. It's copied somewhere into the middle of that 2kiB buffer, so that TCP/IP headers can be prepended by the kernel. Even with TCP Segmentation Offload, 2kiB buffers are still used; it just means that the TCP/IP headers need to be calculated only once for an array of buffers, and then the kernel puts an array of pointers in the network card's ring buffer. The kernel will only put on the wire as much data as the current TCP congestion window allows, but it has to keep each packet in its buffers until the remote side ACKs that packet.

On Mon, Feb 27, 2017 at 2:25 PM, William A Rowe Jr <wr...@rowe-clan.net> wrote: > On Mon, Feb 27, 2017 at 12:16 PM, Jacob Champion <champio...@gmail.com> > wrote: > > > > On 02/23/2017 04:48 PM, Yann Ylavic wrote: > >> On Wed, Feb 22, 2017 at 8:55 PM, Daniel Lescohier wrote: > >>> > >>> > >>> IOW: read():Three copies: copy from filesystem cache to httpd > >>> read() buffer to encrypted-data buffer to kernel socket buffer. > > > >> > >> Not really, "copy from filesystem cache to httpd read() buffer" is > >> likely mapping to userspace, so no copy (on read) here. > > > > Oh, cool. Which kernels do this? It seems like the VM tricks would have > to > > be incredibly intricate for this to work; reads typically don't happen in > > page-sized chunks, nor to aligned addresses. Linux in particular has > > comments in the source explaining that they *don't* do it for other > syscalls > > (e.g. vmsplice)... but I don't have much experience with non-Linux > systems. > > I don't understand this claim. > > If read() returned an API-provisioned buffer, it could point wherever it > liked, > including a 4k page. As things stand the void* (or char*) of the read() > buffer > is at an arbitrary offset, no common OS I'm familiar with maps a page to > a non-page-aligned address. 
> > The kernel socket send[v]() call might avoid copy in the direct-send case, > depending on the implementation. >
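The buffer-ownership point above (the kernel copies out of the userspace buffer during write()/send(), so the caller may reuse the buffer as soon as the call returns) can be observed from userspace. A minimal Python sketch, not httpd code; the names are illustrative:

```python
import socket

# After send() returns, the kernel already holds its own copy of the bytes
# in socket buffers, so overwriting the userspace buffer afterwards does not
# affect what the peer receives.
a, b = socket.socketpair()
buf = bytearray(b"first message")
a.send(buf)            # kernel copies out of `buf` here
buf[:5] = b"XXXXX"     # reusing the buffer afterwards is safe
received = b.recv(1024)
a.close()
b.close()
assert received == b"first message"   # peer still sees the original bytes
```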
Re: httpd 2.4.25, mpm_event, ssl: segfaults
On Thu, Feb 23, 2017 at 7:48 PM, Yann Ylavic <ylavic@gmail.com> wrote: > On Wed, Feb 22, 2017 at 8:55 PM, Daniel Lescohier wrote: > > > > IOW: > > read():Three copies: copy from filesystem cache to httpd read() buffer to > > encrypted-data buffer to kernel socket buffer. > > Not really, "copy from filesystem cache to httpd read() buffer" is > likely mapping to userspace, so no copy (on read) here. > > > mmap(): Two copies: filesystem page already mapped into httpd, so just > copy > > from filesystem (cached) page to encrypted-data buffer to kernel socket > > buffer. > > So, as you said earlier the "write to socket" isn't a copy either, > hence both read() and mmap() implementations could work with a single > copy when mod_ssl is involved (this is more than a copy but you > counted it above so), and no copy at all without it. >

When you do a write() system call to a socket, the kernel must copy the data from the userspace buffer to its own buffers, because of data lifetime. When the write() system call returns, userspace is free to modify the buffer (which it owns). But the data from the last write() call must live a long time in the kernel. The kernel needs to keep a copy of it until the remote system ACKs all of it. The data is referenced first in the kernel transmission control system, then in the network card's ring buffers. If the remote system's feedback indicates that a packet was dropped or corrupted, the kernel may send it multiple times. The data has a different lifetime than the userspace buffer, so the kernel must copy it to a buffer it owns.

On the userspace high-order memory allocations: I still don't see what the problem is. Say you're using 64kiB buffers. When you free the buffers (e.g., at the end of the http request), they go into the memory allocator's 64kiB free-list, and they're available to be allocated again (e.g., by another http request). 
The memory allocator won't use the 64kiB free chunks for smaller allocations unless the free-lists for the smaller orders are emptied out. That'd mean there was a surge in demand for smaller-size allocations, so it'd make sense to start using the higher-order free chunks instead of calling brk(). Only if there are no more high-order free chunks left will the allocator have to call brk(). When the kernel gets the brk() request, if the system is short of high-order contiguous memory, it doesn't have to give contiguous physical pages for that brk() call. The page table entries for that request can be composed of many individual 4kiB pages scattered all over physical memory. That's hidden from userspace, which sees a contiguous range of virtual memory.
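The per-order free-list behavior described above can be sketched with a toy model (this is an illustration of the argument, not APR's or glibc's actual allocator): an exact-order free chunk is reused first, a larger free chunk is used only when the smaller lists are empty, and "brk()" (here just a counter) is the last resort.

```python
class ToyAllocator:
    def __init__(self):
        self.free_lists = {}    # order -> list of free chunk ids
        self.brk_calls = 0      # stand-in for brk()/heap extension
        self._next_id = 0

    def alloc(self, order):
        if self.free_lists.get(order):          # exact-fit free chunk first
            return self.free_lists[order].pop()
        for o in sorted(self.free_lists):       # else reuse a larger chunk
            if o > order and self.free_lists[o]:
                return self.free_lists[o].pop()
        self.brk_calls += 1                     # nothing free: extend heap
        self._next_id += 1
        return self._next_id

    def free(self, order, chunk):
        self.free_lists.setdefault(order, []).append(chunk)

alloc = ToyAllocator()
big = alloc.alloc(4)        # first 64kiB-class request: needs brk()
alloc.free(4, big)          # request done: chunk parked on the order-4 list
small = alloc.alloc(0)      # small request arrives with empty order-0 list:
assert small == big         # the high-order free chunk is reused...
assert alloc.brk_calls == 1 # ...instead of calling brk() again
```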
Re: httpd 2.4.25, mpm_event, ssl: segfaults
Why would high-order memory allocations be a problem in userspace code, which is using virtual memory? I thought high-order allocations are a big problem in kernel space, which has to deal with physical pages. But when you write to a socket, doesn't the kernel scatter the userspace buffer into multiple SKBs? SKBs on order-0 pages allocated by the kernel?

On Thu, Feb 23, 2017 at 1:16 PM, Jacob Champion wrote: > On 02/23/2017 08:34 AM, Yann Ylavic wrote: > > Actually I'm not very pleased with this solution (or the final one > > that would make this size open / configurable). > > The issue is potentially the huge (order big-n) allocations which > > finally may hurt the system (fragmentation, OOM...). > > Power users can break the system, and this is a power tool, right? And we > have HugeTLB kernels and filesystems to play with, with 2MB and bigger > pages... Making all these assumptions for 90% of users is perfect for the > out-of-the-box experience, but hardcoding them so that no one can fix > broken assumptions seems Bad. > > (And don't get me wrong, I think applying vectored I/O to the brigade > would be a great thing to try out and benchmark. I just think it's a > long-term and heavily architectural fix, when a short-term change to get > rid of some #defined constants could have immediate benefits.) > > I've no idea how much it costs to have 8K vs 16K records, though. >> Maybe in the mod_ssl case we'd want 16K buffers, still reasonable? >> > > We can't/shouldn't hardcode this especially. People who want maximum > throughput may want nice big records, but IIRC users who want progressive > rendering need smaller records so that they don't have to wait as long for > the first decrypted chunk. It needs to be tunable, possibly per-location. > > --Jacob >
Re: httpd 2.4.25, mpm_event, ssl: segfaults
On Wed, Feb 22, 2017 at 2:42 PM, Jacob Champion wrote: > Ah, but they *do*, as Yann pointed out earlier. We can't just deliver the > disk cache to OpenSSL for encryption; it has to be copied into some > addressable buffer somewhere. That seems to be a major reason for the > mmap() advantage, compared to a naive read() solution that just reads into > a small buffer over and over again. >

IOW:

read(): Three copies: copy from filesystem cache to httpd read() buffer to encrypted-data buffer to kernel socket buffer.

mmap(): Two copies: filesystem page already mapped into httpd, so just copy from filesystem (cached) page to encrypted-data buffer to kernel socket buffer.
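The two copy counts above can be sketched from userspace (a minimal Python illustration; `fake_encrypt` is a hypothetical stand-in for the copy the SSL layer makes into its own buffer): with read(), the file data is first copied into a buffer we own, while with mmap() the mapping aliases the filesystem cache and a slice can be handed straight to the encryption step.

```python
import mmap, os, tempfile

data = b"x" * 8192
fd, path = tempfile.mkstemp()
os.write(fd, data)

# read() path: explicit copy from the page cache into our buffer
os.lseek(fd, 0, os.SEEK_SET)
read_buf = os.read(fd, 8192)

# mmap() path: no read copy; a zero-copy view of the cached pages
m = mmap.mmap(fd, 8192, access=mmap.ACCESS_READ)
mmap_view = memoryview(m)

def fake_encrypt(chunk):
    return bytes(chunk)    # the one unavoidable copy when mod_ssl is involved

same = fake_encrypt(read_buf) == fake_encrypt(mmap_view)
mmap_view.release()
m.close()
os.close(fd)
os.unlink(path)
assert same    # both paths feed identical bytes downstream
```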
Re: httpd 2.4.25, mpm_event, ssl: segfaults
Is the high-level issue that: for serving static content over HTTP, you can use sendfile() from the OS filesystem cache, avoiding extra userspace copying; but if it's SSL, or any other dynamic filtering of content, you have to do extra work in userspace?

On Thu, Feb 16, 2017 at 6:01 PM, Yann Ylavic wrote: > On Thu, Feb 16, 2017 at 10:51 PM, Jacob Champion > wrote: > > On 02/16/2017 02:49 AM, Yann Ylavic wrote: > >> > >> +#define FILE_BUCKET_BUFF_SIZE (64 * 1024 - 64) /* > > APR_BUCKET_BUFF_SIZE > >> */ > > > > > > So, I had already hacked my O_DIRECT bucket case to just be a copy of > APR's > > file bucket, minus the mmap() logic. I tried making this change on top of > > it... > > > > ...and holy crap, for regular HTTP it's *faster* than our current mmap() > > implementation. HTTPS is still slower than with mmap, but faster than it > was > > without the change. (And the HTTPS performance has been really variable.) > > > > Can you confirm that you see a major performance improvement with > > the new 64K file buffer? > > I can't test speed for now (stick with my laptop/localhost, which > won't be relevant enough I guess). > > > I'm pretty skeptical of my own results at this > > point... but if you see it too, I think we need to make *all* these > > hard-coded numbers tunable in the config. > > We could also improve the apr_bucket_alloc()ator to recycle more > order-n allocation possibilities (saving as much > {apr_allocator_,m}alloc() calls), along with configurable/higher > orders in httpd that'd be great I think. > > I can try this patch... >
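The sendfile() path for plain static HTTP described above can be sketched in a few lines (Linux-specific; AF_UNIX sockets are used here only to keep the demo local): the kernel moves the bytes from the filesystem cache to the socket directly, so no userspace buffer ever holds the data, and therefore no filter such as SSL can touch it.

```python
import os, socket, tempfile

payload = b"static file body\n"
fd, path = tempfile.mkstemp()
os.write(fd, payload)

a, b = socket.socketpair()
# In-kernel transfer: page cache -> socket, no userspace copy
sent = os.sendfile(a.fileno(), fd, 0, len(payload))
received = b.recv(1024)
a.close()
b.close()
os.close(fd)
os.unlink(path)
assert sent == len(payload)
assert received == payload
```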
Re: httpd 2.4.25, mpm_event, ssl: segfaults
Here is how cache page replacement is done in Linux: https://linux-mm.org/PageReplacementDesign

On Tue, Feb 7, 2017 at 5:32 AM, Niklas Edmundsson wrote: > On Mon, 6 Feb 2017, Jacob Champion wrote: > > > > Considering the massive amount of caching that's built into the entire >> HTTP ecosystem already, O_DIRECT *might* be an effective way to do that (in >> which we give up filesystem optimizations and caching in return for a DMA >> into userspace). I have a PoC about halfway done, but I need to split my >> time this week between this and the FCGI stuff I've been neglecting. >> > > As O_DIRECT bypasses the cache, all your IO will hit your storage device. > I think this makes it useful only in exotic usecases. > > For our workload that's using mod_cache with a local disk cache we still > want to use the remaining RAM as a disk read cache, it doesn't make sense > to force disk I/O for files that simply could have been served from RAM. We > typically see 90-95% of the requests served by mod_cache actually being > served from the disk cache in RAM rather than hitting the local disk cache > devices. I'm suspecting the picture is similar for most occurrences of > static file serving, regardless of using mod_cache etc or not. > > O_DIRECT also bypasses any read-ahead logic, so you'll have to do nice and > big IO etc to get good performance. > > We've played around with O_DIRECT to optimize the caching process in our > large-file caching module (our backing store is nfs). However, since all > our hosts are running Linux we had much better results with doing plain > reads and utilizing posix_fadvise with POSIX_FADV_WILLNEED to trigger > read-ahead and POSIX_FADV_DONTNEED to drop the original file from cache > when read (as future requests will be served from local disk cache). We're > doing 8MB fadvise chunks to get full streaming performance when caching > large files. 
> > But that's way out of scope for this discussion, I think ;-) > > In conclusion, I wouldn't expect any benefits of using O_DIRECT in the > common httpd usecases. That said, I would gladly be proven wrong :) > > /Nikke > -- > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > -=-=-=-=-=-=-=- > Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | ni...@acc.umu.se > > --- > "The point is I am now a perfectly safe penguin!" -- Ford Prefect > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- > =-=-=-=-=-=-=-= >
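The fadvise pattern Niklas describes can be sketched as follows (the 8MB chunk size is from the post; `cache_friendly_copy` is an illustrative name, not the actual module code): ask the kernel to read ahead each chunk, consume it, then drop its pages from the filesystem cache so a one-pass copy doesn't evict hotter data.

```python
import io, os, tempfile

CHUNK = 8 * 1024 * 1024   # 8MB fadvise chunks, as in the original post

def cache_friendly_copy(src_path, dst):
    fd = os.open(src_path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        off = 0
        while off < size:
            n = min(CHUNK, size - off)
            os.posix_fadvise(fd, off, n, os.POSIX_FADV_WILLNEED)  # readahead
            dst.write(os.pread(fd, n, off))
            os.posix_fadvise(fd, off, n, os.POSIX_FADV_DONTNEED)  # drop pages
            off += n
    finally:
        os.close(fd)

fd, path = tempfile.mkstemp()
os.write(fd, b"a" * 1000)
os.close(fd)
out = io.BytesIO()
cache_friendly_copy(path, out)
os.unlink(path)
assert out.getvalue() == b"a" * 1000
```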
Re: allow newlines in T_ESCAPE_LOGITEM?
Of course, if you use "reliable piped logging" and also write more than PIPE_BUF bytes (4kiB on Linux) in a log line, it's not guaranteed to be atomic. I have been thinking of how to work around that limit by logging to an Apache Flume Thrift RPC port. But to avoid the 4kiB limit you'd have to put the Thrift client inside the httpd server. You'd really want a dedicated thread in each child process for the Thrift client, so that it can accumulate a batch of messages for the Flume server; otherwise, you'd be putting a lot of transaction load on Apache Flume. So the httpd threads would put their log lines on a synchronized queue consumed by the logging thread. Each child process would have its own TCP connection to Apache Flume, so there are none of the atomic-message issues you have with >PIPE_BUF messages using a pipe. It'd also be nice if Apache Thrift had a c_apr target in addition to the c_glib target, so that you don't have to bring in glib to use it. Anyway, it seemed to be some work to implement, so I think I will just live with the limit of 4kiB atomic log entries. Here is the Flume Thrift protocol:

namespace java org.apache.flume.thrift

struct ThriftFlumeEvent {
  1: required map<string, string> headers,
  2: required binary body,
}

enum Status { OK, FAILED, ERROR, UNKNOWN }

service ThriftSourceProtocol {
  Status append(1: ThriftFlumeEvent event),
  Status appendBatch(1: list<ThriftFlumeEvent> events),
}

On Mon, Apr 18, 2016 at 4:37 PM, William A Rowe Jr wrote: > That doesn't work because it's no longer atomic, those log lines end up > fanned out among all the other worker threads doing the same thing. Very > hard to parse. > > On Mon, Apr 18, 2016 at 3:08 PM, Stefan Fritsch wrote: > >> On Monday 18 April 2016 10:42:23, Eric Covener wrote: >> > I'm won over and not tinkering with this FWIW. >> >> Maybe add a utility function that splits a message on \n and calls >> ap_log_rerror for each line? >> > >
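The PIPE_BUF limit discussed above can be checked from userspace. A minimal sketch: a single write() to a pipe is only guaranteed atomic up to PIPE_BUF bytes (4096 on Linux; POSIX guarantees at least 512), so longer log lines from multiple children sharing one pipe may interleave.

```python
import os

r, w = os.pipe()
pipe_buf = os.fpathconf(w, "PC_PIPE_BUF")   # 4096 on Linux
line = b"one short log line\n"
atomic_ok = len(line) <= pipe_buf           # this write cannot be interleaved
os.write(w, line)
echoed = os.read(r, pipe_buf)
os.close(r)
os.close(w)
assert pipe_buf >= 512                      # POSIX minimum
assert atomic_ok and echoed == line
```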
Re: allow newlines in T_ESCAPE_LOGITEM?
Doing this in the httpd server will prevent use-cases of programmatically parsing the log files. I think the data in the file should support both use-cases:

- Parsing the log file data for data analysis.
- Humans reading the logs.

So, the raw data should support both use-cases. For a human reading the logs, that should be handled by an application that parses the raw data and presents it in human-readable format. Here is a simple application:

#!/usr/bin/python
import sys, fileinput
write = sys.stdout.write
separator = '-'*78 + '\n'
for line in fileinput.input():
    lines = line.decode("string_escape")
    write(lines)
    if lines.count('\n') > 1:
        write(separator)

You could also have an application with much more functionality, color-coding fields, using underline to separate records instead of dashes, whatever you want. The above application may not meet exactly what the OP wants: but that's my point: each user's use-case is different, and the raw data should support all the use-cases.

On Mon, Apr 18, 2016 at 8:28 AM, Stefan Eissing < stefan.eiss...@greenbytes.de> wrote: > Would it make sense to replace \n with \n\t? > > > Am 18.04.2016 um 14:23 schrieb Jim Jagielski: > > > > No opinion :) > > > >> On Apr 13, 2016, at 3:43 PM, Eric Covener wrote: > >> > >> Currently newlines get backslash-escaped if written to the errorlog. > >> This is via server/gen_test_char.c and stems from an ancient vuln > >> about escape sequences in log files potentially affecting peoples > >> terminals when cat'ed. > >> > >> On a few occasions I have worked with some libraries that return a > >> newline-formatted buffer that I'd love to dump to the error log > >> without scanning and splitting up into multiple log entries. > >> > >> I also occasionally see \n's from third-party mods once in a while. > >> > >> Any opinions on punching a hole for \n in T_ESCAPE_LOGITEM? > > > >
Re: allow newlines in T_ESCAPE_LOGITEM?
It's especially important when doing log processing on Apache Hadoop: if you give uncompressed text files as input files to a Hadoop job, it'd split large log files on newlines to be processed on multiple nodes. That split should be done on a record boundary.

On Wed, Apr 13, 2016 at 5:16 PM, Yann Ylavic <ylavic@gmail.com> wrote: > On Wed, Apr 13, 2016 at 11:08 PM, Eric Covener <cove...@gmail.com> wrote: > > On Wed, Apr 13, 2016 at 5:05 PM, Daniel Lescohier > > <daniel.lescoh...@cbsi.com> wrote: > >> Isn't T_ESCAPE_LOGITEM also used by mod_log_config's use of > >> ap_escape_logitem? We rely on the API that data from HTTP requests > that are > >> logged in our mod_log_config logfiles are newline-escaped, so that one > line > >> in the logfile is parsed as one log entry. Our parsers first split on > >> newline to get records, then splits the fields of the record on the > field > >> delimiter to get fields, then it unescapes the backslash-escapes to get > the > >> original data for that field. > > > > You make a good point, it couldn't change and affect current callers > > of ap_escape_logitem(). > > IMHO, even ErrorLog shouldn't contain splitted lines (w/o "[date] > [level] [pid]" prefix). >
Re: allow newlines in T_ESCAPE_LOGITEM?
Isn't T_ESCAPE_LOGITEM also used by mod_log_config's use of ap_escape_logitem? We rely on the API guarantee that data from HTTP requests logged in our mod_log_config logfiles is newline-escaped, so that one line in the logfile is parsed as one log entry. Our parsers first split on newline to get records, then split each record on the field delimiter to get fields, then unescape the backslash-escapes to get the original data for each field.

On Wed, Apr 13, 2016 at 3:43 PM, Eric Covener wrote: > Currently newlines get backslash-escaped if written to the errorlog. > This is via server/gen_test_char.c and stems from an ancient vuln > about escape sequences in log files potentially affecting peoples > terminals when cat'ed. > > On a few occasions I have worked with some libraries that return a > newline-formatted buffer that I'd love to dump to the error log > without scanning and splitting up into multiple log entries. > > I also occasionally see \n's from third-party mods once in a while. > > Any opinions on punching a hole for \n in T_ESCAPE_LOGITEM? >
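The parsing pipeline described above can be sketched as follows (the space delimiter and the sample lines are illustrative, not the actual log format): split on newline to get records, split each record on the field delimiter, then undo the backslash-escapes per field. This only works because ap_escape_logitem() keeps an embedded newline escaped inside one physical line.

```python
def parse_log(text, delim=" "):
    records = []
    for line in text.splitlines():                  # 1) newline -> records
        fields = [f.encode("ascii").decode("unicode_escape")
                  for f in line.split(delim)]       # 2) delimiter -> fields
        records.append(fields)                      # 3) unescape per field
    return records

raw = 'GET /a 200\nGET /evil\\n200 404\n'
recs = parse_log(raw)
assert recs[0] == ["GET", "/a", "200"]
assert recs[1] == ["GET", "/evil\n200", "404"]   # escaped \n stays in-field
```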
Re: Do pools lead to bad programming?
The format string is a multiplier of length. 4k repeated %Y elements in the 'a' variable's format string is 8 kbytes of format string, but the result would take 16 kbytes. Then you have things like %B, the full month name according to the current locale. You cannot really predict the length without actually doing the strftime. I figured 256 bytes is more than reasonable. Normal time strings are 32 bytes; double that, and quadruple that again (e.g. if there are utf-8 chars in the string), and you get 256 bytes. If people need more, one can use %{formatstring1}t%{formatstring2}t in LogFormat.

On Fri, Dec 13, 2013 at 10:14 AM, Yann Ylavic ylavic@gmail.com wrote: On Fri, Dec 13, 2013 at 5:06 AM, Daniel Lescohier daniel.lescoh...@cbsi.com wrote: Here is my draft replacement:

static const char *log_request_time_custom(request_rec *r, char *a,
                                           apr_time_exp_t *xt)
{
    static const apr_size_t buf_len = 256;
    apr_size_t tstr_len;
    char *tstr = apr_palloc(r->pool, buf_len);
    apr_strftime(tstr, &tstr_len, buf_len, a, xt);
    return tstr_len ? tstr : "-";
}

Shouldn't [MAX_STRING_LEN >= buf_len >= strlen(a)] for apr_strftime() not to truncate the result (for any reasonable 'a')?
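The multiplier point above (the formatted length cannot be predicted from the format string's length alone, because expansion factors differ per directive) is easy to demonstrate; a Python sketch:

```python
import time

t = time.gmtime(0)
# A 2-character directive expands to a 4-character result:
assert len(time.strftime("%Y", t)) == 4
# 8 KiB of format string produces 16 KiB of output:
long_fmt = "%Y" * 4096
assert len(time.strftime(long_fmt, t)) == 16384
# Locale-dependent directives like %B can't be sized without formatting.
```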
Re: Do pools lead to bad programming?
We can also save stack space by changing:

char server_portstr[32];

to:

char server_portstr[6];

Or if we want to future-proof against the small possibility of a new TCP standard that has larger port numbers and negative port numbers:

char server_portstr[sizeof(apr_port_t)*241/100+3]; /* log10(256) is 2.408 */

As part of my time caching work, I'm fixing a memory buffer issue in mod_log_config.c. Here is the current code:

static const char *log_request_time_custom(request_rec *r, char *a,
                                           apr_time_exp_t *xt)
{
    apr_size_t retcode;
    char tstr[MAX_STRING_LEN];
    apr_strftime(tstr, &retcode, sizeof(tstr), a, xt);
    return apr_pstrdup(r->pool, tstr);
}

This function needs the string on the heap, because the string is returned to the caller. By first using stack memory, one has to add a memory copy before returning. Also, this expands the stack by 8KiB, and stack might be a limited resource on some OSes. The final problem is that retcode isn't checked: if retcode is 0, the content of the buffer is undefined, and shouldn't be used; in theory, even 8KiB might not be a big enough buffer. Here is my draft replacement:

static const char *log_request_time_custom(request_rec *r, char *a,
                                           apr_time_exp_t *xt)
{
    static const apr_size_t buf_len = 256;
    apr_size_t tstr_len;
    char *tstr = apr_palloc(r->pool, buf_len);
    apr_strftime(tstr, &tstr_len, buf_len, a, xt);
    return tstr_len ? tstr : "-";
}

On Thu, Dec 12, 2013 at 7:37 AM, Yann Ylavic ylavic@gmail.com wrote: On Thu, Dec 12, 2013 at 1:54 AM, Kean Johnston kean.johns...@gmail.com wrote: I'd love to see these things fixed, because they add up. If you post them here they are likely to be reviewed very quickly, as they'll no doubt be simple to review. Cool. Here's a patch for the case I just mentioned. It also eliminates an un-needed automatic (yes, I obsess about stack frames too). diff against trunk. Note that in this particular case (ie. 
uri = apr_palloc(p, sizeof(*uri)) used in proxy_(ajp|fcgi|http|scgi|wstunnel)_handler), the local uri would better be declared on the stack (ie. apr_uri_t uri; and &uri used accordingly), so as to avoid the allocation completely. Regards, Yann.
Re: time caching in util_time.c and mod_log_config.c
On Wed, Dec 4, 2013 at 7:47 AM, Jim Jagielski j...@jagunet.com wrote: Also, IMO, the default should be non-portable atomics. Yes: that is the purpose of APR: having a portable API on top of non-portable implementations for each platform.
Re: time caching in util_time.c and mod_log_config.c
So it sounds like I should go ahead and work on an implementation of the time caching using apr_atomic_cas32 and apr_atomic_dec32. This won't be an issue for RHEL/CentOS/etc., because they're using old versions of httpd. We can put something in the release notes saying that for 32-bit i486, i586, i686 builds, you should build APR with --enable-nonportable-atomics, as info for distro maintainers when they package the new version.

On Wed, Dec 4, 2013 at 7:47 AM, Jim Jagielski j...@jagunet.com wrote: Adding APR dev list: IMO, httpd should expect APR to do the right thing. If APR isn't doing that, then it's an APR bug and needs to be fixed/addressed within APR. All this implies that the atomics code in APR needs a serious review and update. We should also look into leveraging what we can from stdcxx (https://svn.apache.org/repos/asf/stdcxx/trunk/src) as well as OPA (https://trac.mcs.anl.gov/projects/openpa/wiki/FAQ). Also, IMO, the default should be non-portable atomics.

On Dec 3, 2013, at 7:41 PM, Daniel Lescohier daniel.lescoh...@cbsi.com wrote: So I think we should reach a consensus on what approach to take. My goal was to implement an algorithm that is correct, with code that is easy to maintain. I think using the apr_atomics functions meets those goals the best. The downside is for systems running 32-bit i486, i586, i686 builds where the default APR configure setting for atomics was not overridden. There may be i686 servers still out there using a 32-bit web server, probably memory-constrained systems like VPS hosts; the question is whether they have overridden the APR default configuration or not. Should we hold back on fixing this because of these systems? If we go forward, should there be something in the release notes warning of this APR configuration issue? 
On Tue, Dec 3, 2013 at 7:15 PM, Daniel Lescohier daniel.lescoh...@cbsi.com wrote: (continued, hit send too early)

%ix86 i386 i486 i586 i686 pentium3 pentium4 athlon geode

However, I looked at the CentOS 6 apr.spec, and it's not overriding the default.
Re: time caching in util_time.c and mod_log_config.c
I infer that you think that in this particular case, the function macro makes the code more readable and maintainable. I don't think one can define an absolute rule, it's a judgment call whether a macro increases or decreases clarity and maintainability. It reminds me of 'goto': most of the time, one shouldn't use it, but there are cases where it's better than the alternative. On Tue, Dec 3, 2013 at 10:01 AM, Jim Jagielski j...@jagunet.com wrote: At first blush, the below looks very workable! On a slightly off-topic topic, however, I wonder if macros are the way to go. Long long ago function calls were really really expensive. Does the decreased clarity (and debugging capability) really offset the saved cycles? Agreed that it's not fully applicable below. On Dec 2, 2013, at 6:13 PM, Daniel Lescohier daniel.lescoh...@cbsinteractive.com wrote: The current time caching implementation in util_time.c and mod_log_config.c is not guaranteed to work with all compilers and cpu architectures. I have some proposed code I've written, that I want to get input from the mailing list on, before continuing on to compile and test it. The old implementation tries to do a wait-free algorithm, but incorrectly, because the algorithm relies on total memory ordering, and the implementation does not guarantee total memory ordering from the compiler or CPU. My proposal is to use the standard and portable apr_thread_mutex_trylock for enforcing the memory barriers. For APR's Linux x86_64 implementation, this basically turns into a lock: cmpxchgl instruction, and a lock: decl instruction for the unlock. Because only trylock is used, not lock, the futex is never spun and arbitrated on by the kernel, it's all done in userspace (if there's contention, each thread just calculates the value itself instead of reading from the cache). So the replacement is also a wait-free algorithm, using the standard and portable apr_thread_mutex calls. 
Also, the old algorithm does two memcpy operations when reading from the cache, while the new algorithm does just one. For APR_HAS_THREADS==0, or a non-threaded mpm in use, no locking is done. The first part of the code is generic time caching code I've written in util_time_caching.h. I use these macros to create five different time caches; excerpt w/o boilerplate:

---
#define TIME_CACHE(CACHE_T, CACHE_PTR, CACHE_SIZE_POWER)\
    static CACHE_T CACHE_PTR[1<<CACHE_SIZE_POWER];

#define TIME_CACHE_FUNC_PTR(FUNC_PTR, VALUE_T, CALC_FUNC)\
    static apr_status_t (*FUNC_PTR)(VALUE_T *, apr_time_t) = CALC_FUNC;

/* TIME_CACHE_FUNCTION macro:
 * The cache is implemented as a ring buffer. The key in the
 * cache element indicates which second the cache value is for.
 * We use a wait-free algorithm: if we cannot access the cache,
 * we just calculate the value, doing useful work, instead of
 * spinning trying to access the cache.
 * We only update the cache with newer times, because older times
 * should have a lower cache hit ratio, and we want to keep the
 * caches small to fit in the CPU L1/L2 caches.
 */
#define TIME_CACHE_FUNCTION(FUNC_NAME, VALUE_T, VALUE_SIZE, \
                            CACHE_T, CACHE_PTR, CACHE_SIZE_POWER, \
                            CALC_FUNC, TRY_LOCK, RELEASE_LOCK, AFTER_READ_WORK)\
static apr_status_t FUNC_NAME(VALUE_T *value, apr_time_t t)\
{\
    const apr_int64_t seconds = apr_time_sec(t);\
    apr_status_t status;\
    CACHE_T * const cache_element = \
        &(CACHE_PTR[seconds & ((1<<CACHE_SIZE_POWER)-1)]);\
    /* seconds==0 can be confused with uninitialized cache; don't use cache */\
    if (seconds==0) return CALC_FUNC(value, t);\
    if (TRY_LOCK) {\
        if (seconds == cache_element->key) {\
            memcpy(value, &cache_element->value, VALUE_SIZE);\
            RELEASE_LOCK;\
            AFTER_READ_WORK;\
            return APR_SUCCESS;\
        }\
        if (seconds < cache_element->key) {\
            RELEASE_LOCK;\
            return CALC_FUNC(value, t);\
        }\
        RELEASE_LOCK;\
    }\
    status = CALC_FUNC(value, t);\
    if (status == APR_SUCCESS) {\
        if (TRY_LOCK) {\
            if (seconds > cache_element->key) {\
                cache_element->key = seconds;\
                memcpy(&cache_element->value, value, VALUE_SIZE);\
            }\
            RELEASE_LOCK;\
        }\
    }\
    return status;\
}
--

Here is an example of implementing one of the caches in util_time.c:

--
/* Have small sized caches, to stay in CPU's L1/L2 caches.
 * There should be few requests older than 3 seconds, so set
 * cache size power to 2.
 */
#define TIME_CACHE_SIZE_POWER 2

typedef struct {
    apr_int64_t key;
    apr_time_exp_t value;
} explode_time_cache_t;

TIME_CACHE(explode_time_cache_t
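The wait-free trylock strategy described above can be sketched in Python (a simplified illustration of the proposal, not the actual C macro; `threading.Lock.acquire(blocking=False)` plays the role of apr_thread_mutex_trylock): on contention we never wait, we just compute the value directly, because the cache is only an optimization.

```python
import threading, time

class TimeCache:
    def __init__(self, size_power=2):
        self.mask = (1 << size_power) - 1       # ring buffer indexed by second
        self.slots = [{"key": 0, "value": None, "lock": threading.Lock()}
                      for _ in range(1 << size_power)]

    def get(self, t, calc):
        seconds = int(t)
        if seconds == 0:                        # 0 looks like an empty slot
            return calc(t)
        slot = self.slots[seconds & self.mask]
        if slot["lock"].acquire(blocking=False):    # trylock: never blocks
            try:
                if slot["key"] == seconds:
                    return slot["value"]            # cache hit
            finally:
                slot["lock"].release()
        value = calc(t)                             # miss or contention
        if slot["lock"].acquire(blocking=False):
            try:
                if seconds > slot["key"]:           # only cache newer times
                    slot["key"], slot["value"] = seconds, value
            finally:
                slot["lock"].release()
        return value

cache = TimeCache()
calls = []
def calc(t):
    calls.append(t)
    return time.gmtime(int(t))

first = cache.get(1000.5, calc)
second = cache.get(1000.5, calc)
assert first == second
assert len(calls) == 1    # the second lookup was served from the cache
```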
Re: time caching in util_time.c and mod_log_config.c
If the developers list is OK using apr_atomic in the server core, there would be lots of advantages over trylock:

1. No need for child init.
2. No need for function pointers.
3. Could have a lock per cache element (I deemed it too expensive memory-wise to have a large mutex structure per cache element).
4. It would avoid the problem of trylock not being implemented on all platforms.
5. Fewer parameters to the function macro.

The code would be like this:

#define TIME_CACHE_FUNCTION(VALUE_SIZE, CACHE_T, CACHE_PTR, CACHE_SIZE_POWER,\
                            CALC_FUNC, AFTER_READ_WORK\
                            )\
    const apr_int64_t seconds = apr_time_sec(t);\
    apr_status_t status;\
    CACHE_T * const cache_element = \
        &(CACHE_PTR[seconds & ((1<<CACHE_SIZE_POWER)-1)]);\
    /* seconds==0 can be confused with uninitialized cache; don't use cache */\
    if (seconds==0) return CALC_FUNC(value, t);\
    if (apr_atomic_cas32(&cache_element->lock, 1, 0)==0) {\
        if (seconds == cache_element->key) {\
            memcpy(value, &cache_element->value, VALUE_SIZE);\
            apr_atomic_dec32(&cache_element->lock);\
            AFTER_READ_WORK;\
            return APR_SUCCESS;\
        }\
        if (seconds < cache_element->key) {\
            apr_atomic_dec32(&cache_element->lock);\
            return CALC_FUNC(value, t);\
        }\
        apr_atomic_dec32(&cache_element->lock);\
    }\
    status = CALC_FUNC(value, t);\
    if (status == APR_SUCCESS) {\
        if (apr_atomic_cas32(&cache_element->lock, 1, 0)==0) {\
            if (seconds > cache_element->key) {\
                cache_element->key = seconds;\
                memcpy(&cache_element->value, value, VALUE_SIZE);\
            }\
            apr_atomic_dec32(&cache_element->lock);\
        }\
    }\
    return status;
--

typedef struct {
    apr_int64_t key;
    apr_uint32_t lock;
    apr_time_exp_t value;
} explode_time_cache_t;

TIME_CACHE(explode_time_cache_t, explode_time_lt_cache, TIME_CACHE_SIZE_POWER)

AP_DECLARE(apr_status_t) ap_explode_recent_localtime(
    apr_time_exp_t * value, apr_time_t t)
{
    TIME_CACHE_FUNCTION(
        sizeof(apr_time_exp_t), explode_time_cache_t, explode_time_lt_cache,
        TIME_CACHE_SIZE_POWER, apr_time_exp_lt,
        value->tm_usec = (apr_int32_t) apr_time_usec(t))
}

On Tue, Dec 3, 2013 at 10:53 AM, Daniel Lescohier 
daniel.lescoh...@cbsi.com wrote: I infer that you think that in this particular case, the function macro makes the code more readable and maintainable. I don't think one can define an absolute rule, it's a judgment call whether a macro increases or decreases clarity and maintainability. It reminds me of 'goto': most of the time, one shouldn't use it, but there are cases where it's better than the alternative. On Tue, Dec 3, 2013 at 10:01 AM, Jim Jagielski j...@jagunet.com wrote: At first blush, the below looks very workable! On a slightly off-topic topic, however, I wonder if macros are the way to go. Long long ago function calls were really really expensive. Does the decreased clarity (and debugging capability) really offset the saved cycles? Agreed that it's not fully applicable below. On Dec 2, 2013, at 6:13 PM, Daniel Lescohier daniel.lescoh...@cbsinteractive.com wrote: The current time caching implementation in util_time.c and mod_log_config.c is not guaranteed to work with all compilers and cpu architectures. I have some proposed code I've written, that I want to get input from the mailing list on, before continuing on to compile and test it. The old implementation tries to do a wait-free algorithm, but incorrectly, because the algorithm relies on total memory ordering, and the implementation does not guarantee total memory ordering from the compiler or CPU. My proposal is to use the standard and portable apr_thread_mutex_trylock for enforcing the memory barriers. For APR's Linux x86_64 implementation, this basically turns into a lock: cmpxchgl instruction, and a lock: decl instruction for the unlock. Because only trylock is used, not lock, the futex is never spun and arbitrated on by the kernel, it's all done in userspace (if there's contention, each thread just calculates the value itself instead of reading from the cache). So the replacement is also a wait-free algorithm, using the standard and portable apr_thread_mutex calls. 
Also, the old algorithm does two memcpy operations when reading from the cache, while the new algorithm does just one. For APR_HAS_THREADS==0 or a non-threaded mpm in use, no locking is done. The first part of the code is generic time caching code I've written in util_time_caching.h. I use these macros to create five different time caches; excerpt w/o boilerplate:

---
#define TIME_CACHE(CACHE_T, CACHE_PTR, CACHE_SIZE_POWER)\
    static CACHE_T CACHE_PTR[1<<CACHE_SIZE_POWER];

#define TIME_CACHE_FUNC_PTR(FUNC_PTR, VALUE_T, CALC_FUNC
Re: time caching in util_time.c and mod_log_config.c
I took a look at apr's configure.in, and its default for all architectures except i486, i586, and i686 is to use the real atomic ops, but for those three architectures the default is to use the generic atomic ops. Any idea why there is a special rule for those three architectures? There's nothing wrong with the atomic operations on those three architectures: otherwise, how have we had semaphores and mutexes for all these years on those CPUs? I guess that is a question for the APR dev mailing list. I see that some distros override that default. E.g., the libapr1.spec for openSUSE has:

%ifarch %ix86
--enable-nonportable-atomics=yes \
%endif

and in /usr/lib/rpm/macros: On Tue, Dec 3, 2013 at 12:54 PM, Yann Ylavic ylavic@gmail.com wrote: I personally like this solution better (IMHO) since it does not rely on apr_thread_mutex_trylock() to be wait-free/userspace (eg. natively implements the compare-and-swap). On the other hand, apr_atomic_cas32() may itself be implemented using apr_thread_mutex_lock() when USE_ATOMICS_GENERIC is defined (explicitly, or with --enable-nonportable-atomics=no, or else forcibly with gcc -std=c89 or intel cpus <= i686). Hence with USE_ATOMICS_GENERIC, apr_thread_mutex_trylock() may be a better solution than the apr_thread_mutex_lock()... On Tue, Dec 3, 2013 at 6:01 PM, Daniel Lescohier daniel.lescoh...@cbsi.com wrote: If the developers list is OK using apr_atomic in the server core, there would be lots of advantages over trylock:
1. No need for child init.
2. No need for function pointers.
3. Could have a lock per cache element (I deemed it too expensive memory-wise to have a large mutex structure per cache element).
4. It would avoid the problem of trylock not being implemented on all platforms.
5. Fewer parameters to the function macro.
The code would be like this:

#define TIME_CACHE_FUNCTION(VALUE_SIZE, CACHE_T, CACHE_PTR, CACHE_SIZE_POWER,\
                            CALC_FUNC, AFTER_READ_WORK\
                            )\
    const apr_int64_t seconds = apr_time_sec(t);\
    apr_status_t status;\
    CACHE_T * const cache_element = \
        &(CACHE_PTR[seconds & ((1 << CACHE_SIZE_POWER) - 1)]);\
    /* seconds==0 can be confused with uninitialized cache; don't use cache */\
    if (seconds==0) return CALC_FUNC(value, t);\
    if (apr_atomic_cas32(&cache_element->lock, 1, 0)==0) {\
        if (seconds == cache_element->key) {\
            memcpy(value, &cache_element->value, VALUE_SIZE);\
            apr_atomic_dec32(&cache_element->lock);\
            AFTER_READ_WORK;\
            return APR_SUCCESS;\
        }\
        if (seconds < cache_element->key) {\
            apr_atomic_dec32(&cache_element->lock);\
            return CALC_FUNC(value, t);\
        }\
        apr_atomic_dec32(&cache_element->lock);\
    }\
    status = CALC_FUNC(value, t);\
    if (status == APR_SUCCESS) {\
        if (apr_atomic_cas32(&cache_element->lock, 1, 0)==0) {\
            if (seconds > cache_element->key) {\
                cache_element->key = seconds;\
                memcpy(&cache_element->value, value, VALUE_SIZE);\
            }\
            apr_atomic_dec32(&cache_element->lock);\
        }\
    }\
    return status;

--

typedef struct {
    apr_int64_t key;
    apr_uint32_t lock;
    apr_time_exp_t value;
} explode_time_cache_t;

TIME_CACHE(explode_time_cache_t, explode_time_lt_cache, TIME_CACHE_SIZE_POWER)

AP_DECLARE(apr_status_t) ap_explode_recent_localtime(
    apr_time_exp_t * value, apr_time_t t)
{
    TIME_CACHE_FUNCTION(
        sizeof(apr_time_exp_t), explode_time_cache_t, explode_time_lt_cache,
        TIME_CACHE_SIZE_POWER, apr_time_exp_lt,
        value->tm_usec = (apr_int32_t) apr_time_usec(t))
}
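The per-element CAS lock sketched above can be illustrated with a self-contained C11 analogue (stdatomic in place of apr_atomic_cas32/apr_atomic_dec32, and an int payload in place of apr_time_exp_t; all names here are illustrative, not the proposed httpd code):

```c
#include <stdatomic.h>

#define CACHE_SIZE_POWER 2

typedef struct {
    long key;            /* the second this entry caches; 0 = empty */
    atomic_uint lock;    /* 0 = unlocked, 1 = locked */
    int value;           /* stand-in for apr_time_exp_t */
} cache_entry_t;

static cache_entry_t cache[1 << CACHE_SIZE_POWER];

static int calc(long seconds) { return (int)(seconds * 2); } /* stand-in for CALC_FUNC */

int cached_lookup(long seconds)
{
    cache_entry_t *e = &cache[seconds & ((1 << CACHE_SIZE_POWER) - 1)];
    unsigned expected = 0;
    int v;

    if (seconds == 0)                       /* 0 looks like an empty slot */
        return calc(seconds);
    if (atomic_compare_exchange_strong(&e->lock, &expected, 1)) {
        if (e->key == seconds) {            /* hit: copy out under the lock */
            v = e->value;
            atomic_store(&e->lock, 0);
            return v;
        }
        atomic_store(&e->lock, 0);
    }
    v = calc(seconds);                      /* miss or contention: do the work */
    expected = 0;
    if (atomic_compare_exchange_strong(&e->lock, &expected, 1)) {
        if (seconds > e->key) {             /* only replace with newer times */
            e->key = seconds;
            e->value = v;
        }
        atomic_store(&e->lock, 0);
    }
    return v;
}
```

As in the proposal, a thread that loses the race to lock an entry simply computes the value itself instead of waiting, which is what makes the scheme wait-free.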
Re: time caching in util_time.c and mod_log_config.c
(continued, hit send too early) %ix86 i386 i486 i586 i686 pentium3 pentium4 athlon geode However, I looked at the CentOS 6 apr.spec, and it's not overriding the default.
Re: time caching in util_time.c and mod_log_config.c
So I think we should reach a consensus on what approach to take. My goal was to implement an algorithm that is correct, with code that is easy to maintain. I think using the apr_atomic functions meets those goals best. The downside is for those systems running 32-bit i486, i586, or i686 where the default APR configure setting for atomics was not overridden. There may be i686 servers still out there running a 32-bit web server, probably memory-constrained systems like VPS hosts; the question is whether they have overridden the APR default configuration or not. Should we hold back on fixing this because of these systems? If we go forward, should there be something in the release notes warning of this APR configuration issue? On Tue, Dec 3, 2013 at 7:15 PM, Daniel Lescohier daniel.lescoh...@cbsi.com wrote: (continued, hit send too early) %ix86 i386 i486 i586 i686 pentium3 pentium4 athlon geode However, I looked at the CentOS 6 apr.spec, and it's not overriding the default.
time caching in util_time.c and mod_log_config.c
The current time-caching implementation in util_time.c and mod_log_config.c is not guaranteed to work with all compilers and CPU architectures. I have some proposed code I've written that I want to get input on from the mailing list before continuing on to compile and test it. The old implementation tries to do a wait-free algorithm, but incorrectly: the algorithm relies on total memory ordering, and the implementation does not guarantee total memory ordering from the compiler or CPU. My proposal is to use the standard and portable apr_thread_mutex_trylock for enforcing the memory barriers. For APR's Linux x86_64 implementation, this basically turns into a lock-prefixed cmpxchgl instruction, and a lock-prefixed decl instruction for the unlock. Because only trylock is used, not lock, the futex is never spun on or arbitrated by the kernel; it's all done in userspace (if there's contention, each thread just calculates the value itself instead of reading from the cache). So the replacement is also a wait-free algorithm, using the standard and portable apr_thread_mutex calls. Also, the old algorithm does two memcpy operations when reading from the cache, while the new algorithm just does one. For APR_HAS_THREADS==0 or a non-threaded mpm in use, no locking is done. The first part of the code is generic time-caching code I've written in util_time_caching.h. I use these macros to create five different time caches; excerpt w/o boilerplate:

---

#define TIME_CACHE(CACHE_T, CACHE_PTR, CACHE_SIZE_POWER)\
static CACHE_T CACHE_PTR[1 << CACHE_SIZE_POWER];

#define TIME_CACHE_FUNC_PTR(FUNC_PTR, VALUE_T, CALC_FUNC)\
static apr_status_t (*FUNC_PTR)(VALUE_T *, apr_time_t) = CALC_FUNC;

/* TIME_CACHE_FUNCTION macro:
 * The cache is implemented as a ring buffer. The key in the
 * cache element indicates which second the cache value is for.
 * We use a wait-free algorithm: if we cannot access the cache,
 * we just calculate the value, doing useful work, instead of
 * spinning trying to access the cache.
 * We only update the cache with newer times, because older times
 * should have a lower cache hit ratio, and we want to keep the
 * caches small to fit in the CPU L1/L2 caches.
 */
#define TIME_CACHE_FUNCTION(FUNC_NAME, VALUE_T, VALUE_SIZE, \
                            CACHE_T, CACHE_PTR, CACHE_SIZE_POWER, \
                            CALC_FUNC, TRY_LOCK, RELEASE_LOCK, AFTER_READ_WORK)\
static apr_status_t FUNC_NAME(VALUE_T *value, apr_time_t t)\
{\
    const apr_int64_t seconds = apr_time_sec(t);\
    apr_status_t status;\
    CACHE_T * const cache_element = \
        &(CACHE_PTR[seconds & ((1 << CACHE_SIZE_POWER) - 1)]);\
    /* seconds==0 can be confused with uninitialized cache; don't use cache */\
    if (seconds==0) return CALC_FUNC(value, t);\
    if (TRY_LOCK) {\
        if (seconds == cache_element->key) {\
            memcpy(value, &cache_element->value, VALUE_SIZE);\
            RELEASE_LOCK;\
            AFTER_READ_WORK;\
            return APR_SUCCESS;\
        }\
        if (seconds < cache_element->key) {\
            RELEASE_LOCK;\
            return CALC_FUNC(value, t);\
        }\
        RELEASE_LOCK;\
    }\
    status = CALC_FUNC(value, t);\
    if (status == APR_SUCCESS) {\
        if (TRY_LOCK) {\
            if (seconds > cache_element->key) {\
                cache_element->key = seconds;\
                memcpy(&cache_element->value, value, VALUE_SIZE);\
            }\
            RELEASE_LOCK;\
        }\
    }\
    return status;\
}

--

Here is an example of implementing one of the caches in util_time.c:

--

/* Have small sized caches, to stay in CPU's L1/L2 caches.
 * There should be few requests older than 3 seconds, so set
 * cache size power to 2.
 */
#define TIME_CACHE_SIZE_POWER 2

typedef struct {
    apr_int64_t key;
    apr_time_exp_t value;
} explode_time_cache_t;

TIME_CACHE(explode_time_cache_t, explode_time_lt_cache, TIME_CACHE_SIZE_POWER)
TIME_CACHE_FUNC_PTR(explode_time_lt_ptr, apr_time_exp_t, apr_time_exp_lt)
TIME_CACHE_FUNCTION(
    explode_time_lt_nolocking, apr_time_exp_t, sizeof(apr_time_exp_t),
    explode_time_cache_t, explode_time_lt_cache, TIME_CACHE_SIZE_POWER,
    apr_time_exp_lt, 1,,
    value->tm_usec = (apr_int32_t) apr_time_usec(t))

#if APR_HAS_THREADS
static apr_thread_mutex_t *explode_time_lt_lock;
TIME_CACHE_FUNCTION(
    explode_time_lt_locking, apr_time_exp_t, sizeof(apr_time_exp_t),
    explode_time_cache_t, explode_time_lt_cache, TIME_CACHE_SIZE_POWER,
    apr_time_exp_lt,
    apr_thread_mutex_trylock(explode_time_lt_lock)==APR_SUCCESS,
    apr_thread_mutex_unlock(explode_time_lt_lock),
    value->tm_usec = (apr_int32_t) apr_time_usec(t))
#endif

AP_DECLARE(apr_status_t) ap_explode_recent_localtime(apr_time_exp_t * tm,
                                                     apr_time_t t)
{
    return explode_time_lt_ptr(tm, t);
}

--
Re: mod_autoindex string pluggability
output_directories seems to be HTML-specific code. Why not implement a brand-new output_directories_json, and have the index_directories() function choose which function to call based on query args? Anyway, index_directories() has to change to send the correct Content-Type header, and to skip the emit_head and emit_tail function calls. On Mon, Aug 5, 2013 at 2:26 AM, Sven Dowideit svendowid...@home.org.au wrote: Hello Everyone, I'm scratching an itch to make mod_autoindex output what I want, and would love to know what, if anything, would make the changes merge-able. In its simplest form, I'd like apache to be able to give me an index in JSON format - previously, I've parsed the html in javascript, but somehow I think I can do better. While I was reading the code (today) it occurred to me that there are a lot of if statements throughout, which could be eliminated by moving (obscuring) the output strings to a switchable container (right now I'm using arrays of strings for my simplicity - I don't know this codebase any better than you know me :) so here is the kind of thing I started to do (partial diff, it's all really kind of dull - I've extracted the HTML/XHTML strings into another similarly replaceable array):

#define STRING_INDEX_START 0
#define STRING_INDEX_END 1

const char *table_index_string[] = {
    "<table>\n <tr>",
    "</table>\n"
};
const char *table_index_string_stylesheet[] = {
    "<table id=\"indexlist\">\n <tr class=\"indexhead\">",
    "</table>\n"
};
const char *fancy_index_string[] = {
    "<pre>",
    "</pre>\n"
};
const char *default_index_string[] = {
    "<ul>",
    "</ul>\n"
};
/* set the default string set (choose alternatives based on user conf options) */
const char **index_string = default_index_string;

@@ -1873,23 +1872,14 @@ static void output_directories(struct ent **ar, int n,
         }
     }
     if (autoindex_opts & TABLE_INDEXING) {
-        ap_rvputs(r, breakrow, "</table>\n", NULL);
+        ap_rputs(breakrow, r);
     }
     else if (autoindex_opts & FANCY_INDEXING) {
         if (!(autoindex_opts & SUPPRESS_RULES)) {
-            ap_rputs("<hr", r);
-            if (autoindex_opts & EMIT_XHTML) {
-                ap_rputs(" /", r);
-            }
-            ap_rputs("</pre>\n", r);
-        }
-        else {
-            ap_rputs("</pre>\n", r);
+            ap_rputs(output_string[STRING_HR], r);
         }
     }
-    else {
-        ap_rputs("</ul>\n", r);
-    }
+    ap_rputs(index_string[STRING_INDEX_END], r);
 }

Cheers Sven
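The switchable string-table idea in Sven's diff can be sketched in a self-contained form: pick the fragment array once from the requested format, then emit by index instead of branching at every output site. The names and the JSON strings here are hypothetical, not mod_autoindex's actual tables:

```c
#include <string.h>

enum { STRING_INDEX_START, STRING_INDEX_END, NUM_STRINGS };

static const char *html_strings[NUM_STRINGS] = { "<ul>\n", "</ul>\n" };
static const char *json_strings[NUM_STRINGS] = { "[\n",    "]\n"    };

/* Choose the fragment table once, based on the requested output format,
 * instead of scattering if (format == ...) tests through the emitter. */
static const char **select_strings(const char *fmt)
{
    return (fmt && strcmp(fmt, "json") == 0) ? json_strings : html_strings;
}
```

The emitter then only ever writes `strings[STRING_INDEX_START]` / `strings[STRING_INDEX_END]`, so adding a new output format means adding one table, not touching every branch.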
Re: Decrypting mod_session-created cookie
Looking at the code, the encoded data block does not have a version-number field in it, so the data format and general algorithm cannot be easily changed in the future. The encoded data seems to consist of a base64 encoding of:
- the salt (fixed length of sizeof(apr_uuid_t))
- the initialization vector
- the encrypted data
But to understand the algorithm, one would have to document what all the apr_crypto_* functions do in the module's encrypt/decrypt functions. For example, apr_crypto_passphrase takes the passphrase and the salt as parameters (along with other parameters such as the block-chaining mode [which mod_session_crypto has fixed to CBC] and the number of iterations [which mod_session_crypto has fixed to 4096]). But what is done with them when they're passed to the APR library? One would have to look at the implementation of that function to find out. Taking a quick look at the apr source, for the openssl implementation, apr_crypto_passphrase uses PKCS5_PBKDF2_HMAC_SHA1 to generate the key from the passphrase and salt. So it doesn't look like it's a quick and easy thing to document or re-implement. On Tue, Jul 9, 2013 at 9:01 AM, André Malo n...@perlig.de wrote: On Tuesday 09 July 2013 14:51:49 Mikhail T. wrote: 09.07.2013 08:31, Eric Covener wrote: I am not surprised that separately documenting is not a priority for anyone, given it's implemented in httpd for anyone to see. Source code is not documentation, is it? In matters of encryption especially, the methods chosen should be documented better -- not only for interoperability, but also to allow to better judge the strength of the method (not that I personally am capable of the latter). I agree. And everybody is welcome to help fix that issue. nd
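Based on the layout described above (salt of sizeof(apr_uuid_t) bytes, then the IV, then the ciphertext), a decoder would slice the base64-decoded blob like this. The 16-byte sizes are assumptions (apr_uuid_t is 16 bytes, and AES-CBC uses one 16-byte block as the IV), and the helper name is made up for illustration:

```c
#include <stddef.h>

#define SALT_LEN 16   /* sizeof(apr_uuid_t) */
#define IV_LEN   16   /* AES block size, for the CBC modes used here */

typedef struct {
    const unsigned char *salt;        /* fed to PBKDF2 with the passphrase */
    const unsigned char *iv;
    const unsigned char *ciphertext;
    size_t ciphertext_len;
} session_blob_t;

/* Split a base64-decoded mod_session_crypto blob into its parts.
 * Returns 0 on success, -1 if the buffer is too short to contain any
 * ciphertext after the salt and IV. */
int split_session_blob(const unsigned char *buf, size_t len, session_blob_t *out)
{
    if (len <= SALT_LEN + IV_LEN)
        return -1;
    out->salt = buf;
    out->iv = buf + SALT_LEN;
    out->ciphertext = buf + SALT_LEN + IV_LEN;
    out->ciphertext_len = len - SALT_LEN - IV_LEN;
    return 0;
}
```

An external decrypter would then run PBKDF2-HMAC-SHA1 over (passphrase, salt, 4096 iterations) to derive the key, and AES-CBC-decrypt the ciphertext with that key and the IV.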
Re: Decrypting mod_session-created cookie
https://httpd.apache.org/docs/2.4/mod/mod_session.html#sessionprivacy "The session will be automatically decrypted on load, and encrypted on save by Apache; the underlying application using the session need have no knowledge that encryption is taking place." On Mon, Jul 8, 2013 at 6:58 PM, Mikhail T. mi+t...@aldan.algebra.com wrote: From PHP I need to be able to set and read the session cookies created by mod_session_crypto. Suppose I know the SessionCryptoCipher (aes256, the default) and the SessionCryptoPassphrase; how can I decrypt the session cookie? Its value is available to PHP as $_REQUEST['session']... I have both the openssl and mcrypt PHP extensions available on the server, but the straightforward approach of $decrypted = openssl_decrypt($_REQUEST['session'], 'aes256', $password); is failing... Thank you! Yours, -mi
Re: Decrypting mod_session-created cookie
mod_session_crypto.c adds a salt (from calling apr_uuid_get) to the data when encrypting it. Without a salt, the encryption wouldn't be as strong. Perhaps your decryption code isn't handling the salt? On Mon, Jul 8, 2013 at 7:29 PM, Graham Leggett minf...@sharp.fm wrote: On 9 Jul 2013, at 00:11, Daniel Lescohier daniel.lescoh...@cbsi.com wrote: https://httpd.apache.org/docs/2.4/mod/mod_session.html#sessionprivacy The session will be automatically decrypted on load, and encrypted on save by Apache, the underlying application using the session need have no knowledge that encryption is taking place. See also the section on integrating with external applications. https://httpd.apache.org/docs/2.4/mod/mod_session.html#integration Regards, Graham --
Re: Decrypting mod_session-created cookie
You could perhaps also set up Apache as a reverse proxy in front of the other application, so Apache will decrypt the session before proxying the request to the other application. On Mon, Jul 8, 2013 at 7:33 PM, Daniel Lescohier daniel.lescoh...@cbsi.com wrote: The mod_session_crypto.c adds a salt (from calling apr_uuid_get) to the data when encrypting it. Without a salt, the encryption wouldn't be that strong. Perhaps your decryption code isn't handling the salt? On Mon, Jul 8, 2013 at 7:29 PM, Graham Leggett minf...@sharp.fm wrote: On 9 Jul 2013, at 00:11, Daniel Lescohier daniel.lescoh...@cbsi.com wrote: https://httpd.apache.org/docs/2.4/mod/mod_session.html#sessionprivacy The session will be automatically decrypted on load, and encrypted on save by Apache, the underlying application using the session need have no knowledge that encryption is taking place. See also the section on integrating with external applications. https://httpd.apache.org/docs/2.4/mod/mod_session.html#integration Regards, Graham --
Re: [PATCH] mod_unique_id: use ap_random_insecure_bytes() to get unique ID
Another option:

typedef struct {
    apr_uint32_t stamp;
    apr_uint32_t counter;
    apr_uint16_t stamp_fraction;
    char root[ROOT_SIZE];
} unique_id_rec;

where ROOT_SIZE=8, and stamp_fraction is set on every request to htons(r->request_time & 0xFFFF). The child will be initialized with 12 bytes of random data (root + counter). If one is not as confident of having a good random source, then it's better to also include the fractional second: there will be few requests happening in the same 1/65,536th of a second of the wall clock. If there are multiple requests in that fraction of a second, and the multiple requests are in the same child process, then the atomic increment (apr_atomic_inc32) of the counter alone will guarantee different ids; otherwise, for those few requests in that fraction of a second in multiple processes (on the same or different servers), the randomness initialized by the child in root+counter will be used to make different ids. On Sat, Jul 6, 2013 at 9:38 AM, Daniel Lescohier daniel.lescoh...@cbsi.com wrote: Ah, I missed that. I see it's in the doxygen docs for the random module. However, the sources aren't under random/, they're under misc/. I was switching between the doxygen docs and looking at the sources; maybe when I was looking for it, I missed it because the sources were under misc/. For the standard module: you can keep the unique id size the same, but still have a root of 10 bytes, by getting rid of the thread_index: thread_index is very wasteful, using 32 bits, when only a few bits are used. Change:

typedef struct {
    unsigned int stamp;
    unsigned int in_addr;
    unsigned int pid;
    unsigned short counter;
    unsigned int thread_index;
} unique_id_rec;

to:

typedef struct {
    apr_uint32_t stamp;
    apr_uint32_t counter;
    char root[ROOT_SIZE];
} unique_id_rec;

Have the two ints first for alignment purposes, so there is no padding in the struct. With the counter field using apr_uint32_t, you can use apr_atomic_inc32() to do the increments.
There is less need of thread_index if you increment the counter atomically. The initialization of the counter with random data also gives you more process randomness than the 10 bytes of root. On Fri, Jul 5, 2013 at 8:04 PM, Stefan Fritsch s...@sfritsch.de wrote: On Wednesday 26 June 2013, Daniel Lescohier wrote: When I looked into the ap random functions, I didn't like the implementation, because I didn't see anywhere in the httpd codebase that entropy is periodically added to the entropy pool. After reading the details of how the Linux entropy pool works (https://lwn.net/Articles/525204/), I decided to use /dev/urandom instead, since Linux is periodically adding entropy to it. This code is not portable, but this was for a private Apache module that is only used on Linux. To preserve entropy on the web server machine, I also only generate a random number once per apache child, then increment a uint32 portion of it for each unique id call. I also have seconds and microseconds, so that's why I think it's OK to do increments from the random base, instead of generating a new random id on each request. The "insecure" in ap_random_insecure_bytes is there for a reason. But if you only use it once per process anyway, it should be sufficient. The fact that several consumers (especially with multi-threaded mpms) pull from the same pool in undefined order adds some entropy, too. FWIW, there is apr_generate_random_bytes() which can do the reading of /dev/urandom for you.
Re: [PATCH] mod_unique_id: use ap_random_insecure_bytes() to get unique ID
Actually, for stamp_fraction, one should choose either:

r->request_time & 0xFFFF /* microseconds modulo 65536 */

or:

apr_uint16_t ui16;
htons(ui16 = apr_time_usec(r->request_time) >> 4) /* fraction of a second with denominator 62500 */

On Sun, Jul 7, 2013 at 7:32 AM, Daniel Lescohier daniel.lescoh...@cbsi.com wrote: Another option:

typedef struct {
    apr_uint32_t stamp;
    apr_uint32_t counter;
    apr_uint16_t stamp_fraction;
    char root[ROOT_SIZE];
} unique_id_rec;

where ROOT_SIZE=8, and stamp_fraction is set on every request to htons(r->request_time & 0xFFFF). The child will be initialized with 12 bytes of random data (root + counter). If one is not as confident of having a good random source, then it's better to also include the fractional second: there will be few requests happening in the same 1/65,536th of a second of the wall clock; if there are multiple requests in that fraction of a second, and the multiple requests are in the same child process, then the atomic increment (apr_atomic_inc32) of the counter alone will guarantee different ids; otherwise, for those few requests in that fraction of a second in multiple processes (on the same or different servers), the randomness initialized by the child in root+counter will be used to make different ids. On Sat, Jul 6, 2013 at 9:38 AM, Daniel Lescohier daniel.lescoh...@cbsi.com wrote: Ah, I missed that. I see it's in the doxygen docs for the random module. However, the sources aren't under random/, they're under misc/. I was switching between the doxygen docs and looking at the sources; maybe when I was looking for it, I missed it because the sources were under misc/. For the standard module: you can keep the unique id size the same, but still have a root of 10 bytes, by getting rid of the thread_index: thread_index is very wasteful, using 32 bits, when only a few bits are used.
Change:

typedef struct {
    unsigned int stamp;
    unsigned int in_addr;
    unsigned int pid;
    unsigned short counter;
    unsigned int thread_index;
} unique_id_rec;

to:

typedef struct {
    apr_uint32_t stamp;
    apr_uint32_t counter;
    char root[ROOT_SIZE];
} unique_id_rec;

Have the two ints first for alignment purposes, so there is no padding in the struct. With the counter field using apr_uint32_t, you can use apr_atomic_inc32() to do the increments. There is less need of thread_index if you increment the counter atomically. The initialization of the counter with random data also gives you more process randomness than the 10 bytes of root. On Fri, Jul 5, 2013 at 8:04 PM, Stefan Fritsch s...@sfritsch.de wrote: On Wednesday 26 June 2013, Daniel Lescohier wrote: When I looked into the ap random functions, I didn't like the implementation, because I didn't see anywhere in the httpd codebase that entropy is periodically added to the entropy pool. After reading the details of how the Linux entropy pool works (https://lwn.net/Articles/525204/), I decided to use /dev/urandom instead, since Linux is periodically adding entropy to it. This code is not portable, but this was for a private Apache module that is only used on Linux. To preserve entropy on the web server machine, I also only generate a random number once per apache child, then increment a uint32 portion of it for each unique id call. I also have seconds and microseconds, so that's why I think it's OK to do increments from the random base, instead of generating a new random id on each request. The "insecure" in ap_random_insecure_bytes is there for a reason. But if you only use it once per process anyway, it should be sufficient. The fact that several consumers (especially with multi-threaded mpms) pull from the same pool in undefined order adds some entropy, too. FWIW, there is apr_generate_random_bytes() which can do the reading of /dev/urandom for you.
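The two stamp_fraction encodings discussed in this thread can be sketched in plain C (my_time_t stands in for apr_time_t, microseconds since the epoch; this is an illustration, not mod_unique_id code):

```c
#include <stdint.h>
#include <arpa/inet.h>   /* htons/ntohs */

typedef int64_t my_time_t;                  /* microseconds since the epoch */
#define time_usec(t) ((uint32_t)((t) % 1000000))

/* Option 1: total microseconds modulo 65536, in network order.
 * Cheap, but the value wraps ~15.26 times per second, unaligned
 * to second boundaries. */
uint16_t frac_mod_65536(my_time_t t)
{
    return htons((uint16_t)(t & 0xFFFF));
}

/* Option 2: microseconds within the current second, shifted right by 4:
 * a 0..62499 value (denominator 62500) that resets cleanly each second. */
uint16_t frac_62500(my_time_t t)
{
    return htons((uint16_t)(time_usec(t) >> 4));
}
```

Either way the fraction only has to make collisions unlikely across processes; within a process the atomic counter already guarantees uniqueness.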
Re: [PATCH] mod_unique_id: use ap_random_insecure_bytes() to get unique ID
Ah, I missed that. I see it's in the doxygen docs for the random module. However, the sources aren't under random/, they're under misc/. I was switching between the doxygen docs and looking at the sources; maybe when I was looking for it, I missed it because the sources were under misc/. For the standard module: you can keep the unique id size the same, but still have a root of 10 bytes, by getting rid of the thread_index: thread_index is very wasteful, using 32 bits, when only a few bits are used. Change:

typedef struct {
    unsigned int stamp;
    unsigned int in_addr;
    unsigned int pid;
    unsigned short counter;
    unsigned int thread_index;
} unique_id_rec;

to:

typedef struct {
    apr_uint32_t stamp;
    apr_uint32_t counter;
    char root[ROOT_SIZE];
} unique_id_rec;

Have the two ints first for alignment purposes, so there is no padding in the struct. With the counter field using apr_uint32_t, you can use apr_atomic_inc32() to do the increments. There is less need of thread_index if you increment the counter atomically. The initialization of the counter with random data also gives you more process randomness than the 10 bytes of root. On Fri, Jul 5, 2013 at 8:04 PM, Stefan Fritsch s...@sfritsch.de wrote: On Wednesday 26 June 2013, Daniel Lescohier wrote: When I looked into the ap random functions, I didn't like the implementation, because I didn't see anywhere in the httpd codebase that entropy is periodically added to the entropy pool. After reading the details of how the Linux entropy pool works (https://lwn.net/Articles/525204/), I decided to use /dev/urandom instead, since Linux is periodically adding entropy to it. This code is not portable, but this was for a private Apache module that is only used on Linux. To preserve entropy on the web server machine, I also only generate a random number once per apache child, then increment a uint32 portion of it for each unique id call.
I also have seconds and microseconds, so that's why I think it's OK to do increments from the random base, instead of generating a new random id on each request. The "insecure" in ap_random_insecure_bytes is there for a reason. But if you only use it once per process anyway, it should be sufficient. The fact that several consumers (especially with multi-threaded mpms) pull from the same pool in undefined order adds some entropy, too. FWIW, there is apr_generate_random_bytes() which can do the reading of /dev/urandom for you.
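The scheme described in the quoted text (one read of /dev/urandom per child, then only an atomic increment per request) can be sketched portably with C11 atomics. This is an illustrative analogue, not the module's APR-based code:

```c
#include <stdio.h>
#include <stdatomic.h>
#include <stdint.h>

static atomic_uint_least32_t counter;   /* incremented once per request */
static uint32_t random_root;            /* per-process random identity */
static int seeded;

/* Seed once per process from /dev/urandom (Linux-specific, as in the post).
 * Returns 0 on success, -1 on failure. */
static int seed_from_urandom(void)
{
    FILE *f = fopen("/dev/urandom", "rb");
    uint32_t buf[2];
    if (!f || fread(buf, sizeof buf, 1, f) != 1) {
        if (f) fclose(f);
        return -1;
    }
    fclose(f);
    random_root = buf[0];
    atomic_store(&counter, buf[1]);     /* counter starts at a random value */
    seeded = 1;
    return 0;
}

/* Per-request: no further entropy is drawn from the kernel pool,
 * just an atomic increment of the randomly-initialized counter. */
uint32_t next_counter(void)
{
    return atomic_fetch_add(&counter, 1) + 1;
}
```

Combined with the request's seconds and microseconds, consecutive counter values still yield distinct ids without depleting the machine's entropy pool on every request.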
Re: [PATCH] mod_unique_id: use ap_random_insecure_bytes() to get unique ID
We have an internal custom Apache module that generates UUIDs; it's using a legacy format that we need for our application. A few months ago, I was upgrading it to Apache 2.4, and I worked on modifying the UUID generator function so that some of the bytes are random (including replacing the 4-byte IPv4 address element of the uuid that we used before with random data instead). When I looked into the ap random functions, I didn't like the implementation, because I didn't see anywhere in the httpd codebase that entropy is periodically added to the entropy pool. After reading the details of how the Linux entropy pool works (https://lwn.net/Articles/525204/), I decided to use /dev/urandom instead, since Linux is periodically adding entropy to it. This code is not portable, but this was for a private Apache module that is only used on Linux. To preserve entropy on the web server machine, I also only generate a random number once per apache child, then increment a uint32 portion of it for each unique id call. I also have seconds and microseconds, so that's why I think it's OK to do increments from the random base, instead of generating a new random id on each request. Our app requires a network-order time_t in the second 32-bit word, so that's why the function is like below. If our app didn't require time_t, then I would not have used the high-order bits of time_t, because they're fairly constant: to have a more unique id, it would be better to have more random data instead of that fairly constant data.

/**
 * Make a new unique id as follows:
 * 1. Define the id.
 * 2. Base64-encode the id.
 * @param r Apache's request record.
 * @return The base64-encoded cookie string, allocated in the request pool.
 */
static char *make_id( request_rec *r )
{
#define RAW_COOKIE_SIZE (14)
#define WANT_BASE64_COOKIE_SIZE (19)
#define BASE64_ENCODE_LEN(len) (((len + 2) / 3 * 4) + 1)
#define MASK_BITS(n) ((1 << n) - 1)
    struct {
        apr_uint32_t i1; /* b31: 0, b30-20: random, b19-0: usec; network order */
        apr_uint32_t i2; /* seconds, network order */
        apr_uint32_t i3; /* random, host order */
        apr_uint16_t s1; /* random, host order */
    } raw_cookie;
    apr_uint32_t i, rnd;
    char * const b64_cookie = apr_palloc( r->pool,
                                  BASE64_ENCODE_LEN(RAW_COOKIE_SIZE));
    ap_assert(b64_cookie);
    if (unlikely(! rand_initialized)) {
        /* We delay initialization as late as possible in order
         * to collect more entropy in the random pool. */
        init_random(r);
    }
    rnd = apr_atomic_inc32(&rand_uints.i1);
    i = rnd & MASK_BITS(11);
    i <<= 20;
    i += apr_time_usec(r->request_time);
    raw_cookie.i1 = htonl(i);
    raw_cookie.s1 = rnd >> 11;
    raw_cookie.i2 = htonl(apr_time_sec(r->request_time));
    raw_cookie.i3 = rand_uints.i2;
    apr_base64_encode_binary(b64_cookie, (const void *)&raw_cookie,
                             RAW_COOKIE_SIZE);
    if (BASE64_ENCODE_LEN(RAW_COOKIE_SIZE) > WANT_BASE64_COOKIE_SIZE)
        b64_cookie[WANT_BASE64_COOKIE_SIZE] = '\0';
    return b64_cookie;
}

And supporting code:

typedef struct {
    apr_uint32_t i1;
    apr_uint32_t i2;
} rand_uints_t;
static rand_uints_t rand_uints;
static apr_thread_mutex_t *rand_mutex;
static int rand_initialized;

static void init_random(request_rec *r)
{
    ap_assert(APR_SUCCESS == apr_thread_mutex_lock(rand_mutex));
    if (rand_initialized == 0) {
        /* we won the race to initialize */
        apr_file_t *f;
        apr_size_t bytes_read;
        ap_assert(APR_SUCCESS == apr_file_open(&f, "/dev/urandom",
                      APR_READ|APR_BINARY, 0444, r->pool));
        ap_assert(APR_SUCCESS == apr_file_read_full(f, &rand_uints,
                      sizeof(rand_uints_t), &bytes_read));
        ap_assert(sizeof(rand_uints) == bytes_read);
        ap_assert(APR_SUCCESS == apr_file_close(f));
        rand_initialized = 1;
    }
    ap_assert(APR_SUCCESS == apr_thread_mutex_unlock(rand_mutex));
}

/**
 * Initialization for each child process.
 */
static void init_child(apr_pool_t *p, server_rec *s)
{
    ap_assert(APR_SUCCESS == apr_thread_mutex_create(&rand_mutex,
                  APR_THREAD_MUTEX_DEFAULT, p));
}

On Wed, Jun 26, 2013 at 8:49 AM, Jan Kaluža jkal...@redhat.com wrote: Hi, currently mod_unique_id uses an apr_gethostname(...) and PID pair as the base to generate a unique ID. The way it's implemented brings some problems: 1. For IPv6-only hosts it uses the low-order bits of the IPv6 address as if they were unique, which is wrong. 2. It relies on working DNS. It can happen that the hostname does not have an IP assigned during httpd start (for example during boot), and I think that is still a valid use-case (without the mod_unique_id module loaded, httpd works well in this case). 3. It calls a 1-second sleep to overcome possible usage of the same PID after restart as the one used before the restart. If I'm right, we could fix the problems above by using ap_random_insecure_bytes instead of the in_addr/pid pair as a base for unique ID generation. It would also make
Re: Time for 2.4.5 ??
+ is not used in url-encoding. + is used in application/x-www-form-urlencoded encoding. So I assume this function has implemented unescaping of application/x-www-form-urlencoded. For regular URL-encoding, space is encoded as %20. On Fri, May 24, 2013 at 11:46 PM, Guenter Knauf fua...@apache.org wrote: Hi Daniel, On 25.05.2013 02:06, Guenter Knauf wrote: On 24.05.2013 23:45, Daniel Gruno wrote: That's fine by me, I'm not married to 'sleep' (although I do like a good nap) hehe, ok; I look into it soon. done. Found another small docu bug: r:unescape(string) -- Unescapes an URL-escaped string:

local url = "http%3a%2f%2ffoo.bar%2f1+2+3+%26+4+%2b+5"
local unescaped = r:escape(url)
-- returns 'http://foo.bar/1 2 3 & 4 + 5'

the function call here should be r:unescape(url) ... Gün.
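The distinction above can be shown with a small decoder: '+' means space only when the form-urlencoded rule is enabled, while %20 always decodes to a space. url_decode is a hypothetical helper written for illustration, not an httpd API:

```c
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* Percent-decode src into dst (dst must hold at least strlen(src)+1 bytes).
 * If plus_is_space is nonzero, also decode '+' as ' ', which is the
 * application/x-www-form-urlencoded rule; plain URL-encoding leaves '+'
 * alone and only understands %20 for space. */
void url_decode(char *dst, const char *src, int plus_is_space)
{
    while (*src) {
        if (*src == '%' && isxdigit((unsigned char)src[1])
                        && isxdigit((unsigned char)src[2])) {
            char hex[3] = { src[1], src[2], 0 };
            *dst++ = (char)strtol(hex, NULL, 16);
            src += 3;
        } else if (*src == '+' && plus_is_space) {
            *dst++ = ' ';
            src++;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = 0;
}
```

With the form-urlencoded rule, "1+2+3+%26+4+%2b+5" decodes to "1 2 3 & 4 + 5", matching the example in Guenter's report; without it, the '+' characters pass through untouched.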
Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
It looks like apr_atomic is not in good-enough shape to be used for lock-free algorithms. It's probably good enough for the use-cases the event mpm uses it for, because the event mpm is not using it to protect memory besides the counters it's atomically changing. However, the time caches would use the cache key to protect other memory (the cache value), so apr_atomic is not good enough, because it's inconsistent across platforms with respect to memory barriers. For example, on x86, the add32 and cas32 functions act as a full memory barrier on all memory. However, on powerpc, add32 and cas32 only act as a memory barrier on the individual item being changed; they don't act as a memory barrier on all memory. One would have to add lightweight sync assembler instructions to the add32 and cas32 function implementations for powerpc if you want the same semantics as x86. The problem is that the apr_atomic API is not explicit about memory-ordering semantics. I agree with you that in the future the apr_atomic API should be changed to be more like the C11 atomic API, which allows explicit choice of memory ordering. But that leaves us with the problem of the current time-caching code, which is not thread-safe. We should fix it using the existing APR APIs, and not use the apr_atomic API, which is broken for this use-case. I found that there is a lock-free method available to us with the current APR API. We can use apr_thread_mutex_trylock. In glibc, that just attempts to cmpxchg the userspace futex, and if it fails, it doesn't do a futex system call and a wait on the caller, so it's lock-free. On Windows, TryEnterCriticalSection probably works the same way.
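The explicit memory orderings of the C11-style API mentioned above give exactly the "publish the value, then set the key" guarantee the time cache needs, without assuming x86's full-barrier CAS semantics. A minimal release/acquire sketch (illustrative names, not APR code):

```c
#include <stdatomic.h>

static int payload;          /* the "cache value" (plain, non-atomic memory) */
static atomic_int ready;     /* the "cache key"/flag that guards it */

void publisher(int v)
{
    payload = v;                                  /* 1: write the data */
    atomic_store_explicit(&ready, 1,
                          memory_order_release);  /* 2: publish the flag;
                                                     orders step 1 before it */
}

int consumer(int *out)
{
    /* An acquire load that sees the flag is guaranteed to also see
     * every write the publisher made before its release store. */
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = payload;
        return 1;
    }
    return 0;
}
```

On powerpc this compiles to the lwsync-style barriers the post says apr_atomic's add32/cas32 lack, while on x86 the plain loads and stores already satisfy release/acquire.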
So, the pseudo-code would be something like this. For reading from cache:

    apr_uint32_t key = cache_element->key;
    /* Above is done speculatively */
    if (seconds == key && seconds != 0) {
        /* seconds == 0 may mean cache is uninitialized, so don't use cache */
        *xt = cache_element->xt;
        /* After copying xt, make sure cache_element was not marked invalid
         * by another thread beginning an update, and that cache_element
         * really contained data for our second. */
        if (apr_thread_mutex_trylock(cache_element->mutex) == APR_SUCCESS) {
            key = cache_element->key;
            apr_thread_mutex_unlock(cache_element->mutex);
            if (key == seconds) {
                xt->tm_usec = (int)apr_time_usec(t);
                return APR_SUCCESS;
            }
        }
    }
    /* calculate the value if reached here */

Write-to-cache code:

    if (apr_thread_mutex_trylock(cache_element->mutex) == APR_SUCCESS) {
        /* We won the race to update this cache_element. */
        cache_element->key = ~seconds;
        /* Above marks cache_element as invalid by using ~seconds,
         * for the speculative reads not using the mutex. */
        cache_element->xt = *xt;
        /* Finished copying, now update key with our key */
        cache_element->key = seconds;
        apr_thread_mutex_unlock(cache_element->mutex);
    }

The above is lock-free because process scheduling decisions by the OS won't cause the current thread to block; it'll keep doing work without blocking as long as this thread is scheduled. It would be fast, though perhaps not quite as fast as if we had an atomic API we could use effectively. With an atomic API, the read-from-time-cache code would only be reading from cache lines, not writing to cache lines with the futex's cmpxchg instruction. You might also be able to use a lighter-weight memory barrier. But the mutex is lock-free and fast, and it's something that'd work with today's APIs; and the existing code in production is not correct.
BTW, I looked at rwlock_trylock, and that isn't lock-free in glibc, because the readers/writers count variables are protected with a low-level-lock futex that is acquired with lll_lock, not lll_trylock. So, if a thread acquires the low-level lock but is unscheduled by the OS before it finishes updating the readers/writers count variables and unlocks the low-level lock, other threads would block before being able to access the readers/writers count variables. If it had used lll_trylock instead, I think it would've been lock-free. However, I don't think rwlock would give us any particular advantage over a regular mutex for the above algorithm, so using the simpler mutex is better, and it is lock-free. I think the important part is to have a lock-free algorithm: if we lose a little bit of performance by not being able to use the lightest lock-free algorithm, I think that is OK.

On Mon, Jan 7, 2013 at 5:50 PM, Stefan Fritsch s...@sfritsch.de wrote: On Sunday 06 January 2013, Daniel Lescohier wrote: I'm not sure we should include memory barriers inside the apr_atomic_read32 and apr_atomic_set32 functions, because
Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
Some timing tests:

    $ gcc -I/usr/include/apr-1 -lrt -lapr-1 -lpthread test.c
    $ ./a.out
    apr_time_exp_lt: 108 nanosecs per call
    localtime_r: 85 nanosecs per call
    apr_time_exp_gmt: 59 nanosecs per call
    gmtime_r: 35 nanosecs per call
    pthread_mutex_trylock/unlock/read cache: 19 nanosecs per call
    lfence/read cache: 8 nanosecs per call
    read cache: 4 nanosecs per call

    $ cat test.c
    #include <time.h>
    #include <stdio.h>
    #include <assert.h>
    #include <apr_time.h>
    #include <pthread.h>

    int main(int argc, char *argv[])
    {
        struct timespec old, new;
        struct tm tmv;
        int i, iterations = 1000;
        long nanosecs;
    #define CALC_NS(new, old) \
        ((new.tv_sec - old.tv_sec)*1000000000L + new.tv_nsec - old.tv_nsec)
        apr_time_t t = apr_time_now();
        apr_time_exp_t xt;
        pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
        unsigned int seconds = 1;
        typedef struct { char s[44]; unsigned int key; } data_t;
        static volatile data_t cache_data;
        data_t my_copy;
        cache_data.key = 1;

        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
        for (i = 0; i < iterations; ++i) {
            apr_time_exp_lt(&xt, t);
        }
        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
        nanosecs = CALC_NS(new, old);
        printf("apr_time_exp_lt: %ld nanosecs per call\n", nanosecs/iterations);

        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
        for (i = 0; i < iterations; ++i) {
            localtime_r(&old.tv_sec, &tmv);
        }
        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
        nanosecs = CALC_NS(new, old);
        printf("localtime_r: %ld nanosecs per call\n", nanosecs/iterations);

        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
        for (i = 0; i < iterations; ++i) {
            apr_time_exp_gmt(&xt, t);
        }
        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
        nanosecs = CALC_NS(new, old);
        printf("apr_time_exp_gmt: %ld nanosecs per call\n", nanosecs/iterations);

        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
        for (i = 0; i < iterations; ++i) {
            gmtime_r(&old.tv_sec, &tmv);
        }
        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
        nanosecs = CALC_NS(new, old);
        printf("gmtime_r: %ld nanosecs per call\n", nanosecs/iterations);

        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
        for (i = 0; i < iterations; ++i) {
            unsigned int key = cache_data.key;
            if (seconds == key && key != 0) {
                my_copy = cache_data;
                if (pthread_mutex_trylock(&mutex) == 0) {
                    key = cache_data.key;
                    pthread_mutex_unlock(&mutex);
                }
            }
        }
        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
        nanosecs = CALC_NS(new, old);
        printf("pthread_mutex_trylock/unlock/read cache: %ld nanosecs per call\n",
               nanosecs/iterations);

        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
        for (i = 0; i < iterations; ++i) {
            unsigned int key = cache_data.key;
            if (seconds == key && key != 0) {
                my_copy = cache_data;
                asm volatile("lfence" ::: "memory");
                key = cache_data.key;
            }
        }
        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
        nanosecs = CALC_NS(new, old);
        printf("lfence/read cache: %ld nanosecs per call\n", nanosecs/iterations);

        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
        for (i = 0; i < iterations; ++i) {
            unsigned int key = cache_data.key;
            if (seconds == key && key != 0) {
                my_copy = cache_data;
                key = cache_data.key;
            }
        }
        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
        nanosecs = CALC_NS(new, old);
        printf("read cache: %ld nanosecs per call\n", nanosecs/iterations);
    }

On Tue, Jan 8, 2013 at 1:50 PM, Daniel Lescohier daniel.lescoh...@cbsi.com wrote: It looks like apr_atomic is not in good-enough shape to be used for lock-free algorithms. It's probably good enough for the use-cases the event mpm uses it for, because the event mpm is not using it to protect memory beyond the counters it's atomically changing. However, the time caches would use the cache key to protect other memory (the cache value), so apr_atomic is not good enough, because it's inconsistent across platforms with respect to memory barriers. For example, on x86, the add32 and cas32 functions act as a full memory barrier on all memory. For powerpc, however, add32 and cas32 only act as a memory barrier on the individual item being changed; they don't act as a memory barrier on all memory.
One would have to add lightweight sync assembler instructions to the add32 and cas32 function implementations for powerpc if you want the same semantics as x86. The problem is that the apr_atomic API is not explicit with the memory ordering semantics. I agree with you that in the future the apr_atomic API should be changed to be more like the C11 atomic API, which allows explicit choice of memory ordering. But that leaves us with the problem of the current time caching code, which is not thread-safe. We should fix it using the existing APR APIs, and not use the apr_atomic API, which is broken for this use-case.
Re: event mpm and mod_status
I see that the event mpm uses a worker queue with a condition variable, and it signals the condition variable when something is pushed onto the queue. If all of the cpu cores are doing useful work, the signal is not going to force a context switch out of a thread doing useful work; the thread will continue working until it has used up its timeslice, as long as it doesn't end early because it becomes blocked on i/o or something like that. So, in that case, the first thread that finishes serving its request and gets to ap_queue_pop_something(worker_queue...) will get the item on the queue, and it will have hot caches. On the other hand, if only some of the cpu cores are active, but the threads that are active are busy in the middle of a request, the condition variable signal will wake up one of the idle threads, the thread will be scheduled onto an idle core, and that thread can start doing useful work. That thread may have colder caches, but the other cores were busy doing useful work anyway, so it's worth activating this thread on an idle core.

Also, if a worker thread is idle waiting on ap_queue_pop_something(worker_queue...), and the thread was unscheduled by the OS process scheduler, then when you wake it up again the OS process scheduler won't necessarily schedule it on the same cpu core it was on before. So it won't necessarily have a warm cache after you wake it up again. You're only guaranteed a warm cache if the thread remains scheduled by the OS. The current queue/condvar implementation favors sending new requests to threads that remain scheduled by the OS.

On Sat, Jan 5, 2013 at 9:37 AM, Jim Jagielski j...@jagunet.com wrote: +1... a lot of little improvements can result in a BIG improvement. On Jan 5, 2013, at 8:34 AM, Graham Leggett minf...@sharp.fm wrote: On 05 Jan 2013, at 2:05 AM, Stefan Fritsch s...@sfritsch.de wrote: For 1., a better thread selection would definitely be a win. For 2. and 3., it is less obvious. +1.
Just because in some cases a cache might not help, doesn't mean we shouldn't take advantage of the cache when it will help. Regards, Graham --
Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
I'd have to go back and review the Intel documentation to be sure, but for this particular algorithm, an explicit memory barrier may be required on Intel architecture, also. If I remember correctly, without a memory barrier, Intel arch only guarantees total memory ordering within one cache line. For this algorithm, we have an array of 16 cache_elements of 48 bytes each, so half of the cache_elements cross 64-byte cache lines. When reading the cache_element->key after the copy of the cache_element value, we need to make sure that the cache_element value read is ordered before the read of the cache_element->key, so one needs a memory barrier just before the read of the cache_element->key to guarantee the ordering. On Sat, Jan 5, 2013 at 5:08 AM, Igor Galić i.ga...@brainsware.org wrote: Sigh. I was too much focused on x86. There the compiler barrier caused by the function call is enough. But you are right, on other architectures these functions may also require cpu memory barriers. on x86 … is enough - would it harm x86 to add CPU barriers, or do we want to have # define distinctions per arch? ignore me, I just realized it's going to be different calls per arch anyway! -- Igor Galić Tel: +43 (0) 664 886 22 883 Mail: i.ga...@brainsware.org URL: http://brainsware.org/ GPG: 6880 4155 74BD FD7C B515 2EA5 4B1D 9E08 A097 C9AE
Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
I was wrong about the Intel architecture: it has a pretty strong memory-ordering guarantee. But other architectures, like PowerPC and ARM, have a weak memory-ordering guarantee. Here is a good summary: https://en.wikipedia.org/wiki/Memory_ordering. The doc in footnote 3 is a good document to read.

I'm not sure we should include memory barriers inside the apr_atomic_read32 and apr_atomic_set32 functions, because you don't necessarily need a full memory barrier on each read or set. Instead, perhaps we should add three functions to APR:

1. apr_atomic_rmb(): read memory barrier.
2. apr_atomic_wmb(): write memory barrier.
3. apr_atomic_mb(): full memory barrier.

As a use-case, the new util_time.c algorithm could be changed as follows. For the read-from-cache portion:

    const apr_uint32_t key = apr_atomic_read32(&cache_element->key);
    /* Above is done speculatively, no memory barrier used. */
    if (seconds == key && seconds != 0) {
        /* seconds == 0 may mean cache is uninitialized, so don't use cache */
        *xt = cache_element->xt;
        /* After copying xt, make sure cache_element was not marked invalid
         * by another thread beginning an update, and that cache_element
         * really contained data for our second.
         * Requires memory barrier. */
        apr_atomic_rmb();
        if (apr_atomic_read32(&cache_element->key) == seconds) {
            xt->tm_usec = (int)apr_time_usec(t);
            return APR_SUCCESS;
        }
    }

And the write-to-cache portion would be:

    if (key == apr_atomic_cas32(&cache_element->key, ~seconds, key)) {
        /* We won the race to update this cache_element.
         * Above marks cache_element as invalid by using ~seconds,
         * because we are starting an update: it's the start of a
         * transaction. */
        cache_element->xt = *xt;
        /* Finished copying, now update key with our key,
         * ending the transaction. Need to use a memory barrier. */
        apr_atomic_wmb();
        apr_atomic_set32(&cache_element->key, seconds);
    }

On Sat, Jan 5, 2013 at 8:40 AM, Daniel Lescohier daniel.lescoh...@cbsi.com wrote: I'd have to go back and review the Intel documentation to be sure, but for this particular algorithm, an explicit memory barrier may be required on Intel architecture, also. [...]
thread-safety of time conversion caches in util_time.c and mod_log_config.c
I entered a bug on the thread-safety of the time conversion caches in server/util_time.c and modules/loggers/mod_log_config.c. In brief, they're not thread-safe because:

1. The algorithm depends on total memory ordering, and both the volatile qualifier and memory barriers are not used.
2. The algorithm is subject to the ABA problem.

The details of the problem are here: https://issues.apache.org/bugzilla/show_bug.cgi?id=54363 I included in the bug not-yet-tested patches with a different algorithm. Do other people agree that the existing algorithm has the problems explained in detail in the bug? Thanks, Daniel Lescohier
Re: event mpm and mod_status
I just saw this from last month from Stefan Fritsch and Niklas Edmundsson:

The fact that the client ip shows up on all threads points to some potential optimization: Recently active threads should be preferred, because their memory is more likely to be in the cpu caches. Right now, the thread that has been idle the longest time will do the work.

Ah, virtually guaranteeing that the thread with the coldest cache gets to do the work... I definitely agree on the potential for improvement here; you would most likely want to select either the thread that processed this request last time, or the most recently active idle thread. These two conditions kinda collide though, so the challenge is probably to come up with some rather cheap selection algorithm that is good enough.

Which CPU memory caches are you referring to? 1. For stack, on a new request, you're writing a new call stack; the prior request's stack was unwound. 2. For heap, you're creating a new request pool, so you will be popping an apr memnode from the head of the allocator's free list. It may even be the same memnode that was pushed onto the free list when the prior request's apr pool was freed. 3. On a new request, the next code/instructions it will execute will be the functions that read the http request and populate the request_rec. That's different code than ran at the end of the prior request (serving the request, logging the request, etc.). So, I'm not convinced a thread selection algorithm is needed.
Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
apr_atomic_read32 is not implemented with a memory barrier. The implementation of apr_atomic_read32 in the APR code base is just a read through a pointer marked volatile. E.g., here is the atomic/unix/builtins.c implementation:

    APR_DECLARE(apr_uint32_t) apr_atomic_read32(volatile apr_uint32_t *mem)
    {
        return *mem;
    }

The atomic/unix/builtins.c implementation would have to change to this to have a memory barrier:

    APR_DECLARE(apr_uint32_t) apr_atomic_read32(volatile apr_uint32_t *mem)
    {
        __sync_synchronize();
        return *mem;
    }

All the other apr atomic implementations would have to change if apr_atomic_read32 is supposed to mean with-a-memory-barrier. None of the current implementations implement a memory barrier (they all use the C definition shown at the beginning of this email), so I assume apr_atomic_read32 (and apr_atomic_set32) means without-a-memory-barrier.

The Apache Portable Runtime API has no memory-barrier function (like gcc's builtin __sync_synchronize()), so I had to choose among the other functions in the API. I decided to use compare-and-swap for the memory barrier, because the generic compare-and-swap concept implies the use of memory barriers, and the APR implementation actually implements a memory barrier in apr_atomic_cas32. I also could've used apr_atomic_add32(&cache_element->key, 0) == seconds, but compare-and-swap is the more generic concept, so I used that instead. I agree, though, that for the read-the-cache_element portion of the function, it'd be better to do just an atomic read instead of a compare-and-swap, if only the APR atomic API had a memory-barrier function I could call before the read. After copying from the time cache_element to this thread's memory allocated from the request pool, the thread needs a memory barrier when checking cache_element->key, in order to make sure another cpu thread/core didn't start modifying the cache_element while this thread was copying it.
But note, before doing the copy to this thread's memory, the function doesn't use a memory barrier: it checks cache_element->key without a memory barrier, because it's speculatively doing the copy (almost all the time there won't be a conflict), and then doing the validation with a memory barrier after the copy is done (and in the rare case that the validation fails, the thread just calculates the time value itself).

For the old code, I haven't seen bad results in practice (not that I probably would've noticed); I only noticed that the code is not portably thread-safe because I was working on a project to upgrade our web servers to 2.4, and I was updating our own internal custom Apache modules, and I noticed this when looking at the source of mod_log_config.c. So, I brought it up because the algorithm does not look correct. The problem is that the old code was supposedly a lock-free and wait-free algorithm, but it didn't use the atomic processor instructions you need to use in order to implement lock-free and wait-free algorithms.

On Fri, Jan 4, 2013 at 7:58 PM, Stefan Fritsch s...@sfritsch.de wrote: On Friday 04 January 2013, Daniel Lescohier wrote: I entered a bug on the thread-safety of the time conversion caches in server/util_time.c and modules/loggers/mod_log_config.c. In brief, they're not thread-safe because: 1. The algorithm depends on total memory ordering, and both the volatile qualifier and memory barriers are not used. 2. The algorithm is subject to the ABA problem. I agree that the lack of memory barriers is a problem. And the ABA problem would not exist if callers of ap_recent_* would actually check that the time *is* recent (as written in the docs in util_time.h). This is not a problem for the error log, because it always uses apr_time_now(). But the request time in mod_log_config may be further in the past. Out of interest, have you seen the function give wrong results in practice?
The details of the problem are here: https://issues.apache.org/bugzilla/show_bug.cgi?id=54363 I included in the bug not-yet-tested patches with a different algorithm. Do other people agree that the existing algorithm has the problems explained in detail in the bug? I haven't reviewed your patches in detail, yet. One thing: apr_atomic_read should be enough, you don't need apr_atomic_cas with the same value. Cheers, Stefan