Re: httpd 2.4.25, mpm_event, ssl: segfaults

2017-02-28 Thread Daniel Lescohier
All writes to Linux sockets mean the kernel copies the data into 2kiB
buffers used by SKBs.  It's copied somewhere into the middle of that 2kiB
buffer, so that TCP/IP headers can be prepended by the kernel.  Even with
TCP Segmentation Offload, 2kiB buffers are still used; TSO just means that
the TCP/IP headers need to be calculated only once for an array of buffers,
after which the kernel puts an array of pointers in the network card's ring
buffer.

The kernel will only put on the wire as much data as the current TCP
congestion window allows, but it has to keep each packet in its own buffers
until the remote side ACKs that packet.
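For illustration, a minimal sketch (editor's addition, Linux-specific) of
how to observe that buffered-but-unACKed data: the SIOCOUTQ ioctl reports
how many bytes are still sitting in the socket's send queue, i.e. written
by the application but not yet acknowledged by the remote side.

#include <sys/ioctl.h>
#include <linux/sockios.h>

/* Returns the number of bytes written to the TCP socket that the
 * kernel is still holding (unsent plus sent-but-unACKed), or -1. */
static int unacked_bytes(int sock)
{
    int bytes = 0;
    if (ioctl(sock, SIOCOUTQ, &bytes) < 0)
        return -1;
    return bytes;
}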

On Mon, Feb 27, 2017 at 2:25 PM, William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> On Mon, Feb 27, 2017 at 12:16 PM, Jacob Champion <champio...@gmail.com>
> wrote:
> >
> > On 02/23/2017 04:48 PM, Yann Ylavic wrote:
> >> On Wed, Feb 22, 2017 at 8:55 PM, Daniel Lescohier wrote:
> >>>
> >>>
> >>> IOW: read():Three copies: copy from filesystem cache to httpd
> >>> read() buffer to encrypted-data buffer to kernel socket buffer.
> >
> >>
> >> Not really, "copy from filesystem cache to httpd read() buffer" is
> >> likely mapping to userspace, so no copy (on read) here.
> >
> > Oh, cool. Which kernels do this? It seems like the VM tricks would have
> > to be incredibly intricate for this to work; reads typically don't happen
> > in page-sized chunks, nor to aligned addresses. Linux in particular has
> > comments in the source explaining that they *don't* do it for other
> > syscalls (e.g. vmsplice)... but I don't have much experience with
> > non-Linux systems.
>
> I don't understand this claim.
>
> If read() returned an API-provisioned buffer, it could point wherever it
> liked, including a 4k page. As things stand, the void* (or char*) of the
> read() buffer is at an arbitrary offset, and no common OS I'm familiar
> with maps a page to a non-page-aligned address.
>
> The kernel socket send[v]() call might avoid copy in the direct-send case,
> depending on the implementation.
>


Re: httpd 2.4.25, mpm_event, ssl: segfaults

2017-02-24 Thread Daniel Lescohier
On Thu, Feb 23, 2017 at 7:48 PM, Yann Ylavic <ylavic@gmail.com> wrote:

> On Wed, Feb 22, 2017 at 8:55 PM, Daniel Lescohier wrote:
> >
> > IOW:
> > read():Three copies: copy from filesystem cache to httpd read() buffer to
> > encrypted-data buffer to kernel socket buffer.
>
> Not really, "copy from filesystem cache to httpd read() buffer" is
> likely mapping to userspace, so no copy (on read) here.
>
> > mmap(): Two copies: filesystem page already mapped into httpd, so just
> > copy from filesystem (cached) page to encrypted-data buffer to kernel
> > socket buffer.
>
> So, as you said earlier the "write to socket" isn't a copy either,
> hence both read() and mmap() implementations could work with a single
> copy when mod_ssl is involved (this is more than a copy but you
> counted it above so), and no copy at all without it.
>

When you do a write() system call to a socket, the kernel must copy the
data from the userspace buffer into its own buffers, because of data
lifetime.  When the write() system call returns, userspace is free to
modify the buffer (which it owns).  But the data from that write() call
must live a long time in the kernel: the kernel needs to keep a copy of it
until the remote system ACKs all of it.  The data is referenced first in
the kernel transmission control system, then in the network card's ring
buffers.  If the remote system's feedback indicates that a packet was
dropped or corrupted, the kernel may send it multiple times.  The data has
a different lifetime than the userspace buffer, so the kernel must copy it
to a buffer it owns.
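A minimal sketch (not from the thread) of that lifetime argument: the
moment write() returns, userspace may legally scribble over the buffer, so
the kernel must already hold its own copy of the data.

#include <string.h>
#include <unistd.h>

static void send_twice(int sock)
{
    char buf[4096];

    memset(buf, 'A', sizeof(buf));
    if (write(sock, buf, sizeof(buf)) < 0)
        return;
    /* Legal immediately: the kernel already copied the 'A's into its
     * own buffers, which it keeps until the remote side ACKs them. */
    memset(buf, 'B', sizeof(buf));
    write(sock, buf, sizeof(buf));
}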

On the userspace high-order memory allocations: I still don't see what the
problem is.  Say you're using 64kiB buffers.  When you free the buffers
(e.g., at the end of the http request), they go into the memory allocator's
64kiB free-list, and they're available to be allocated again (e.g., by
another http request).  The memory allocator won't use the 64kiB free
chunks for smaller allocations unless the free-lists for the smaller orders
are emptied out.  That'd mean there was a surge in demand for smaller-size
allocations, so it'd make sense to start using the higher-order free chunks
instead of calling brk().  Only if there are no more high-order free chunks
left will the allocator have to call brk().  When the kernel gets the brk()
request, if the system is short of high-order contiguous memory, it doesn't
have to hand out physically contiguous pages for that brk() call.  The Page
Table Entries for that request can be composed of many individual 4kiB
pages scattered all over physical memory.  That's hidden from userspace;
userspace sees a contiguous range of virtual memory.
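As a toy sketch (editor's addition; single-threaded, not the real
allocator), the free-list behavior described above looks like this: a
freed 64kiB buffer is threaded onto a list and handed back to the next
request rather than going back through malloc()/brk().

#include <stdlib.h>

#define BUF_ORDER_SIZE (64 * 1024)

static void *free_list_64k;     /* head of the 64kiB free-list */

static void *get_buf(void)
{
    void *p = free_list_64k;
    if (p) {                        /* reuse a previously freed buffer */
        free_list_64k = *(void **)p;
        return p;
    }
    return malloc(BUF_ORDER_SIZE);  /* may grow the heap via brk() */
}

static void put_buf(void *p)
{
    *(void **)p = free_list_64k;    /* push onto the free-list */
    free_list_64k = p;
}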


Re: httpd 2.4.25, mpm_event, ssl: segfaults

2017-02-23 Thread Daniel Lescohier
Why would high-order memory allocations be a problem in userspace code,
which is using virtual memory?  I thought high-order allocations are a big
problem in kernel space, which has to deal with physical pages.

But when you write to a socket, doesn't the kernel scatter the userspace
buffer into multiple SKBs?  SKBs on order-0 pages allocated by the kernel?


On Thu, Feb 23, 2017 at 1:16 PM, Jacob Champion 
wrote:

> On 02/23/2017 08:34 AM, Yann Ylavic wrote:
> > Actually I'm not very pleased with this solution (or the final one
> > that would make this size open / configurable).
> > The issue is potentially the huge (order big-n) allocations which
> > finally may hurt the system (fragmentation, OOM...).
>
> Power users can break the system, and this is a power tool, right? And we
> have HugeTLB kernels and filesystems to play with, with 2MB and bigger
> pages... Making all these assumptions for 90% of users is perfect for the
> out-of-the-box experience, but hardcoding them so that no one can fix
> broken assumptions seems Bad.
>
> (And don't get me wrong, I think applying vectored I/O to the brigade
> would be a great thing to try out and benchmark. I just think it's a
> long-term and heavily architectural fix, when a short-term change to get
> rid of some #defined constants could have immediate benefits.)
>
>> I've no idea how much it costs to have 8K vs 16K records, though.
>> Maybe in the mod_ssl case we'd want 16K buffers, still reasonable?
>>
>
> We can't/shouldn't hardcode this especially. People who want maximum
> throughput may want nice big records, but IIRC users who want progressive
> rendering need smaller records so that they don't have to wait as long for
> the first decrypted chunk. It needs to be tunable, possibly per-location.
>
> --Jacob
>


Re: httpd 2.4.25, mpm_event, ssl: segfaults

2017-02-22 Thread Daniel Lescohier
On Wed, Feb 22, 2017 at 2:42 PM, Jacob Champion 
wrote:

> Ah, but they *do*, as Yann pointed out earlier. We can't just deliver the
> disk cache to OpenSSL for encryption; it has to be copied into some
> addressable buffer somewhere. That seems to be a major reason for the
> mmap() advantage, compared to a naive read() solution that just reads into
> a small buffer over and over again.
>


IOW:
read():Three copies: copy from filesystem cache to httpd read() buffer to
encrypted-data buffer to kernel socket buffer.
mmap(): Two copies: filesystem page already mapped into httpd, so just copy
from filesystem (cached) page to encrypted-data buffer to kernel socket
buffer.
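A sketch of the two paths (editor's illustration; OpenSSL assumed for the
encryption step, len assumed to fit an SSL record, off page-aligned):

#include <sys/mman.h>
#include <unistd.h>
#include <openssl/ssl.h>

/* read() path: page cache -> buf (copy 1), buf -> SSL record during
 * encryption (copy 2), record -> kernel socket buffer (copy 3). */
static int send_with_read(SSL *ssl, int fd, char *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    return (n > 0) ? SSL_write(ssl, buf, (int)n) : -1;
}

/* mmap() path: the file page is mapped into the process, so the
 * encryption reads the page cache directly (copy 1 into the record),
 * and the record is then copied to the kernel socket buffer (copy 2). */
static int send_with_mmap(SSL *ssl, int fd, off_t off, size_t len)
{
    char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
    int n;
    if (p == MAP_FAILED)
        return -1;
    n = SSL_write(ssl, p, (int)len);
    munmap(p, len);
    return n;
}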


Re: httpd 2.4.25, mpm_event, ssl: segfaults

2017-02-17 Thread Daniel Lescohier
Is the high-level issue that, for serving static content over HTTP, you can
use sendfile() from the OS filesystem cache, avoiding extra userspace
copying, but if it's SSL, or any other dynamic filtering of content, you
have to do extra work in userspace?
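For the plain-HTTP case, a minimal sketch (editor's addition) of that fast
path; the copy happens entirely in-kernel, and it's unavailable once the
content must pass through userspace filters such as mod_ssl (kernel TLS
aside):

#include <sys/sendfile.h>

/* Copies up to len bytes from the page cache backing filefd straight
 * into the socket's send queue; *off is advanced by the amount sent. */
static ssize_t send_static(int sock, int filefd, off_t *off, size_t len)
{
    return sendfile(sock, filefd, off, len);
}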


On Thu, Feb 16, 2017 at 6:01 PM, Yann Ylavic  wrote:

> On Thu, Feb 16, 2017 at 10:51 PM, Jacob Champion 
> wrote:
> > On 02/16/2017 02:49 AM, Yann Ylavic wrote:
> >>
> >> +#define FILE_BUCKET_BUFF_SIZE (64 * 1024 - 64) /* > APR_BUCKET_BUFF_SIZE */
> >
> >
> > So, I had already hacked my O_DIRECT bucket case to just be a copy of
> > APR's file bucket, minus the mmap() logic. I tried making this change on
> > top of it...
> >
> > ...and holy crap, for regular HTTP it's *faster* than our current mmap()
> > implementation. HTTPS is still slower than with mmap, but faster than it
> > was without the change. (And the HTTPS performance has been really
> > variable.)
> >
> > Can you confirm that you see a major performance improvement with the
> > new 64K file buffer?
>
> I can't test speed for now (stick with my laptop/localhost, which
> won't be relevant enough I guess).
>
> > I'm pretty skeptical of my own results at this
> > point... but if you see it too, I think we need to make *all* these
> > hard-coded numbers tunable in the config.
>
> We could also improve the apr_bucket_alloc()ator to recycle more
> order-n allocation possibilities (saving as many
> {apr_allocator_,m}alloc() calls); along with configurable/higher
> orders in httpd, that'd be great I think.
>
> I can try this patch...
>


Re: httpd 2.4.25, mpm_event, ssl: segfaults

2017-02-07 Thread Daniel Lescohier
Here is how cache page replacement is done in Linux:
https://linux-mm.org/PageReplacementDesign


On Tue, Feb 7, 2017 at 5:32 AM, Niklas Edmundsson  wrote:

> On Mon, 6 Feb 2017, Jacob Champion wrote:
>
> 
>
>> Considering the massive amount of caching that's built into the entire
>> HTTP ecosystem already, O_DIRECT *might* be an effective way to do that (in
>> which we give up filesystem optimizations and caching in return for a DMA
>> into userspace). I have a PoC about halfway done, but I need to split my
>> time this week between this and the FCGI stuff I've been neglecting.
>>
>
> As O_DIRECT bypasses the cache, all your IO will hit your storage device.
> I think this makes it useful only in exotic usecases.
>
> For our workload that's using mod_cache with a local disk cache, we still
> want to use the remaining RAM as a disk read cache; it doesn't make sense
> to force disk I/O for files that simply could have been served from RAM. We
> typically see 90-95% of the requests served by mod_cache actually being
> served from the disk cache in RAM rather than hitting the local disk cache
> devices. I'm suspecting the picture is similar for most occurrences of
> static file serving, regardless of using mod_cache etc or not.
>
> O_DIRECT also bypasses any read-ahead logic, so you'll have to do nice and
> big IO etc to get good performance.
>
> We've played around with O_DIRECT to optimize the caching process in our
> large-file caching module (our backing store is nfs). However, since all
> our hosts are running Linux we had much better results with doing plain
> reads and utilizing posix_fadvise with POSIX_FADV_WILLNEED to trigger
> read-ahead and POSIX_FADV_DONTNEED to drop the original file from cache
> when read (as future requests will be served from local disk cache). We're
> doing 8MB fadvise chunks to get full streaming performance when caching
> large files.
>
> But that's way out of scope for this discussion, I think ;-)
>
> In conclusion, I wouldn't expect any benefits of using O_DIRECT in the
> common httpd usecases. That said, I would gladly be proven wrong :)
>
> /Nikke
> --
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> -=-=-=-=-=-=-=-
>  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se  | ni...@acc.umu.se
> 
> ---
>  "The point is I am now a perfectly safe penguin!" -- Ford Prefect
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> =-=-=-=-=-=-=-=
>
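A minimal sketch (editor's addition; Linux assumed, 8MB chunks as Niklas
describes, helper name hypothetical) of that fadvise pattern:

#include <fcntl.h>

#define CHUNK (8 * 1024 * 1024)

/* Called after each CHUNK-sized read ending at offset 'done': prefetch
 * the next chunk, and drop the one just consumed from the page cache. */
static void advise_window(int fd, off_t done, off_t filesize)
{
    if (done + CHUNK < filesize)
        posix_fadvise(fd, done + CHUNK, CHUNK, POSIX_FADV_WILLNEED);
    if (done >= CHUNK)
        posix_fadvise(fd, done - CHUNK, CHUNK, POSIX_FADV_DONTNEED);
}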


Re: allow newlines in T_ESCAPE_LOGITEM?

2016-04-19 Thread Daniel Lescohier
Of course, if you use "reliable piped logging", then writes of more than
PIPE_BUF bytes (4kiB on Linux) in a log line are not guaranteed to be
atomic.

I have been thinking of how to work around that limit, by logging to an
Apache Flume Thrift RPC port.  But to avoid the 4kiB limit you'd have to
put the Thrift client inside the httpd server.  You'd really want a
dedicated thread in each child process for the Thrift client, so that it
can accumulate a batch of messages for the Flume server; otherwise, you'd
be putting a lot of transaction load on Apache Flume.  So the httpd threads
would put their log lines on a synchronized queue consumed by the logging
thread.  Each child process would have its own TCP connection to Apache
Flume, so there's none of the atomicity issues you have with >PIPE_BUF
messages through a pipe.

It'd also be nice if Apache Thrift had a c_apr target in addition to the
c_glib target, so that you don't have to bring in glib to use it.

Anyway, it seemed to be some work to implement, so I think I will just live
with the limit of 4kiB atomic log entries.
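For reference, a small sketch (editor's addition) of the PIPE_BUF
constraint discussed above: only a single write() of at most PIPE_BUF
bytes is atomic with respect to other writers of the same pipe.

#include <limits.h>
#include <unistd.h>

static ssize_t log_line(int pipefd, const char *line, size_t len)
{
    /* Larger writes may interleave with other children's log lines;
     * a real fix routes them through one writer (e.g. a Flume client
     * thread) instead of truncating like this sketch does. */
    if (len > PIPE_BUF)
        len = PIPE_BUF;
    return write(pipefd, line, len);
}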

Here is the Flume Thrift protocol:

namespace java org.apache.flume.thrift

struct ThriftFlumeEvent {
  1: required map<string, string> headers,
  2: required binary body,
}

enum Status {
  OK,
  FAILED,
  ERROR,
  UNKNOWN
}

service ThriftSourceProtocol {
  Status append(1: ThriftFlumeEvent event),
  Status appendBatch(1: list<ThriftFlumeEvent> events),
}


On Mon, Apr 18, 2016 at 4:37 PM, William A Rowe Jr 
wrote:

> That doesn't work because it's no longer atomic, those log lines end up
> fanned out among all the other worker threads doing the same thing.  Very
> hard to parse.
>
> On Mon, Apr 18, 2016 at 3:08 PM, Stefan Fritsch  wrote:
>
>> On Monday 18 April 2016 10:42:23, Eric Covener wrote:
>> > I'm won over and not tinkering with this FWIW.
>>
>> Maybe add a utility function that splits a message on \n and calls
>> ap_log_rerror for each line?
>>
>
>


Re: allow newlines in T_ESCAPE_LOGITEM?

2016-04-18 Thread Daniel Lescohier
Doing this in the httpd server would prevent the use-case of
programmatically parsing the log files.  I think the data in the file
should support both use-cases:

   - Parsing the log file data for data analysis.
   - Humans reading the logs.

So, the raw data should support both use-cases.

For a human reading the logs, that should be handled by an application that
parses the raw data and presents it in human readable format.  Here is a
simple application:

#!/usr/bin/python
import sys, fileinput
write = sys.stdout.write
separator = '-'*78 + '\n'
for line in fileinput.input():
    lines = line.decode("string_escape")
    write(lines)
    if lines.count('\n') > 1:
        write(separator)

You could also have an application with much more functionality,
color-coding fields, using underline to separate records instead of dashes,
whatever you want.

The above application may not meet exactly what the OP wants: but that's my
point: each user's use-case is different, and the raw data should support
all the use-cases.



On Mon, Apr 18, 2016 at 8:28 AM, Stefan Eissing <
stefan.eiss...@greenbytes.de> wrote:

> Would it make sense to replace \n with \n\t?
>
> > Am 18.04.2016 um 14:23 schrieb Jim Jagielski :
> >
> > No opinion :)
> >
> >> On Apr 13, 2016, at 3:43 PM, Eric Covener  wrote:
> >>
> >> Currently newlines get backslash-escaped if written to the errorlog.
> >> This is via server/gen_test_char.c and stems from an ancient vuln
> >> about escape sequences in log files potentially affecting peoples
> >> terminals when cat'ed.
> >>
> >> On a few occasions I have worked with some libraries that return a
> >> newline-formatted buffer that I'd love to dump to the error log
> >> without scanning and splitting up into multiple log entries.
> >>
> >> I also occasionally see \n's from third-party mods once in a while.
> >>
> >> Any opinions on punching a hole for \n in T_ESCAPE_LOGITEM?
> >
>
>


Re: allow newlines in T_ESCAPE_LOGITEM?

2016-04-13 Thread Daniel Lescohier
It's especially important when doing log processing on Apache Hadoop: if
you give uncompressed text files as input to a Hadoop job, it splits large
log files on newlines so they can be processed on multiple nodes.  That
split should be done on a record boundary.

On Wed, Apr 13, 2016 at 5:16 PM, Yann Ylavic <ylavic@gmail.com> wrote:

> On Wed, Apr 13, 2016 at 11:08 PM, Eric Covener <cove...@gmail.com> wrote:
> > On Wed, Apr 13, 2016 at 5:05 PM, Daniel Lescohier
> > <daniel.lescoh...@cbsi.com> wrote:
> >> Isn't T_ESCAPE_LOGITEM also used by mod_log_config's use of
> >> ap_escape_logitem?  We rely on the API that data from HTTP requests
> that are
> >> logged in our mod_log_config logfiles are newline-escaped, so that one
> line
> >> in the logfile is parsed as one log entry.  Our parsers first split on
> >> newline to get records, then splits the fields of the record on the
> field
> >> delimiter to get fields, then it unescapes the backslash-escapes to get
> the
> >> original data for that field.
> >
> > You make a good point, it couldn't change and affect current callers
> > of ap_escape_logitem().
>
> IMHO, even ErrorLog shouldn't contain split lines (w/o "[date]
> [level] [pid]" prefix).
>


Re: allow newlines in T_ESCAPE_LOGITEM?

2016-04-13 Thread Daniel Lescohier
Isn't T_ESCAPE_LOGITEM also used by mod_log_config's use of
ap_escape_logitem?  We rely on the API guarantee that data from HTTP
requests logged in our mod_log_config logfiles is newline-escaped, so that
one line in the logfile is parsed as one log entry.  Our parsers first
split on newline to get records, then split each record on the field
delimiter to get fields, then unescape the backslash-escapes to get the
original data for each field.
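A sketch of that parse order (editor's addition; assumes a plain field
delimiter, no quoted fields, and accepts that strtok_r skips empty fields):

#include <stdio.h>
#include <string.h>

static void parse_log(char *buf, const char *fieldsep)
{
    char *rsave = NULL, *rec;
    /* 1st stage: split on newline to get records */
    for (rec = strtok_r(buf, "\n", &rsave); rec;
         rec = strtok_r(NULL, "\n", &rsave)) {
        char *fsave = NULL, *field;
        /* 2nd stage: split the record on the field delimiter */
        for (field = strtok_r(rec, fieldsep, &fsave); field;
             field = strtok_r(NULL, fieldsep, &fsave)) {
            /* 3rd stage: undo the backslash-escapes here to recover
             * the original request data for the field */
            puts(field);
        }
    }
}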

On Wed, Apr 13, 2016 at 3:43 PM, Eric Covener  wrote:

> Currently newlines get backslash-escaped if written to the errorlog.
> This is via server/gen_test_char.c and stems from an ancient vuln
> about escape sequences in log files potentially affecting peoples
> terminals when cat'ed.
>
> On a few occasions I have worked with some libraries that return a
> newline-formatted buffer that I'd love to dump to the error log
> without scanning and splitting up into multiple log entries.
>
> I also occasionally see \n's from third-party mods once in a while.
>
> Any opinions on punching a hole for \n in T_ESCAPE_LOGITEM?
>


Re: Do pools lead to bad programming?

2013-12-13 Thread Daniel Lescohier
The format string is a multiplier of length.  4k repeated %Y elements in
the 'a' variable's format string is 8kbytes of format string, but the
result would take 16kbytes.  Then you have things like %B, the full month
name according to the current locale: you cannot really predict the length
without actually doing the strftime.

I figured 256 bytes is more than reasonable.  Normal time strings are < 32
bytes; double that, and quadruple that again (e.g. if there are utf-8 chars
in the string), and you get 256 bytes.  If people need more, one can use
%{formatstring1}t%{formatstring2}t in LogFormat.


On Fri, Dec 13, 2013 at 10:14 AM, Yann Ylavic ylavic@gmail.com wrote:

 On Fri, Dec 13, 2013 at 5:06 AM, Daniel Lescohier 
 daniel.lescoh...@cbsi.com wrote:

 Here is my draft replacement:

 static const char *log_request_time_custom(request_rec *r, char *a,
                                            apr_time_exp_t *xt)
 {
     static const apr_size_t buf_len = 256;
     apr_size_t tstr_len;
     char *tstr = apr_palloc(r->pool, buf_len);

     apr_strftime(tstr, &tstr_len, buf_len, a, xt);
     return tstr_len ? tstr : "-";
 }


 Shouldn't [MAX_STRING_LEN > buf_len > strlen(a)] for apr_strftime() not to
 truncate the result (for any reasonable 'a')?




Re: Do pools lead to bad programming?

2013-12-12 Thread Daniel Lescohier
We can also save stack space by changing:
char server_portstr[32];
to:
char server_portstr[6];
Or if we want to future-proof against the small possibility of a new TCP
standard that has larger port numbers and negative port numbers:
char server_portstr[sizeof(apr_port_t)*241/100+3]; /* log10(256) is
2.408 */

As part of my time caching work, I'm fixing a memory buffer issue in
mod_log_config.c.  Here is the current code:

static const char *log_request_time_custom(request_rec *r, char *a,
                                           apr_time_exp_t *xt)
{
    apr_size_t retcode;
    char tstr[MAX_STRING_LEN];
    apr_strftime(tstr, &retcode, sizeof(tstr), a, xt);
    return apr_pstrdup(r->pool, tstr);
}

This function needs the string on the heap, because the string is returned
to the caller.  By first using stack memory, one has to add a memory copy
before returning.  Also, this expands the stack by 8KiB, and stack might be
a limited resource on some OSes.  The final problem is that the retcode
isn't checked: if the retcode is 0, the content of the buffer is undefined,
and shouldn't be used; in theory, even 8KiB might not be a big enough
buffer.

Here is my draft replacement:

static const char *log_request_time_custom(request_rec *r, char *a,
                                           apr_time_exp_t *xt)
{
    static const apr_size_t buf_len = 256;
    apr_size_t tstr_len;
    char *tstr = apr_palloc(r->pool, buf_len);
    apr_strftime(tstr, &tstr_len, buf_len, a, xt);
    return tstr_len ? tstr : "-";
}



On Thu, Dec 12, 2013 at 7:37 AM, Yann Ylavic ylavic@gmail.com wrote:

 On Thu, Dec 12, 2013 at 1:54 AM, Kean Johnston kean.johns...@gmail.com wrote:

 I'd love to see these things fixed, because they add up. If you post them
 here they are likely to be reviewed very quickly, as they'll no doubt be
 simple to review.

 Cool. Here's a patch for the case I just mentioned. It also eliminates an
 un-needed automatic (yes, I obsess about stack frames too). diff against
 trunk.


 Note that in this particular case (ie. uri = apr_palloc(p,
 sizeof(*uri)) used in proxy_(ajp|fcgi|http|scgi|wstunnel)_handler), the
 local uri would better be declared on the stack (ie. apr_uri_t uri; and
 &uri used accordingly), so as to avoid the allocation completely.

 Regards,
 Yann.




Re: time caching in util_time.c and mod_log_config.c

2013-12-04 Thread Daniel Lescohier
On Wed, Dec 4, 2013 at 7:47 AM, Jim Jagielski j...@jagunet.com wrote:

 Also, IMO, the default should be non-portable
 atomics.


Yes: that is the purpose of APR: having a portable API on top of
non-portable implementations for each platform.


Re: time caching in util_time.c and mod_log_config.c

2013-12-04 Thread Daniel Lescohier
So it sounds like I should go ahead and work on an implementation of the
time caching using apr_atomic_cas32 and apr_atomic_dec32.  This won't be an
issue for RHEL/CentOS/etc., because they're using old versions of httpd.
We can put something in the release notes saying that for 32-bit i486,
i586, i686 builds, you should build APR with --enable-nonportable-atomics,
as info for distro maintainers when they package the new version.



On Wed, Dec 4, 2013 at 7:47 AM, Jim Jagielski j...@jagunet.com wrote:

 Adding APR dev list:

 IMO, httpd should expect APR to do the right thing. If APR
 isn't doing that, then it's an APR bug and needs to be fixed/
 addressed within APR.

 All this implies that the atomics code in APR needs a serious
 review and update.

 We should also look into leveraging what we can from stdcxx
 (https://svn.apache.org/repos/asf/stdcxx/trunk/src) as well
 as OPA (https://trac.mcs.anl.gov/projects/openpa/wiki/FAQ).

 Also, IMO, the default should be non-portable
 atomics.

 On Dec 3, 2013, at 7:41 PM, Daniel Lescohier daniel.lescoh...@cbsi.com
 wrote:

  So I think we should reach a consensus on what approach to take.  My
 goal was to implement an algorithm that is correct, with code that is easy
 to maintain.  I think using the apr_atomics functions meets those goals the
 best.  The downside are for those systems that are running 32-bit i486,
 i586, i686 systems where the default APR configure setting was not
 overridden for atomics.  There may be i686 servers still out there using
 32-bit web server, probably memory-constrained systems like VPS hosts; the
 question is have they overridden the APR default configuration or not.
 
  Should we hold back on fixing this because of these systems?  If we go
 forward, should there be something in the release notes warning of this APR
 configuration issue?
 
 
 
  On Tue, Dec 3, 2013 at 7:15 PM, Daniel Lescohier 
 daniel.lescoh...@cbsi.com wrote:
  (continued, hit send too early)
 
  %ix86   i386 i486 i586 i686 pentium3 pentium4 athlon geode
 
  However, I looked at the CentOS 6 apr.spec, and it's not overriding the
 default.
 
 




Re: time caching in util_time.c and mod_log_config.c

2013-12-03 Thread Daniel Lescohier
I infer that you think that in this particular case, the function macro
makes the code more readable and maintainable.  I don't think one can
define an absolute rule, it's a judgment call whether a macro increases or
decreases clarity and maintainability.  It reminds me of 'goto': most of
the time, one shouldn't use it, but there are cases where it's better than
the alternative.


On Tue, Dec 3, 2013 at 10:01 AM, Jim Jagielski j...@jagunet.com wrote:

 At first blush, the below looks very workable! On a slightly
 off-topic topic, however, I wonder if macros are the way to
 go.  Long long ago function calls were really really expensive.
 Does the decreased clarity (and debugging capability) really
 offset the saved cycles? Agreed that it's not fully applicable
 below.

 On Dec 2, 2013, at 6:13 PM, Daniel Lescohier 
 daniel.lescoh...@cbsinteractive.com wrote:

  The current time caching implementation in util_time.c and
 mod_log_config.c is not guaranteed to work with all compilers and cpu
 architectures.  I have some proposed code I've written, that I want to get
 input from the mailing list on, before continuing on to compile and test it.
 
  The old implementation tries to do a wait-free algorithm, but
 incorrectly, because the algorithm relies on total memory ordering, and the
 implementation does not guarantee total memory ordering from the compiler
 or CPU.
 
  My proposal is to use the standard and portable apr_thread_mutex_trylock
 for enforcing the memory barriers.  For APR's Linux x86_64 implementation,
 this basically turns into a lock: cmpxchgl instruction, and a lock: decl
 instruction for the unlock.  Because only trylock is used, not lock, the
 futex is never spun and arbitrated on by the kernel, it's all done in
 userspace (if there's contention, each thread just calculates the value
 itself instead of reading from the cache).  So the replacement is also a
 wait-free algorithm, using the standard and portable apr_thread_mutex
 calls.  Also, the old algorithm does two memcpy operations when reading
 from the cache, while the new algorithm just does one.  For
 APR_HAS_THREADS==0 or a non-threaded mpm in use, no locking is done.
 
  The first part of the code is generic time caching code I've written in
 util_time_caching.h. I use these macros to create five different time
 caches:
 
  excerpt w/o boilerplate:
 
  ---
  #define TIME_CACHE(CACHE_T, CACHE_PTR, CACHE_SIZE_POWER)\
  static CACHE_T CACHE_PTR[1<<CACHE_SIZE_POWER];

  #define TIME_CACHE_FUNC_PTR(FUNC_PTR, VALUE_T, CALC_FUNC)\
  static apr_status_t (*FUNC_PTR)(VALUE_T *, apr_time_t) = CALC_FUNC;

  /* TIME_CACHE_FUNCTION macro:
   * The cache is implemented as a ring buffer. The key in the
   * cache element indicates which second the cache value is for.
   * We use a wait-free algorithm: if we cannot access the cache,
   * we just calculate the value, doing useful work, instead of
   * spinning trying to access the cache.
   * We only update the cache with newer times, because older times
   * should have a lower cache hit ratio, and we want to keep the
   * caches small to fit in the CPU L1/L2 caches.
   */

  #define TIME_CACHE_FUNCTION(FUNC_NAME, VALUE_T, VALUE_SIZE, \
      CACHE_T, CACHE_PTR, CACHE_SIZE_POWER, \
      CALC_FUNC, TRY_LOCK, RELEASE_LOCK, AFTER_READ_WORK)\
  static apr_status_t FUNC_NAME(VALUE_T *value, apr_time_t t)\
  {\
      const apr_int64_t seconds = apr_time_sec(t);\
      apr_status_t status;\
      CACHE_T * const cache_element = \
          &(CACHE_PTR[seconds & ((1<<CACHE_SIZE_POWER)-1)]);\
      /* seconds==0 can be confused with uninitialized cache; don't use
  cache */\
      if (seconds==0) return CALC_FUNC(value, t);\
      if (TRY_LOCK) {\
          if (seconds == cache_element->key) {\
              memcpy(value, &cache_element->value, VALUE_SIZE);\
              RELEASE_LOCK;\
              AFTER_READ_WORK;\
              return APR_SUCCESS;\
          }\
          if (seconds < cache_element->key) {\
              RELEASE_LOCK;\
              return CALC_FUNC(value, t);\
          }\
          RELEASE_LOCK;\
      }\
      status = CALC_FUNC(value, t);\
      if (status == APR_SUCCESS) {\
          if (TRY_LOCK) {\
              if (seconds > cache_element->key) {\
                  cache_element->key = seconds;\
                  memcpy(&cache_element->value, value, VALUE_SIZE);\
              }\
              RELEASE_LOCK;\
          }\
      }\
      return status;\
  }
  --
 
  Here is an example of implementing one of the caches in util_time.c:
 
  --
  /* Have small sized caches, to stay in CPU's L1/L2 caches.
   * There should be few requests older than 3 seconds, so set
   * cache size power to 2.
   */
  #define TIME_CACHE_SIZE_POWER 2
 
  typedef struct {
  apr_int64_t key;
  apr_time_exp_t value;
  } explode_time_cache_t;
 
  TIME_CACHE(explode_time_cache_t

Re: time caching in util_time.c and mod_log_config.c

2013-12-03 Thread Daniel Lescohier
If the developers list is OK using apr_atomic in the server core, there
would be lots of advantages over trylock:

   1. No need for child init.
   2. No need for function pointers.
   3. Could have a lock per cache element (I deemed it too expensive
   memory-wise to have a large mutex structure per cache element).
   4. It would avoid the problem of trylock not being implemented on all
   platforms.
   5. Fewer parameters to the function macro.

The code would be like this:

#define TIME_CACHE_FUNCTION(VALUE_SIZE, CACHE_T, CACHE_PTR, CACHE_SIZE_POWER,\
    CALC_FUNC, AFTER_READ_WORK\
)\
    const apr_int64_t seconds = apr_time_sec(t);\
    apr_status_t status;\
    CACHE_T * const cache_element = \
        &(CACHE_PTR[seconds & ((1<<CACHE_SIZE_POWER)-1)]);\
    /* seconds==0 can be confused with uninitialized cache; don't use cache
*/\
    if (seconds==0) return CALC_FUNC(value, t);\
    if (apr_atomic_cas32(&cache_element->lock, 1, 0)==0) {\
        if (seconds == cache_element->key) {\
            memcpy(value, &cache_element->value, VALUE_SIZE);\
            apr_atomic_dec32(&cache_element->lock);\
            AFTER_READ_WORK;\
            return APR_SUCCESS;\
        }\
        if (seconds < cache_element->key) {\
            apr_atomic_dec32(&cache_element->lock);\
            return CALC_FUNC(value, t);\
        }\
        apr_atomic_dec32(&cache_element->lock);\
    }\
    status = CALC_FUNC(value, t);\
    if (status == APR_SUCCESS) {\
        if (apr_atomic_cas32(&cache_element->lock, 1, 0)==0) {\
            if (seconds > cache_element->key) {\
                cache_element->key = seconds;\
                memcpy(&cache_element->value, value, VALUE_SIZE);\
            }\
            apr_atomic_dec32(&cache_element->lock);\
        }\
    }\
    return status;

--

typedef struct {
    apr_int64_t key;
    apr_uint32_t lock;
    apr_time_exp_t value;
} explode_time_cache_t;

TIME_CACHE(explode_time_cache_t, explode_time_lt_cache,
           TIME_CACHE_SIZE_POWER)

AP_DECLARE(apr_status_t) ap_explode_recent_localtime(
    apr_time_exp_t *value, apr_time_t t)
{
    TIME_CACHE_FUNCTION(
        sizeof(apr_time_exp_t), explode_time_cache_t, explode_time_lt_cache,
        TIME_CACHE_SIZE_POWER, apr_time_exp_lt,
        value->tm_usec = (apr_int32_t) apr_time_usec(t))
}



On Tue, Dec 3, 2013 at 10:53 AM, Daniel Lescohier daniel.lescoh...@cbsi.com
 wrote:

 I infer that you think that in this particular case, the function macro
 makes the code more readable and maintainable.  I don't think one can
 define an absolute rule, it's a judgment call whether a macro increases or
 decreases clarity and maintainability.  It reminds me of 'goto': most of
 the time, one shouldn't use it, but there are cases where it's better than
 the alternative.


 On Tue, Dec 3, 2013 at 10:01 AM, Jim Jagielski j...@jagunet.com wrote:

 At first blush, the below looks very workable! On a slightly
 off-topic topic, however, I wonder if macros are the way to
 go.  Long long ago function calls were really really expensive.
 Does the decreased clarity (and debugging capability) really
 offset the saved cycles? Agreed that it's not fully applicable
 below.

 On Dec 2, 2013, at 6:13 PM, Daniel Lescohier 
 daniel.lescoh...@cbsinteractive.com wrote:

  The current time caching implementation in util_time.c and
 mod_log_config.c is not guaranteed to work with all compilers and cpu
 architectures.  I have some proposed code I've written, that I want to get
 input from the mailing list on, before continuing on to compile and test it.
 
  The old implementation tries to do a wait-free algorithm, but
 incorrectly, because the algorithm relies on total memory ordering, and the
 implementation does not guarantee total memory ordering from the compiler
 or CPU.
 
  My proposal is to use the standard and portable
 apr_thread_mutex_trylock for enforcing the memory barriers.  For APR's
 Linux x86_64 implementation, this basically turns into a lock: cmpxchgl
 instruction, and a lock: decl instruction for the unlock.  Because only
 trylock is used, not lock, the futex is never spun and arbitrated on by the
 kernel, it's all done in userspace (if there's contention, each thread just
 calculates the value itself instead of reading from the cache).  So the
 replacement is also a wait-free algorithm, using the standard and portable
 apr_thread_mutex calls.  Also, the old algorithm does two memcpy operations
 when reading from the cache, while the new algorithm just does one.  For
 APR_HAS_THREADS==0 or a non-threaded mpm in use, no locking is done.
 
  The first part of the code is generic time caching code I've written in
 util_time_caching.h. I use these macros to create five different time
 caches:
 
  excerpt w/o boilerplate:
 
  ---
  #define TIME_CACHE(CACHE_T, CACHE_PTR, CACHE_SIZE_POWER)\
  static CACHE_T CACHE_PTR[1CACHE_SIZE_POWER];
 
  #define TIME_CACHE_FUNC_PTR(FUNC_PTR, VALUE_T, CALC_FUNC

Re: time caching in util_time.c and mod_log_config.c

2013-12-03 Thread Daniel Lescohier
I took a look at apr's configure.in, and its default for all architectures
except i486, i586, and i686 is to use the real atomic ops; for those three
architectures the default is to use the generic atomic ops.  Any idea why
there is a special rule for those three architectures?  There's nothing
wrong with the atomic operations on those three architectures: otherwise,
how have we had semaphores and mutexes for all these years on those CPUs?
I guess that is a question for the APR dev mailing list.

I see that some distros override that default.  E.g., the libapr1.spec for
openSUSE has:

%ifarch %ix86
--enable-nonportable-atomics=yes \
%endif

and in /usr/lib/rpm/macros:


On Tue, Dec 3, 2013 at 12:54 PM, Yann Ylavic ylavic@gmail.com wrote:

 I personally like this solution better (IMHO) since it does not rely on
 apr_thread_mutex_trylock() to be wait-free/userspace (eg. natively
 implements the compare and swap).

 On the other hand, apr_atomic_cas32() may itself be implemented using
 apr_thread_mutex_lock() when USE_ATOMICS_GENERIC is defined (explicitly, or
 with --enable-nonportable-atomics=no, or else forcibly with gcc -std=c89
 or intel cpus <= i686).

 Hence with USE_ATOMICS_GENERIC, apr_thread_mutex_trylock() may be a better
 solution than the apr_thread_mutex_lock()...



 On Tue, Dec 3, 2013 at 6:01 PM, Daniel Lescohier 
 daniel.lescoh...@cbsi.com wrote:

 If the developers list is OK using apr_atomic in the server core, there
 would be lots of advantages over trylock:

1. No need for child init.
2. No need for function pointers.
3. Could have a lock per cache element (I deemed it too expensive
memory-wise to have a large mutex structure per cache element).
4. It would avoid the problem of trylock not being implemented on all
platforms.
5. Fewer parameters to the function macro.

 The code would be like this:

 #define TIME_CACHE_FUNCTION(VALUE_SIZE, CACHE_T, CACHE_PTR,
 CACHE_SIZE_POWER,\
     CALC_FUNC, AFTER_READ_WORK\
 )\
     const apr_int64_t seconds = apr_time_sec(t);\
     apr_status_t status;\
     CACHE_T * const cache_element = \
         &(CACHE_PTR[seconds & ((1<<CACHE_SIZE_POWER)-1)]);\
     /* seconds==0 can be confused with uninitialized cache; don't use cache
 */\
     if (seconds==0) return CALC_FUNC(value, t);\
     if (apr_atomic_cas32(&cache_element->lock, 1, 0)==0) {\
         if (seconds == cache_element->key) {\
             memcpy(value, &cache_element->value, VALUE_SIZE);\
             apr_atomic_dec32(&cache_element->lock);\
             AFTER_READ_WORK;\
             return APR_SUCCESS;\
         }\
         if (seconds < cache_element->key) {\
             apr_atomic_dec32(&cache_element->lock);\
             return CALC_FUNC(value, t);\
         }\
         apr_atomic_dec32(&cache_element->lock);\
     }\
     status = CALC_FUNC(value, t);\
     if (status == APR_SUCCESS) {\
         if (apr_atomic_cas32(&cache_element->lock, 1, 0)==0) {\
             if (seconds > cache_element->key) {\
                 cache_element->key = seconds;\
                 memcpy(&cache_element->value, value, VALUE_SIZE);\
             }\
             apr_atomic_dec32(&cache_element->lock);\
         }\
     }\
     return status;

 --

 typedef struct {
     apr_int64_t key;
     apr_uint32_t lock;
     apr_time_exp_t value;
 } explode_time_cache_t;

 TIME_CACHE(explode_time_cache_t, explode_time_lt_cache,
            TIME_CACHE_SIZE_POWER)

 AP_DECLARE(apr_status_t) ap_explode_recent_localtime(
     apr_time_exp_t *value, apr_time_t t)
 {
     TIME_CACHE_FUNCTION(
         sizeof(apr_time_exp_t), explode_time_cache_t, explode_time_lt_cache,
         TIME_CACHE_SIZE_POWER, apr_time_exp_lt,
         value->tm_usec = (apr_int32_t) apr_time_usec(t))
 }




Re: time caching in util_time.c and mod_log_config.c

2013-12-03 Thread Daniel Lescohier
(continued, hit send too early)

%ix86   i386 i486 i586 i686 pentium3 pentium4 athlon geode

However, I looked at the CentOS 6 apr.spec, and it's not overriding the
default.


Re: time caching in util_time.c and mod_log_config.c

2013-12-03 Thread Daniel Lescohier
So I think we should reach a consensus on what approach to take.  My goal
was to implement an algorithm that is correct, with code that is easy to
maintain.  I think using the apr_atomics functions meets those goals the
best.  The downside is for systems running 32-bit i486, i586, i686 builds
where the default APR configure setting for atomics was not overridden.
There may be i686 servers still out there running a 32-bit web server,
probably memory-constrained systems like VPS hosts; the question is whether
they have overridden the APR default configuration or not.

Should we hold back on fixing this because of these systems?  If we go
forward, should there be something in the release notes warning of this APR
configuration issue?



On Tue, Dec 3, 2013 at 7:15 PM, Daniel Lescohier
daniel.lescoh...@cbsi.com wrote:

 (continued, hit send too early)

 %ix86   i386 i486 i586 i686 pentium3 pentium4 athlon geode

 However, I looked at the CentOS 6 apr.spec, and it's not overriding the
 default.




time caching in util_time.c and mod_log_config.c

2013-12-02 Thread Daniel Lescohier
The current time caching implementation in util_time.c and mod_log_config.c
is not guaranteed to work with all compilers and cpu architectures.  I have
some proposed code I've written, that I want to get input from the mailing
list on, before continuing on to compile and test it.

The old implementation tries to do a wait-free algorithm, but incorrectly,
because the algorithm relies on total memory ordering, and the
implementation does not guarantee total memory ordering from the compiler
or CPU.

My proposal is to use the standard and portable apr_thread_mutex_trylock
for enforcing the memory barriers.  For APR's Linux x86_64 implementation,
this basically turns into a lock: cmpxchgl instruction, and a lock: decl
instruction for the unlock.  Because only trylock is used, not lock, the
futex is never spun and arbitrated on by the kernel, it's all done in
userspace (if there's contention, each thread just calculates the value
itself instead of reading from the cache).  So the replacement is also a
wait-free algorithm, using the standard and portable apr_thread_mutex
calls.  Also, the old algorithm does two memcpy operations when reading
from the cache, while the new algorithm just does one.  For
APR_HAS_THREADS==0 or a non-threaded mpm in use, no locking is done.

The first part of the code is generic time caching code I've written in
util_time_caching.h. I use these macros to create five different time
caches:

excerpt w/o boilerplate:

---
#define TIME_CACHE(CACHE_T, CACHE_PTR, CACHE_SIZE_POWER)\
static CACHE_T CACHE_PTR[1<<CACHE_SIZE_POWER];

#define TIME_CACHE_FUNC_PTR(FUNC_PTR, VALUE_T, CALC_FUNC)\
static apr_status_t (*FUNC_PTR)(VALUE_T *, apr_time_t) = CALC_FUNC;

/* TIME_CACHE_FUNCTION macro:
 * The cache is implemented as a ring buffer. The key in the
 * cache element indicates which second the cache value is for.
 * We use a wait-free algorithm: if we cannot access the cache,
 * we just calculate the value, doing useful work, instead of
 * spinning trying to access the cache.
 * We only update the cache with newer times, because older times
 * should have a lower cache hit ratio, and we want to keep the
 * caches small to fit in the CPU L1/L2 caches.
 */

#define TIME_CACHE_FUNCTION(FUNC_NAME, VALUE_T, VALUE_SIZE, \
    CACHE_T, CACHE_PTR, CACHE_SIZE_POWER, \
    CALC_FUNC, TRY_LOCK, RELEASE_LOCK, AFTER_READ_WORK)\
static apr_status_t FUNC_NAME(VALUE_T *value, apr_time_t t)\
{\
    const apr_int64_t seconds = apr_time_sec(t);\
    apr_status_t status;\
    CACHE_T * const cache_element = \
        &(CACHE_PTR[seconds & ((1<<CACHE_SIZE_POWER)-1)]);\
    /* seconds==0 can be confused with uninitialized cache; don't use cache
*/\
    if (seconds==0) return CALC_FUNC(value, t);\
    if (TRY_LOCK) {\
        if (seconds == cache_element->key) {\
            memcpy(value, &cache_element->value, VALUE_SIZE);\
            RELEASE_LOCK;\
            AFTER_READ_WORK;\
            return APR_SUCCESS;\
        }\
        if (seconds < cache_element->key) {\
            RELEASE_LOCK;\
            return CALC_FUNC(value, t);\
        }\
        RELEASE_LOCK;\
    }\
    status = CALC_FUNC(value, t);\
    if (status == APR_SUCCESS) {\
        if (TRY_LOCK) {\
            if (seconds > cache_element->key) {\
                cache_element->key = seconds;\
                memcpy(&cache_element->value, value, VALUE_SIZE);\
            }\
            RELEASE_LOCK;\
        }\
    }\
    return status;\
}
--

Here is an example of implementing one of the caches in util_time.c:

--
/* Have small sized caches, to stay in CPU's L1/L2 caches.
 * There should be few requests older than 3 seconds, so set
 * cache size power to 2.
 */
#define TIME_CACHE_SIZE_POWER 2

typedef struct {
    apr_int64_t key;
    apr_time_exp_t value;
} explode_time_cache_t;

TIME_CACHE(explode_time_cache_t, explode_time_lt_cache,
           TIME_CACHE_SIZE_POWER)
TIME_CACHE_FUNC_PTR(explode_time_lt_ptr, apr_time_exp_t,
                    apr_time_exp_lt)
TIME_CACHE_FUNCTION(
    explode_time_lt_nolocking, apr_time_exp_t, sizeof(apr_time_exp_t),
    explode_time_cache_t, explode_time_lt_cache, TIME_CACHE_SIZE_POWER,
    apr_time_exp_lt, 1,,
    value->tm_usec = (apr_int32_t) apr_time_usec(t))
#if APR_HAS_THREADS
static apr_thread_mutex_t *explode_time_lt_lock;
TIME_CACHE_FUNCTION(
    explode_time_lt_locking, apr_time_exp_t, sizeof(apr_time_exp_t),
    explode_time_cache_t, explode_time_lt_cache, TIME_CACHE_SIZE_POWER,
    apr_time_exp_lt,
    apr_thread_mutex_trylock(explode_time_lt_lock)==APR_SUCCESS,
    apr_thread_mutex_unlock(explode_time_lt_lock),
    value->tm_usec = (apr_int32_t) apr_time_usec(t))
#endif

AP_DECLARE(apr_status_t) ap_explode_recent_localtime(apr_time_exp_t *tm,
                                                     apr_time_t t)
{
    return explode_time_lt_ptr(tm, t);
}
-


Re: mod_autoindex string pluggability

2013-08-06 Thread Daniel Lescohier
output_directories seems to be HTML-specific code.  Why not implement a
brand-new output_directories_json, and have the index_directories()
function choose which function to call based on query args?  Either way,
index_directories() has to change to send the correct Content-Type header,
and to skip the emit_head and emit_tail function calls.
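A rough sketch (editor's addition; the emitter functions are hypothetical
stand-ins, not mod_autoindex code) of that dispatch:

#include "httpd.h"
#include "http_protocol.h"
#include <string.h>

/* hypothetical emitters, for illustration only */
static void output_directories_html(request_rec *r) { ap_rputs("<ul></ul>\n", r); }
static void output_directories_json(request_rec *r) { ap_rputs("[]\n", r); }

static void emit_index(request_rec *r)
{
    if (r->args && strstr(r->args, "format=json") != NULL) {
        ap_set_content_type(r, "application/json");
        output_directories_json(r);  /* no emit_head()/emit_tail() */
    }
    else {
        ap_set_content_type(r, "text/html");
        output_directories_html(r);
    }
}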


On Mon, Aug 5, 2013 at 2:26 AM, Sven Dowideit svendowid...@home.org.au wrote:

 Hello Everyone,

 I'm scratching an itch to make mod_autoindex output what I want, and
 would love to know what, if anything would make the changes merge-able.

 In its simplest form, I'd like apache to be able to give me an index in
 JSON format - previously, I've parsed the html in javascript, but
 somehow I think I can do better.

 While I was reading the code (today) it occurred to me that there are a
 lot of if statements throughout, which could be eliminated by moving
 (obscuring) the output strings to a switchable container (right now I'm
 using arrays of strings for my simplicity - I don't know this codebase
 any better than you know me :)

 so here is the kind of thing I started to do (partial diff, its all
 really kind of dull - I've extracted the HTML/XHTML strings into another
 similarly replaceable array):


 #define STRING_INDEX_START   0
 #define STRING_INDEX_END 1

 const char *table_index_string[] = {
     "<table>\n   <tr>",
     "  </table>\n"
 };

 const char *table_index_string_stylesheet[] = {
     "<table id=\"indexlist\">\n   <tr class=\"indexhead\">",
     "  </table>\n"
 };

 const char *fancy_index_string[] = {
     "  <pre>",
     "  </pre>\n"
 };

 const char *default_index_string[] = {
     "  <ul>",
     "  </ul>\n"
 };

 /* set the default string set (choose alternatives based on user conf
  * options) */
 const char **index_string = default_index_string;

 @@ -1873,23 +1872,14 @@ static void output_directories(struct ent **ar,
 int n,
          }
      }
      if (autoindex_opts & TABLE_INDEXING) {
 -        ap_rvputs(r, breakrow, "</table>\n", NULL);
 +        ap_rputs(breakrow, r);
      }
      else if (autoindex_opts & FANCY_INDEXING) {
          if (!(autoindex_opts & SUPPRESS_RULES)) {
 -            ap_rputs("<hr", r);
 -            if (autoindex_opts & EMIT_XHTML) {
 -                ap_rputs(" /", r);
 -            }
 -            ap_rputs("</pre>\n", r);
 -        }
 -        else {
 -            ap_rputs("</pre>\n", r);
 +            ap_rputs(output_string[STRING_HR], r);
          }
      }
 -    else {
 -        ap_rputs("</ul>\n", r);
 -    }
 +    ap_rputs(index_string[STRING_INDEX_END], r);
  }

 Cheers
 Sven




Re: Decrypting mod_session-created cookie

2013-07-09 Thread Daniel Lescohier
Looking at the code, the encoded data block does not have a version number
field in it, so the data format and general algorithm cannot be easily
changed in the future.

The encoded data seems to be comprised of:

base64 encode of:

   - salt (fixed length of sizeof(apr_uuid_t))
   - the Initialization Vector
   - the encrypted data

But to understand the algorithm, one would have to document what all the
apr_crypto_* functions do in the module's encrypt/decrypt functions.  For
example, the apr_crypto_passphrase takes the passphrase and the salt as
parameters (along with other parameters such as block chaining mode [which
mod_session_crypto has set fixed to CBC], number of iterations [which
mod_session_crypto has set fixed to 4096 iterations]).  But what is done
with it when it's passed to APR library?  One would have to look at the
implementation of that function in order to find out.  Taking a quick look
at the apr source, for the openssl implementation, apr_crypto_passphrase is
using PKCS5_PBKDF2_HMAC_SHA1 to generate the key from the passphrase and
salt.

So, it doesn't look like it's a quick and easy thing to document or
re-implement.
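Still, based on the description above, a hedged sketch (editor's addition;
assumes OpenSSL 1.1+, a 16-byte salt per sizeof(apr_uuid_t), a 16-byte AES
block/IV, and input already base64-decoded) of what the decrypt side would
involve:

#include <string.h>
#include <openssl/evp.h>

static int decrypt_session(const unsigned char *blob, int bloblen,
                           const char *passphrase,
                           unsigned char *out, int *outlen)
{
    const unsigned char *salt = blob, *iv = blob + 16, *ct = blob + 32;
    int ctlen = bloblen - 32, n = 0, fin = 0, ok;
    unsigned char key[32];
    EVP_CIPHER_CTX *ctx;

    if (ctlen <= 0)
        return 0;
    /* key from passphrase+salt: PBKDF2-HMAC-SHA1, 4096 iterations */
    if (!PKCS5_PBKDF2_HMAC_SHA1(passphrase, (int)strlen(passphrase),
                                salt, 16, 4096, (int)sizeof(key), key))
        return 0;
    if ((ctx = EVP_CIPHER_CTX_new()) == NULL)
        return 0;
    ok = EVP_DecryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv)
         && EVP_DecryptUpdate(ctx, out, &n, ct, ctlen)
         && EVP_DecryptFinal_ex(ctx, out + n, &fin);
    EVP_CIPHER_CTX_free(ctx);
    *outlen = n + fin;
    return ok;
}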



On Tue, Jul 9, 2013 at 9:01 AM, André Malo n...@perlig.de wrote:

 On Tuesday 09 July 2013 14:51:49 Mikhail T. wrote:
  09.07.2013 08:31, Eric Covener wrote:
   I am not surprised that separately documenting is not a priority for
   anyone,  given it's implemented in httpd for anyone to see.
 
  Source code is not documentation, is it? In matters of encryption
  especially the methods chosen should be documented better -- not only
  for interoperability, but also to allow to better judge the strengths of
  the method (not that I personally am capable of the latter).

 I agree. And everybody is welcome to help fixing that issue.

 nd



Re: Decrypting mod_session-created cookie

2013-07-08 Thread Daniel Lescohier
https://httpd.apache.org/docs/2.4/mod/mod_session.html#sessionprivacy

The session will be automatically decrypted on load, and encrypted on save
by Apache, the underlying application using the session need have no
knowledge that encryption is taking place.


On Mon, Jul 8, 2013 at 6:58 PM, Mikhail T. mi+t...@aldan.algebra.com wrote:

  From PHP I need to be able to set and read the session cookies created by
 mod_session_crypto.

 Suppose, I know the SessionCryptoCipher (aes256, the default) and the
 SessionCryptoPassphrase, how can I decrypt the session-cookie? Its value is
 available to PHP as _REQUEST['session']...

 I have both openssl and mcrypt PHP-extensions available on the server, but
 the straightforward approach of

 $decrypted = openssl_decrypt($_REQUEST['session'], 'aes256', $password);

 is failing... Thank you! Yours,

 -mi




Re: Decrypting mod_session-created cookie

2013-07-08 Thread Daniel Lescohier
The mod_session_crypto.c adds a salt (from calling apr_uuid_get) to the
data when encrypting it.  Without a salt, the encryption wouldn't be that
strong.  Perhaps your decryption code isn't handling the salt?



On Mon, Jul 8, 2013 at 7:29 PM, Graham Leggett minf...@sharp.fm wrote:

 On 9 Jul 2013, at 00:11, Daniel Lescohier daniel.lescoh...@cbsi.com
 wrote:

 https://httpd.apache.org/docs/2.4/mod/mod_session.html#sessionprivacy

 The session will be automatically decrypted on load, and encrypted on
 save by Apache, the underlying application using the session need have no
 knowledge that encryption is taking place.


 See also the section on integrating with external applications.

 https://httpd.apache.org/docs/2.4/mod/mod_session.html#integration

 Regards,
 Graham
 --




Re: Decrypting mod_session-created cookie

2013-07-08 Thread Daniel Lescohier
You could perhaps also setup Apache as a reverse-proxy to the other
application, so Apache will decrypt it before proxying it to the other
application.



On Mon, Jul 8, 2013 at 7:33 PM, Daniel Lescohier
daniel.lescoh...@cbsi.com wrote:

 The mod_session_crypto.c adds a salt (from calling apr_uuid_get) to the
 data when encrypting it.  Without a salt, the encryption wouldn't be that
 strong.  Perhaps your decryption code isn't handling the salt?



 On Mon, Jul 8, 2013 at 7:29 PM, Graham Leggett minf...@sharp.fm wrote:

 On 9 Jul 2013, at 00:11, Daniel Lescohier daniel.lescoh...@cbsi.com
 wrote:

 https://httpd.apache.org/docs/2.4/mod/mod_session.html#sessionprivacy

 The session will be automatically decrypted on load, and encrypted on
 save by Apache, the underlying application using the session need have no
 knowledge that encryption is taking place.


 See also the section on integrating with external applications.

 https://httpd.apache.org/docs/2.4/mod/mod_session.html#integration

 Regards,
 Graham
 --





Re: [PATCH] mod_unique_id: use ap_random_insecure_bytes() to get unique ID

2013-07-07 Thread Daniel Lescohier
Another option:

typedef struct {
apr_uint32_t stamp;
apr_uint32_t counter;
apr_uint16_t stamp_fraction;
char root[ROOT_SIZE];
} unique_id_rec;

where ROOT_SIZE=8, and stamp_fraction is set on every request to
htons(r->request_time & 0xFFFF).
The child will be initialized with 12 bytes of random data (root +
counter).  If one is not as confident of having a good random source, then
it's better to also include the fractional second: there will be few
requests happening in the same 1/65,536th of a second of the wall clock; if
there are multiple requests in that fraction of a second, if the multiple
requests are in the same child process, then the atomic increment
(apr_atomic_inc32) of the counter alone will guarantee different ids;
otherwise, for those few requests in that fraction of a second in multiple
processes (on the same or different servers), the randomness initialized by
the child in root+counter will be used to make different ids.
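A sketch of the per-request step (editor's addition; the function name and
the child-init randomization of root/counter are assumed, not shown):

#include "httpd.h"
#include "apr_atomic.h"
#include <arpa/inet.h>

#define ROOT_SIZE 8

typedef struct {                 /* layout from the proposal above */
    apr_uint32_t stamp;
    apr_uint32_t counter;
    apr_uint16_t stamp_fraction;
    char root[ROOT_SIZE];
} unique_id_rec;

static unique_id_rec base;              /* root randomized at child init */
static volatile apr_uint32_t counter;   /* likewise randomized */

static unique_id_rec make_unique_id(request_rec *r)
{
    unique_id_rec id = base;
    id.stamp = htonl((apr_uint32_t)apr_time_sec(r->request_time));
    id.stamp_fraction = htons((apr_uint16_t)(r->request_time & 0xFFFF));
    id.counter = apr_atomic_inc32(&counter);  /* unique within the child */
    return id;
}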



On Sat, Jul 6, 2013 at 9:38 AM, Daniel Lescohier
daniel.lescoh...@cbsi.com wrote:

 Ah, I missed that.  I see it's in the doxygen docs for the random module.
 However, the sources aren't under random/, they're under misc/.  I was
 switching between the doxygen docs and looking at the sources; maybe when I
 was looking for it, I missed it because the sources were under misc/.

 For the standard module: you can keep the unique id size the same, but
 still have a root of 10 bytes, by getting rid of the thread_index:
 thread_index is very wasteful, using 32 bits, when only a few bits are used.

 Change:

 typedef struct {
 unsigned int stamp;
 unsigned int in_addr;
 unsigned int pid;
 unsigned short counter;
 unsigned int thread_index;
 } unique_id_rec;

 to:

 typedef struct {
 apr_uint32_t stamp;
 apr_uint32_t counter;
 char root[ROOT_SIZE];
 } unique_id_rec;

 Have the two ints first for alignment purposes, so there is no padding in
 the struct.

 With the counter field using apr_uint32_t, you can use apr_atomic_inc32()
 to do the increments.  There is less need of thread_index if you increment
 the counter atomically.

 The initialization of the counter with random data also gives you more
 process randomness than the 10 bytes of root.



 On Fri, Jul 5, 2013 at 8:04 PM, Stefan Fritsch s...@sfritsch.de wrote:

 On Wednesday 26 June 2013, Daniel Lescohier wrote:
  When I looked into the ap random functions, I didn't like the
  implementation, because I didn't see anywhere in the httpd codebase
  that entropy is periodically added to the entropy pool.  After
  reading the details of how the Linux entropy pool works
  (https://lwn
  .net/Articles/525204/), I decided to use /dev/urandom instead,
  since Linux is periodically adding entropy to it.  This code is
  not portable, but this was for a private Apache module that is
  only used on Linux.
 
  To preserve entropy on the web server machine, I also only generate
  a random number once per apache child, then increment an uint32
  portion of it for each unique id call.  I also have seconds and
  microseconds, so that's why I think it's OK to do increments from
  the random base, instead of generating a new random id on each
  request.

 The insecure in ap_random_insecure_bytes is there for a reason. But
 if you only use it once per process, anyway, it should be sufficient.
 The fact that several consumers (especially with multi-threaded mpms)
 pull from the same pool in undefined order adds some entropy, too.

 FWIW, there is apr_generate_random_bytes() which can do the reading of
 /dev/urandom for you.





Re: [PATCH] mod_unique_id: use ap_random_insecure_bytes() to get unique ID

2013-07-07 Thread Daniel Lescohier
Actually, for stamp_fraction, one should choose either:

r->request_time & 0xFFFF /* microseconds modulus 65536 */

apr_uint16_t ui16;
htons(ui16 = apr_time_usec(r->request_time) >> 4) /* fraction of a second
with denominator 62500 */



On Sun, Jul 7, 2013 at 7:32 AM, Daniel Lescohier
daniel.lescoh...@cbsi.com wrote:

 Another option:


 typedef struct {
 apr_uint32_t stamp;
 apr_uint32_t counter;
 apr_uint16_t stamp_fraction;
 char root[ROOT_SIZE];
 } unique_id_rec;

  where ROOT_SIZE=8, and stamp_fraction is set on every request to
  htons(r->request_time & 0xFFFF).

 The child will be initialized with 12 bytes of random data (root +
 counter).  If one is not as confident of having a good random source, then
 it's better to also include the fractional second: there will be few
 requests happening in the same 1/65,536th of a second of the wall clock; if
 there are multiple requests in that fraction of a second, if the multiple
 requests are in the same child process, then the atomic increment
 (apr_atomic_inc32) of the counter alone will guarantee different ids;
 otherwise, for those few requests in that fraction of a second in multiple
 processes (on the same or different servers), the randomness initialized by
 the child in root+counter will be used to make different ids.



 On Sat, Jul 6, 2013 at 9:38 AM, Daniel Lescohier 
 daniel.lescoh...@cbsi.com wrote:

 Ah, I missed that.  I see it's in the doxygen docs for the random
 module.  However, the sources aren't under random/, they're under misc/.  I
 was switching between the doxygen docs and looking at the sources; maybe
 when I was looking for it, I missed it because the sources were under misc/.

 For the standard module: you can keep the unique id size the same, but
 still have a root of 10 bytes, by getting rid of the thread_index:
 thread_index is very wasteful, using 32 bits, when only a few bits are used.

 Change:

 typedef struct {
 unsigned int stamp;
 unsigned int in_addr;
 unsigned int pid;
 unsigned short counter;
 unsigned int thread_index;
 } unique_id_rec;

 to:

 typedef struct {
 apr_uint32_t stamp;
 apr_uint32_t counter;
 char root[ROOT_SIZE];
 } unique_id_rec;

 Have the two ints first for alignment purposes, so there is no padding in
 the struct.

 With the counter field using apr_uint32_t, you can use apr_atomic_inc32()
 to do the increments.  There is less need of thread_index if you increment
 the counter atomically.

 The initialization of the counter with random data also gives you more
 process randomness than the 10 bytes of root.



 On Fri, Jul 5, 2013 at 8:04 PM, Stefan Fritsch s...@sfritsch.de wrote:

 On Wednesday 26 June 2013, Daniel Lescohier wrote:
  When I looked into the ap random functions, I didn't like the
  implementation, because I didn't see anywhere in the httpd codebase
  that entropy is periodically added to the entropy pool.  After
  reading the details of how the Linux entropy pool works
  (https://lwn.net/Articles/525204/), I decided to use /dev/urandom instead,
  since Linux is periodically adding entropy to it.  This code is
  not portable, but this was for a private Apache module that is
  only used on Linux.
 
  To preserve entropy on the web server machine, I also only generate
  a random number once per apache child, then increment an uint32
  portion of it for each unique id call.  I also have seconds and
  microseconds, so that's why I think it's OK to do increments from
  the random base, instead of generating a new random id on each
  request.

 The "insecure" in ap_random_insecure_bytes is there for a reason. But
 if you only use it once per process, anyway, it should be sufficient.
 The fact that several consumers (especially with multi-threaded mpms)
 pull from the same pool in undefined order adds some entropy, too.

 FWIW, there is apr_generate_random_bytes() which can do the reading of
 /dev/urandom for you.






Re: [PATCH] mod_unique_id: use ap_random_insecure_bytes() to get unique ID

2013-07-06 Thread Daniel Lescohier
Ah, I missed that.  I see it's in the doxygen docs for the random module.
However, the sources aren't under random/, they're under misc/.  I was
switching between the doxygen docs and looking at the sources; maybe when I
was looking for it, I missed it because the sources were under misc/.

For the standard module: you can keep the unique id size the same, but
still have a root of 10 bytes, by getting rid of the thread_index:
thread_index is very wasteful, using 32 bits, when only a few bits are used.

Change:

typedef struct {
unsigned int stamp;
unsigned int in_addr;
unsigned int pid;
unsigned short counter;
unsigned int thread_index;
} unique_id_rec;

to:

typedef struct {
apr_uint32_t stamp;
apr_uint32_t counter;
char root[ROOT_SIZE];
} unique_id_rec;

Have the two ints first for alignment purposes, so there is no padding in
the struct.

With the counter field using apr_uint32_t, you can use apr_atomic_inc32()
to do the increments.  There is less need of thread_index if you increment
the counter atomically.

The initialization of the counter with random data also gives you more
process randomness than the 10 bytes of root.
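
To make the counter idea concrete, here is a minimal sketch of filling
such a record per request (fill_unique_id and the per-child base are
hypothetical names, not the actual mod_unique_id code):

#include "apr_atomic.h"
#include "apr_time.h"
#include <arpa/inet.h>  /* htonl */
#include <string.h>

#define ROOT_SIZE 8

typedef struct {
    apr_uint32_t stamp;
    apr_uint32_t counter;
    char root[ROOT_SIZE];
} unique_id_rec;

/* Per-child state: in the child-init hook (not shown), root and the
 * counter's starting value would be filled with random bytes, e.g. via
 * apr_generate_random_bytes(). */
static unique_id_rec base;

static void fill_unique_id(unique_id_rec *id, apr_time_t request_time)
{
    id->stamp = htonl((apr_uint32_t)apr_time_sec(request_time));
    /* The atomic increment alone keeps ids from one child's threads
     * distinct, so no thread_index is needed. */
    id->counter = htonl(apr_atomic_inc32(&base.counter));
    memcpy(id->root, base.root, ROOT_SIZE);
}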



On Fri, Jul 5, 2013 at 8:04 PM, Stefan Fritsch s...@sfritsch.de wrote:

 On Wednesday 26 June 2013, Daniel Lescohier wrote:
  When I looked into the ap random functions, I didn't like the
  implementation, because I didn't see anywhere in the httpd codebase
  that entropy is periodically added to the entropy pool.  After
  reading the details of how the Linux entropy pool works
  (https://lwn.net/Articles/525204/), I decided to use /dev/urandom instead,
  since Linux is periodically adding entropy to it.  This code is
  not portable, but this was for a private Apache module that is
  only used on Linux.
 
  To preserve entropy on the web server machine, I also only generate
  a random number once per apache child, then increment an uint32
  portion of it for each unique id call.  I also have seconds and
  microseconds, so that's why I think it's OK to do increments from
  the random base, instead of generating a new random id on each
  request.

 The "insecure" in ap_random_insecure_bytes is there for a reason. But
 if you only use it once per process, anyway, it should be sufficient.
 The fact that several consumers (especially with multi-threaded mpms)
 pull from the same pool in undefined order adds some entropy, too.

 FWIW, there is apr_generate_random_bytes() which can do the reading of
 /dev/urandom for you.



Re: [PATCH] mod_unique_id: use ap_random_insecure_bytes() to get unique ID

2013-06-26 Thread Daniel Lescohier
We have an internal custom Apache module that generates UUIDs; it's using a
legacy format that we need for our application.  A few months ago, I was
upgrading it to Apache 2.4, and I worked on modifying the UUID generator
function, so that some of the bytes are random (including replacing the 4
byte IPv4 address element of the uuid that we used before with random data
instead).

When I looked into the ap random functions, I didn't like the
implementation, because I didn't see anywhere in the httpd codebase that
entropy is periodically added to the entropy pool.  After reading the
details of how the Linux entropy pool works
(https://lwn.net/Articles/525204/), I decided to use /dev/urandom instead,
since Linux
is periodically adding entropy to it.  This code is not portable, but this
was for a private Apache module that is only used on Linux.

To preserve entropy on the web server machine, I also only generate a
random number once per apache child, then increment an uint32 portion of it
for each unique id call.  I also have seconds and microseconds, so that's
why I think it's OK to do increments from the random base, instead of
generating a new random id on each request.

Our app requires network order time_t in the second 32-bit word, so that's
why the function is like below.  If our app didn't require time_t, then I
would not have used the high order bits of time_t, because it's fairly
constant: to have a more unique id, it would be better to have more random
data instead of that fairly constant data.

/**
* Make a new unique id as follows:
* 1. Define the id.
* 2. Base64-encode the id.
* @param r Apache's request record.
* @return The base64-encoded cookie string, allocated in the request pool.
*/
static char *make_id( request_rec *r )
{
#define RAW_COOKIE_SIZE (14)
#define WANT_BASE64_COOKIE_SIZE (19)
#define BASE64_ENCODE_LEN(len) (((len + 2) / 3 * 4) + 1)
#define MASK_BITS(n) ((1<<(n)) - 1)

    struct {
        apr_uint32_t i1; /* b31: 0, b30-20: random, b19-0: usec; network order */
        apr_uint32_t i2; /* seconds, network order */
        apr_uint32_t i3; /* random, host order */
        apr_uint16_t s1; /* random, host order */
    } raw_cookie;
    apr_uint32_t i, rnd;
    char * const b64_cookie = apr_palloc(
        r->pool, BASE64_ENCODE_LEN(RAW_COOKIE_SIZE));

    ap_assert(b64_cookie);
    if (unlikely(! rand_initialized)) { /* unlikely(): the usual
                                         * __builtin_expect branch hint */
        /* We delay initialization as late as possible in order
         * to collect more entropy in the random pool. */
        init_random(r);
    }

    rnd = apr_atomic_inc32(&rand_uints.i1);
    i = rnd & MASK_BITS(11);
    i <<= 20;
    i += apr_time_usec(r->request_time);
    raw_cookie.i1 = htonl(i);
    raw_cookie.s1 = rnd >> 11;
    raw_cookie.i2 = htonl(apr_time_sec(r->request_time));
    raw_cookie.i3 = rand_uints.i2;

    apr_base64_encode_binary(b64_cookie, (const void *)&raw_cookie,
        RAW_COOKIE_SIZE);
    if (BASE64_ENCODE_LEN(RAW_COOKIE_SIZE) > WANT_BASE64_COOKIE_SIZE)
        b64_cookie[WANT_BASE64_COOKIE_SIZE] = '\0';
    return b64_cookie;
}

And supporting code:

typedef struct {
    apr_uint32_t i1;
    apr_uint32_t i2;
} rand_uints_t;
static rand_uints_t rand_uints;
static apr_thread_mutex_t *rand_mutex;
static int rand_initialized;

static void init_random(request_rec *r)
{
    ap_assert(APR_SUCCESS == apr_thread_mutex_lock(rand_mutex));
    if (rand_initialized == 0) { /* we won the race to initialize */
        apr_file_t *f;
        apr_size_t bytes_read;
        ap_assert(APR_SUCCESS == apr_file_open(&f, "/dev/urandom",
            APR_READ|APR_BINARY, 0444, r->pool));
        ap_assert(APR_SUCCESS == apr_file_read_full(f, &rand_uints,
            sizeof(rand_uints_t), &bytes_read));
        ap_assert(sizeof(rand_uints) == bytes_read);
        ap_assert(APR_SUCCESS == apr_file_close(f));
        rand_initialized = 1;
    }
    ap_assert(APR_SUCCESS == apr_thread_mutex_unlock(rand_mutex));
}


/**
* Initialization for each child process.
*/
static void init_child(apr_pool_t *p, server_rec *s) {
    ap_assert(APR_SUCCESS == apr_thread_mutex_create(&rand_mutex,
        APR_THREAD_MUTEX_DEFAULT, p));
}



On Wed, Jun 26, 2013 at 8:49 AM, Jan Kaluža jkal...@redhat.com wrote:

 Hi,

 currently mod_unique_id uses apr_gethostname(...) and PID pair as a base
 to generate unique ID. The way how it's implemented brings some problems:

 1. For IPv6-only hosts it uses low-order bits of IPv6 address as if they
 were unique, which is wrong.

 2. It relies on working DNS. It can happen that hostname does not have the
 IP assigned during httpd start (for example during the boot) and I think it
 is still valid use-case (without mod_unique_id module loaded, httpd works
 well in this case).

 3. It calls 1 second sleep to overcome possible usage of the same PID
 after restart as the one used before the restart.

 If I'm right, we could fix the problems above by using
 ap_random_insecure_bytes instead of in_addr/pid pair as a base for
 unique ID generation. It would also make 

Re: Time for 2.4.5 ??

2013-05-26 Thread Daniel Lescohier
'+' is not used in URL-encoding; '+' is used in
application/x-www-form-urlencoded encoding.  So I assume this function has
implemented unescaping of application/x-www-form-urlencoded.  In regular
URL-encoding, a space is encoded as %20.
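
As an illustration (a hypothetical helper, not httpd's decoder), the one
place the two decodings differ is the treatment of '+':

#include <ctype.h>
#include <stdlib.h>

static char decode_next(const char **s, int form_urlencoded)
{
    char c = *(*s)++;
    if (c == '%' && isxdigit((unsigned char)(*s)[0])
                 && isxdigit((unsigned char)(*s)[1])) {
        char hex[3] = { (*s)[0], (*s)[1], '\0' };
        *s += 2;
        return (char)strtol(hex, NULL, 16);  /* %20 -> ' ' in both schemes */
    }
    if (c == '+' && form_urlencoded)
        return ' ';  /* '+' means space only in x-www-form-urlencoded */
    return c;        /* in plain URL-encoding, '+' is a literal plus */
}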


On Fri, May 24, 2013 at 11:46 PM, Guenter Knauf fua...@apache.org wrote:

 Hi Daniel,

 On 25.05.2013 02:06, Guenter Knauf wrote:

 On 24.05.2013 23:45, Daniel Gruno wrote:

 That's fine by me, I'm not married to 'sleep' (although I do like a good
 nap)

 hehe, ok; I look into it soon.

 done.

 Found another small docu bug:

 r:unescape(string) -- Unescapes an URL-escaped string:

 local url = "http%3a%2f%2ffoo.bar%2f1+2+3+%26+4+%2b+5"
 local unescaped = r:escape(url) -- returns 'http://foo.bar/1 2 3 & 4 + 5'

 the function call should here be r:unescape(url) ...

 Gün.





Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-08 Thread Daniel Lescohier
It looks like apr_atomic is not in good-enough shape to be used for
lock-free algorithms.  It's probably good enough for the use-cases that the
event mpm uses it for, because the event mpm is not using it for protecting
memory besides the counters it's atomically changing.  However, the time
caches would use the cache key to protect other memory (the cache value),
so apr_atomic is not good enough, because it's inconsistent on the
different platforms with respect to memory barriers.

For example, with x86, the add32 and cas32 functions act as a full memory
barrier on all memory.  However, for powerpc, the add32 and cas32 functions
only act as a memory barrier on the individual item being changed; they
don't act as a memory barrier on all memory.  One would have to add
lightweight sync assembler instructions to the add32 and cas32 function
implementations for powerpc if you want the same semantics as x86.

The problem is that the apr_atomic API is not explicit with the memory
ordering semantics.  I agree with you that in the future the apr_atomic API
should be changed to be more like the C11 atomic API, which allows explicit
choice of memory ordering.
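
For illustration, here is the same validation written against C11
<stdatomic.h>, where the ordering is explicit.  The type and function
names are stand-ins, and the plain struct copy glosses over the formal
data race on the value (the usual seqlock caveat), so treat this as a
sketch of the orderings, not a drop-in implementation:

#include <stdatomic.h>
#include <stdint.h>

typedef struct { int tm_sec; int tm_min; /* ... */ } xtime_t;

typedef struct {
    _Atomic uint32_t key;
    xtime_t xt;
} cache_element_t;

static int cache_read(cache_element_t *e, uint32_t seconds, xtime_t *out)
{
    if (atomic_load_explicit(&e->key, memory_order_acquire) != seconds
        || seconds == 0)
        return 0;                 /* miss: caller computes the value */
    *out = e->xt;                 /* speculative copy */
    /* Read barrier: the copy above may not be reordered past this load. */
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&e->key, memory_order_relaxed) == seconds;
}

static void cache_write(cache_element_t *e, uint32_t seconds,
                        uint32_t seen_key, const xtime_t *xt)
{
    /* seen_key is the key observed on the failed read; ~seconds marks
     * the element invalid while the update is in flight.  seq_cst is
     * conservative: it keeps the invalidation ordered before the copy. */
    if (atomic_compare_exchange_strong_explicit(&e->key, &seen_key,
            ~seconds, memory_order_seq_cst, memory_order_relaxed)) {
        e->xt = *xt;
        /* Release: publish the copy before the valid key. */
        atomic_store_explicit(&e->key, seconds, memory_order_release);
    }
}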

But that leaves us with the problem of the current time caching code, which
is not thread-safe.  We should fix it using the existing APR APIs, and not
use the apr_atomic API which is broken for this use-case.

I found that there is a lock-free method available to us with the current
APR API.  We can use apr_thread_mutex_trylock.  In glibc, that just
attempts to cmpxchg the userspace futex; if that fails, it does not make a
futex system call or put the caller to sleep, so it's lock-free.  On
Windows, TryEnterCriticalSection probably works the same way.

So, the pseudo-code would be something like:

For reading from cache:

apr_uint32_t key = cache_element->key;
/* Above is done speculatively */
if (seconds == key && seconds != 0) {
    /* seconds == 0 may mean cache is uninitialized,
     * so don't use cache */
    *xt = cache_element->xt;
    /* After copying xt, make sure cache_element was not marked invalid
     * by another thread beginning an update, and that cache_element
     * really contained data for our second. */
    if (apr_thread_mutex_trylock(cache_element->mutex) == APR_SUCCESS) {
        key = cache_element->key;
        apr_thread_mutex_unlock(cache_element->mutex);
        if (key == seconds) {
            xt->tm_usec = (int)apr_time_usec(t);
            return APR_SUCCESS;
        }
    }
}
/* calculate the value if reached here */

Write to cache code:

if (apr_thread_mutex_trylock(cache_element->mutex) == APR_SUCCESS)
{   /* We won the race to update this cache_element. */
    cache_element->key = ~seconds;
    /* Above marks cache_element as invalid by using ~seconds,
     * for the speculative reads not using the mutex. */
    cache_element->xt = *xt;
    /* Finished copying, now update key with our key */
    cache_element->key = seconds;
    apr_thread_mutex_unlock(cache_element->mutex);
}

The above is lock-free, because process scheduling decisions by the OS
won't cause the current thread to block; it keeps on doing work without
blocking as long as it is scheduled.  It would be fast, though
perhaps not quite as fast as if we had an atomic API we could use
effectively.  With an atomic API, the read-from-time-cache code would only
be reading from cache-lines, not writing to cache-lines with the futex's
cmpxchg instruction.  You might also be able to use a lighter-weight memory
barrier.  But the mutex is lock-free and fast, and it's something that'd
work with today's APIs; and the existing code in production is not correct.

BTW, I looked at rwlock_trylock, and that isn't lock-free in glibc, because
the readers/writers count variables are protected with a low-level lock
futex that is acquired with lll_lock, not lll_trylock.  So, if a thread
acquires the low-level lock, but is unscheduled by the OS before it
finishes updating the reader/writers count variables and unlocks the
low-level lock, that would block other threads from acquiring the
low-level lock, and other threads would block before being able to access
the readers/writers count variables.  If it had used lll_trylock instead, I
think it would've been lock-free.  However, I don't think rwlock would give
us any particular advantage over a regular mutex for the above algorithm.

So using the simpler mutex is better and is lock-free.  I think the
important part is to have a lock-free algorithm: if we lose a little bit of
performance by not being able to use the lightest lock-free algorithm, I
think that is OK.

On Mon, Jan 7, 2013 at 5:50 PM, Stefan Fritsch s...@sfritsch.de wrote:

 On Sunday 06 January 2013, Daniel Lescohier wrote:
  I'm not sure we should include memory barriers inside the
  apr_atomic_read32 and apr_atomic_set32 functions, because

Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-08 Thread Daniel Lescohier
Some timing tests:

$ gcc -I/usr/include/apr-1 -lrt -lapr-1 -lpthread test.c
$ ./a.out
apr_time_exp_lt: 108 nanosecs per call
localtime_r: 85 nanosecs per call
apr_time_exp_gmt: 59 nanosecs per call
gmtime_r: 35 nanosecs per call
pthread_mutex_trylock/unlock/read cache: 19 nanosecs per call
lfence/read cache: 8 nanosecs per call
read cache: 4 nanosecs per call
$ cat test.c
#include <time.h>
#include <stdio.h>
#include <assert.h>
#include <apr_time.h>
#include <pthread.h>


int main (int argc, char *argv[]) {
    struct timespec old, new;
    struct tm tmv;
    int i, iterations = 1000;
    long nanosecs;
#define CALC_NS(new, old) ((new.tv_sec - old.tv_sec)*1000000000L + \
    new.tv_nsec - old.tv_nsec)
    apr_time_t t = apr_time_now();
    apr_time_exp_t xt;
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    unsigned int seconds = 1;
    typedef struct { char s[44]; unsigned int key; } data_t;
    static volatile data_t cache_data;
    data_t my_copy;

    cache_data.key = 1;

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
    for (i = 0; i < iterations; ++i) {
        apr_time_exp_lt(&xt, t);
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
    nanosecs = CALC_NS(new, old);
    printf("apr_time_exp_lt: %ld nanosecs per call\n", nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
    for (i = 0; i < iterations; ++i) {
        localtime_r(&old.tv_sec, &tmv);
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
    nanosecs = CALC_NS(new, old);
    printf("localtime_r: %ld nanosecs per call\n", nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
    for (i = 0; i < iterations; ++i) {
        apr_time_exp_gmt(&xt, t);
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
    nanosecs = CALC_NS(new, old);
    printf("apr_time_exp_gmt: %ld nanosecs per call\n", nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
    for (i = 0; i < iterations; ++i) {
        gmtime_r(&old.tv_sec, &tmv);
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
    nanosecs = CALC_NS(new, old);
    printf("gmtime_r: %ld nanosecs per call\n", nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
    for (i = 0; i < iterations; ++i) {
        int key = cache_data.key;
        if (seconds == key && key != 0) {
            my_copy = cache_data;
            if (pthread_mutex_trylock(&mutex) == 0) {
                key = cache_data.key;
                pthread_mutex_unlock(&mutex);
            }
        }
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
    nanosecs = CALC_NS(new, old);
    printf("pthread_mutex_trylock/unlock/read cache: %ld nanosecs per"
        " call\n", nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
    for (i = 0; i < iterations; ++i) {
        int key = cache_data.key;
        if (seconds == key && key != 0) {
            my_copy = cache_data;
            asm volatile("lfence" ::: "memory");
            key = cache_data.key;
        }
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
    nanosecs = CALC_NS(new, old);
    printf("lfence/read cache: %ld nanosecs per call\n",
        nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
    for (i = 0; i < iterations; ++i) {
        int key = cache_data.key;
        if (seconds == key && key != 0) {
            my_copy = cache_data;
            key = cache_data.key;
        }
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
    nanosecs = CALC_NS(new, old);
    printf("read cache: %ld nanosecs per call\n", nanosecs/iterations);
}


On Tue, Jan 8, 2013 at 1:50 PM, Daniel Lescohier
daniel.lescoh...@cbsi.com wrote:

 It looks like apr_atomic is not in good-enough shape to be used for
 lock-free algorithms.  It's probably good enough for the use-cases that the
 event mpm uses it for, because the event mpm is not using it for protecting
 memory besides the counters it's atomically changing.  However, the time
 caches would use the cache key to protect other memory (the cache value),
 so apr_atomic is not good enough, because it's inconsistent on the
 different platforms with respect to memory barriers.

 For example, with x86, the add32 and cas32 functions act as a full memory
 barrier on all memory.  However, for powerpc, the add32 and cas32 functions
 only act as a memory barrier on the individual item being changed; they
 don't act as a memory barrier on all memory.  One would have to add lightweight sync
 assembler instructions to the add32 and cas32 function implementations for
 powerpc if you want the same semantics as x86.

 The problem is that the apr_atomic API is not explicit with the memory
 ordering semantics.  I agree with you that in the future the apr_atomic API
 should be changed to be more like the C11 atomic API, which allows explicit
 choice of memory ordering.

 But that leaves us with the problem of the current time caching code,
 which is not thread-safe.  We should fix it using the existing APR APIs,
 and not use

Re: event mpm and mod_status

2013-01-07 Thread Daniel Lescohier
I see that event mpm uses a worker queue that uses a condition variable,
and it does a condition variable signal when something is pushed onto it.
If all of the cpu cores are doing useful work, the signal is not going to
force a context switch out of a thread doing useful work; the thread will
continue working until it has used up its timeslice, as long as it doesn't
end early because it becomes blocked on i/o or something like that.  So, in
that case, the first thread that finishes serving its request and gets to
ap_queue_pop_something(worker_queue...) will get the item on the queue, and
it will have hot caches.  On the other hand, if only some of the cpu cores
are active, but the threads that are active are busy in the middle of a
request, the condition variable signal will wake up one of the idle
threads, the thread will be scheduled onto an idle core, and that thread
can start doing useful work.  That thread may have colder caches, but the
other cores were busy doing useful work any way, so it's worth activating
this thread on an idle core.

Also, if a worker thread is idle waiting on
ap_queue_pop_something(worker_queue...), and the thread was unscheduled by
the OS process scheduler, when you wake it up again, the OS process
scheduler won't necessarily schedule it on the same cpu core it was on
before.  So, it won't necessarily have a warm cache after you wake it up
again.  You're only guaranteed a warm cache if the thread remains scheduled
by the OS.  The current queue/condvar implementation favors sending new
requests to threads that remain scheduled by the OS.

On Sat, Jan 5, 2013 at 9:37 AM, Jim Jagielski j...@jagunet.com wrote:

 +1... a lot of little improvements can result in a BIG
 improvement.

 On Jan 5, 2013, at 8:34 AM, Graham Leggett minf...@sharp.fm wrote:

  On 05 Jan 2013, at 2:05 AM, Stefan Fritsch s...@sfritsch.de wrote:
 
  For 1., a better thread selection would definitely be a win. For 2.
  and 3., it is less obvious.
 
  +1.
 
  Just because in some cases a cache might not help, doesn't mean we
 shouldn't take advantage of the cache when it will help.
 
  Regards,
  Graham
  --
 




Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-05 Thread Daniel Lescohier
I'd have to go back and review the Intel documentation to be sure, but for
this particular algorithm, an explicit memory barrier may be required on
Intel architecture, also.  If I remember correctly, without a memory
barrier, Intel arch only guarantees total memory ordering within one cache
line.  For this algorithm, we have an array of 16 cache_elements of 48
bytes each, so half of the cache_elements cross 64-byte cache lines.  When
reading the cache_element->key after the copy of the cache_element value,
we need to make sure that the cache_element value read is ordered before
the read of the cache_element->key, so one needs a memory barrier just
before the read of the cache_element->key to guarantee the ordering.

On Sat, Jan 5, 2013 at 5:08 AM, Igor Galić i.ga...@brainsware.org wrote:


   Sigh. I was too much focused on x86. There the compiler barrier
   caused
   by the function call is enough. But you are right, on other
   architectures these functions may also require cpu memory barriers.
 
   on x86 … is enough - would it harm x86 to add CPU barriers, or
  do we want to have # define distinctions per arch?

 ignore me, I just realized it's going to be different
 calls per arch anyway!

 --
 Igor Galić

 Tel: +43 (0) 664 886 22 883
 Mail: i.ga...@brainsware.org
 URL: http://brainsware.org/
 GPG: 6880 4155 74BD FD7C B515  2EA5 4B1D 9E08 A097 C9AE




Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-05 Thread Daniel Lescohier
I was wrong about the Intel architecture: it has a pretty strong
memory-ordering guarantee.  But other architectures, like PowerPC and ARM,
have a weak memory-ordering guarantee.  Here is a good summary:
https://en.wikipedia.org/wiki/Memory_ordering.  The doc in footnote 3 is a
good document to read.

I'm not sure we should include memory barriers inside the apr_atomic_read32
and apr_atomic_set32 functions, because you don't necessarily need a full
memory barrier on each read or set.  Instead, perhaps we should add three
functions to APR:

   1. apr_atomic_rmb(): read memory barrier.
   2. apr_atomic_wmb(): write memory barrier.
   3. apr_atomic_mb(): full memory barrier.
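
On GCC-based builds these could start out as thin wrappers over the
compiler's builtins; a sketch of the proposed (not existing) APR
functions, conservatively using a full barrier for all three:

/* Per-architecture versions could weaken rmb/wmb later (e.g. lwsync on
 * PowerPC, or plain compiler barriers on x86). */
APR_DECLARE(void) apr_atomic_rmb(void) { __sync_synchronize(); }
APR_DECLARE(void) apr_atomic_wmb(void) { __sync_synchronize(); }
APR_DECLARE(void) apr_atomic_mb(void)  { __sync_synchronize(); }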

As a use-case, the new util_time.c algorithm could be changed to be:

For the read-from-cache portion:

const apr_uint32_t key = apr_atomic_read32(&cache_element->key);
/* Above is done speculatively, no memory barrier used. */
if (seconds == key && seconds != 0) {
    /* seconds == 0 may mean cache is uninitialized,
     * so don't use cache */
    *xt = cache_element->xt;
    /* After copying xt, make sure cache_element was not marked invalid
     * by another thread beginning an update, and that cache_element
     * really contained data for our second.
     * Requires memory barrier. */
    apr_atomic_rmb();
    if (apr_atomic_read32(&cache_element->key) == seconds) {
        xt->tm_usec = (int)apr_time_usec(t);
        return APR_SUCCESS;
    }
}

And the write-to-cache portion would be:

if (key == apr_atomic_cas32(&cache_element->key, ~seconds, key))
{   /* We won the race to update this cache_element.
     * Above marks cache_element as invalid by using ~seconds,
     * because we are starting an update: it's the start of a
     * transaction. */
    cache_element->xt = *xt;
    /* Finished copying, now update key with our key,
     * ending the transaction. Need to use a
     * memory barrier.
     */
    apr_atomic_wmb();
    apr_atomic_set32(&cache_element->key, seconds);
}

On Sat, Jan 5, 2013 at 8:40 AM, Daniel Lescohier
daniel.lescoh...@cbsi.com wrote:

 I'd have to go back and review the Intel documentation to be sure, but for
 this particular algorithm, an explicit memory barrier may be required on
 Intel architecture, also.  If I remember correctly, without a memory
 barrier, Intel arch only guarantees total memory ordering within one cache
 line.  For this algorithm, we have an array of 16 cache_elements of 48
 bytes each, so half of the cache_elements cross 64-byte cache lines.  When
 reading the cache_element->key after the copy of the cache_element value,
 we need to make sure that the cache_element value read is ordered before
 the read of the cache_element->key, so one needs a memory barrier just
 before the read of the cache_element->key to guarantee the ordering.


 On Sat, Jan 5, 2013 at 5:08 AM, Igor Galić i.ga...@brainsware.org wrote:


   Sigh. I was too much focused on x86. There the compiler barrier
   caused
   by the function call is enough. But you are right, on other
   architectures these functions may also require cpu memory barriers.
 
   on x86 … is enough - would it harm x86 to add CPU barriers, or
  do we want to have # define distinctions per arch?

 ignore me, I just realized it's going to be different
 calls per arch anyway!

 --
 Igor Galić

 Tel: +43 (0) 664 886 22 883
 Mail: i.ga...@brainsware.org
 URL: http://brainsware.org/
 GPG: 6880 4155 74BD FD7C B515  2EA5 4B1D 9E08 A097 C9AE





thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-04 Thread Daniel Lescohier
I entered a bug on the thread-safety of the time conversion caches in
server/util_time.c and modules/loggers/mod_log_config.c.

In brief, they're not thread-safe because:

   1. The algorithm depends on total memory ordering, but neither the
   volatile qualifier nor memory barriers are used.
   2. The algorithm is subject to the ABA problem: a reader can validate
   against a key that changed and changed back to the same value while the
   cached value was being rewritten.

The details of the problem are here:

https://issues.apache.org/bugzilla/show_bug.cgi?id=54363

I included in the bug not-yet-tested patches with a different algorithm.

Do other people agree that the existing algorithm have the problems
explained in detail in the bug?

Thanks,

Daniel Lescohier


Re: event mpm and mod_status

2013-01-04 Thread Daniel Lescohier
I just saw this from last month from Stefan Fritsch and Niklas Edmundsson:

The fact that the client ip shows up on all threads points to some
 potential optimization: Recently active threads should be preferred,
 because their memory is more likely to be in the cpu caches. Right
 now, the thread that has been idle the longest time will do the work.

 Ah, virtually guaranteeing that the thread with the coldest cache gets to
 do the work...

I definitely agree on the potential for improvement here, you would most
 likely want to select either the thread that processed this request last
 time, or the most recently active idle thread. These two conditions kinda
 collides though, so the challenge is probably to come up with some rather
 cheap selection algorithm that is good enough.


Which CPU memory caches are you referring to?

   1. For stack, on a new request, you're writing a new call stack, the
   prior request's stack was unwound.
   2. For heap, you're creating a new request pool, so you will be popping
   an apr memnode from the head of the allocator's free list.  It may even be
   the same memnode that was pushed onto the free list when the prior
   request's apr pool was freed.
   3. On a new request, the next code/instructions it will execute will be
   the functions that read the http request and populate the request_rec.
   That's different code than happened at the end of the prior request
   (serving the request, logging the request, etc.).

So, I'm not convinced a thread selection algorithm is needed.


Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-04 Thread Daniel Lescohier
apr_atomic_read32 is not implemented with a memory barrier.  The
implementation of apr_atomic_read32 in the APR code base is just a read
from a pointer marked volatile.  E.g., here is the atomic/unix/builtins.c
implementation:

APR_DECLARE(apr_uint32_t) apr_atomic_read32(volatile apr_uint32_t *mem)
{
return *mem;
}

The atomic/unix/builtins.c implementation would have to change to this to
have a memory barrier:

APR_DECLARE(apr_uint32_t) apr_atomic_read32(volatile apr_uint32_t *mem)
{
__sync_synchronize();
return *mem;
}

All the other apr atomic implementations would have to change if
apr_atomic_read32 is supposed to mean with-a-memory-barrier.  None of the
current implementations implement a memory barrier (they all use the C
definition shown at the beginning of this email), so I assume
apr_atomic_read32 (and apr_atomic_set32) mean without-a-memory-barrier.

The Apache Portable Runtime API has no memory barrier function in the API
(like gcc's builtin __sync_synchronize()), so I had to choose among the
other functions in the API.  I decided to use Compare-And-Swap for the
memory barrier, because the generic Compare-And-Swap concept implies the
use of memory barriers, and also the APR implementation actually implements
a memory barrier in apr_atomic_cas32.  I also could've used
apr_atomic_add32(&cache_element->key, 0) == seconds, but Compare-And-Swap is
the more generic concept, so I used that instead.  I agree, though, for the
read-the-cache_element portion of the function, it'd be better to do just
an atomic_read, instead of a compare-and-swap, if only the APR Atomic API
had a memory barrier function that I could call before the read.
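
Concretely, the validating read ends up as the following idiom (a
fragment using the names from the patch): a compare-and-swap whose "new"
value equals the compare value degenerates into a read with a full
memory barrier.

if (apr_atomic_cas32(&cache_element->key, seconds, seconds) == seconds) {
    /* barrier done; the speculative copy made earlier is valid */
}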

After copying from the time cache_element to this thread's memory allocated
from the request pool, the thread needs a memory barrier when checking the
cache_element->key, in order to make sure another cpu thread/core didn't
start modifying the cache_element while this thread was copying it.  But
note that before doing the copy to this thread's memory, the function
doesn't use a memory barrier: it checks cache_element->key without one,
because it is speculatively doing the copy (almost all of the time there
won't be a conflict) and then validating, with a memory barrier, after the
copy is done.  In the rare case that the validation fails, the thread just
calculates the time value itself.

For the old code, I haven't seen bad results in practice (not that I
probably would've noticed it), I only noticed that the code is not portably
thread-safe because I was working on a project to upgrade our web servers
to 2.4, and I was updating our own internal custom Apache modules, and I
noticed this when looking at the source of mod_log_config.c.  So, I brought
it up because the algorithm does not look correct.  The problem was that
the old code was supposedly a lock-free and wait-free algorithm, but it
didn't use the atomic processor instructions you need to use in order to
implement lock-free and wait-free algorithms.

On Fri, Jan 4, 2013 at 7:58 PM, Stefan Fritsch s...@sfritsch.de wrote:

 On Friday 04 January 2013, Daniel Lescohier wrote:
  I entered a bug on the thread-safety of the time conversion caches
  in server/util_time.c and modules/loggers/mod_log_config.c.
 
  In brief, they're not thread-safe because:
 
  1. The algorithm depends on total memory ordering, but neither the
  volatile qualifier nor memory barriers are used.
 2. The algorithm is subject to the ABA problem.

 I agree that the lack of memory barriers is a problem.

 And the ABA problem would not exist if callers of ap_recent_* would
 actually check that the time *is* recent (as written in the docs in
 util_time.h). This is not a problem for the error log, because it
 always uses apr_time_now(). But the request time in mod_log_config may
 be further in the past.

 Out of interest, have you seen the function give wrong results in
 practice?

  The details of the problem are here:
 
  https://issues.apache.org/bugzilla/show_bug.cgi?id=54363
 
  I included in the bug not-yet-tested patches with a different
  algorithm.
 
  Do other people agree that the existing algorithm have the problems
  explained in detail in the bug?

 I haven't reviewed your patches in detail, yet. One thing:
 apr_atomic_read should be enough, you don't need apr_atomic_cas with
 the same value.

 Cheers,
 Stefan