Re: Some Oprofile results

2004-07-13 Thread Brian Akins
Here's some of the "gprof" output from Oprofile:
  %   cumulative     self              self     total
 time   samples     samples    calls  T1/call  T1/call  name
 7.27   154306.00   154306.00                           core_input_filter
 5.85   278523.00   124217.00                           core_output_filter
 5.78   401320.00   122797.00                           net_time_filter
 5.64   521049.00   119729.00                           ap_rgetline_core
 4.97   626628.00   105579.00                           pcre_exec
 4.37   719425.00    92797.00                           match
 3.87   801602.00    82177.00                           ap_getword_white
 3.41   873893.00    72291.00                           _init
 3.09   939532.00    65639.00                           ap_read_request
 2.86  1000314.00    60782.00                           add_any_filter_handle
 2.81  1060025.00    59711.00                           ap_escape_logitem
 2.27  1108215.00    48190.00                           ap_process_http_connection
 2.21  1155055.00    46840.00                           ap_http_header_filter
 2.15  1200695.00    45640.00                           ap_get_brigade
 1.80  1239009.00    38314.00                           ap_byterange_filter
 1.70  1275003.00    35994.00                           uniq_field_values
 1.61  1309195.00    34192.00                           ap_update_child_status_from_indexes
 1.52  1341515.00    32320.00                           worker_thread
 1.52  1373810.00    32295.00                           ap_get_mime_headers_core
 1.35  1402563.00    28753.00                           regexec
 1.35  1431145.00    28582.00                           ap_pass_brigade
 1.25  1457724.00    26579.00                           ap_run_post_read_request
 1.21  1483482.00    25758.00                           ap_process_request
 1.16  1508164.00    24682.00                           ap_find_token
 1.16  1532765.00    24601.00                           fix_hostname
 1.14  1556887.00    24122.00                           ap_http_filter
 1.09  1580049.00    23162.00                           ap_recent_rfc822_date
 1.07  1602729.00    22680.00                           ap_run_create_request
 1.03  1624644.00    21915.00                           ap_update_child_status
 0.98  1645365.00    20721.00                           read_request_line
 0.97  1666036.00    20671.00                           core_create_req
 0.96  1686333.00    20297.00                           ap_content_length_filter
 0.85  1704396.00    18063.00                           ap_make_method_list
 0.73  1719824.00    15428.00                           ap_set_keepalive
 0.71  1734882.00    15058.00                           ap_set_content_type
 0.70  1749708.00    14826.00                           worker_stack_wait
 0.70  1764497.00    14789.00                           ap_get_remote_host
 0.69  1779171.00    14674.00                           worker_stack_awaken_next
 0.68  1793672.00    14501.00                           ap_save_brigade
 0.64  1807320.00    13648.00                           perform_idle_server_maintenance
 0.60  1820156.00    12836.00                           ap_discard_request_body
 0.59  1832632.00    12476.00                           send_all_header_fields
 0.58  1844914.00    12282.00                           ap_method_number_of
 0.56  1856835.00    11921.00                           ap_run_quick_handler
 0.55  1868577.00    11742.00                           cached_explode
 0.54  1880120.00    11543.00                           lookup_builtin_method
 0.54  1891652.00    11532.00                           ap_add_input_filter_handle
 0.54  1903038.00    11386.00                           ap_get_server_name
 0.51  1913943.00    10905.00                           ap_update_vhost_from_headers

Here's some of the annotated source for core_input_filter:
               :static int core_input_filter(ap_filter_t *f, apr_bucket_brigade *b,
               :                             ap_input_mode_t mode, apr_read_type_e block,
               :                             apr_off_t readbytes)
 42610  0.5149 :{ /* core_input_filter total: 154306  1.8647 */

               :    /* ### This is bad. */
 16586  0.2004 :    BRIGADE_NORMALIZE(ctx->b);
 65812  0.7953 :    if (APR_STATUS_IS_EAGAIN(rv)) {
               :        rv = APR_SUCCESS;
               :    }
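
For context, that macro (reconstructed here from memory of httpd 2.0's
server/core.c; a sketch, not verbatim) linearly walks the whole brigade on
every call, deleting empty buckets, which is why it shows up this hot:

#define BRIGADE_NORMALIZE(b)                                         \
    do {                                                             \
        apr_bucket *e = APR_BRIGADE_FIRST(b);                        \
        /* Scan the entire brigade, deleting zero-length data        \
         * buckets; O(n) on every invocation of the input filter. */ \
        while (!APR_BRIGADE_EMPTY(b)                                 \
               && e != APR_BRIGADE_SENTINEL(b)) {                    \
            if (e->length == 0 && !APR_BUCKET_IS_METADATA(e)) {      \
                apr_bucket *d = APR_BUCKET_NEXT(e);                  \
                apr_bucket_delete(e);                                \
                e = d;                                               \
            }                                                        \
            else {                                                   \
                e = APR_BUCKET_NEXT(e);                              \
            }                                                        \
        }                                                            \
    } while (0)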

--
Brian Akins
Senior Systems Engineer
CNN Internet Technologies


Re: Some Oprofile results

2004-07-11 Thread Brian Pane
Brian Akins wrote:
I'm writing an optimized caching module.  I've been using 2.0.50.  
Here are the "top 50" from oprofile ( http://oprofile.sourceforge.net/ ):

Linux cnnsquid2 2.6.7-cnn.1smp #1 SMP Wed Jun 16 13:41:14 EDT 2004 
x86_64 x86_64 x86_64 GNU/Linux

Using leader mpm.  I patched apr_atomics so that atomics work on x86_64.
The server is serving ~27k requests per second:
Are there optimizations to apr_palloc in 2.1?  This seems to be a good 
place to optimize.

There's not much left to optimize in apr_palloc itself.  In the common
case, the execution path consists of a single conditional check and a
bit of pointer addition.  But reducing the number of calls to apr_palloc
could yield an improvement.  Do you happen to have a way to enable
call-graph profiling on your test system, to find out what the top call
chains are for the top 50 functions?  (My understanding is that recent
releases of oprofile support this with an optional patch.)
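
For reference, that common case, paraphrased from APR 0.9's apr_pools.c (a
sketch, not the verbatim source; apr_palloc_slow_path is a hypothetical name
for the elided refill code), is roughly:

#include "apr_pools.h"

/* Sketch of apr_palloc's fast path, paraphrased from APR 0.9's
 * apr_pools.c (not verbatim): one alignment round-up, one bounds
 * check, and a pointer bump. */
APR_DECLARE(void *) apr_palloc(apr_pool_t *pool, apr_size_t size)
{
    apr_memnode_t *active = pool->active;
    void *mem;

    size = APR_ALIGN_DEFAULT(size);      /* round up to 8-byte boundary */

    if (size < (apr_size_t)(active->endp - active->first_avail)) {
        mem = active->first_avail;       /* bump-pointer allocation */
        active->first_avail += size;
        return mem;
    }

    /* Slow path: the active node is full; fetch a larger node from
     * the allocator.  (apr_palloc_slow_path is a hypothetical name;
     * the real source does this inline.) */
    return apr_palloc_slow_path(pool, size);
}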
The large amount of time spent in memcpy could also be a good
candidate for optimization.
Brian


Re: Some Oprofile results

2004-07-07 Thread Bill Stoddard
Brian Akins wrote:
Bill Stoddard wrote:
I'd certainly be interested in knowing how much faster mod_url_cache 
is as compared to mod_cache/mod_mem_cache. The only way I can see to 
-dramatically- improve the performance of mod_cache/mod_mem_cache is 
to completely bypass i/o filters. I have some patches to mod_mem_cache 
laying around that give perhaps a 10% performance improvement by 
prebuilding headers and aligning them on cache boundaries, but 10% is 
not really that great.

Like I said, it can fill up two Gig pipes on a dual Opteron with ~25% 
idle time. mod_cache can't fill up one.

If you run in quick_handler, the only way you get filters (other than 
core_output) is if you add them yourself.  Guess if I did or not :)

If the headers are stored and "thawed" in an efficient manner (both 
processor and memory) they don't affect the performance much.  Also, 
sometimes mmap is slower than just plain old read.

Also, using a filesystem and sendfile is much faster than serving from 
RAM, plus multiple processes can "share" the same cache.  mod_cache 
doesn't do that.

All of these are based on my observations of our environment.
Thanks, that's what I wanted to know.  I can pretty much guess the implementation (I've done similar stuff 
before).  mod_mem_cache takes a middle-of-the-road approach; performance is sacrificed to provide some ability 
to filter i/o.  However, it also makes sacrifices in functionality (like access/auth checking) by using the 
quick_handler to serve pages.  Too many compromises for the 'I want full functionality' folks and too many 
compromises for the 'I want best raw performance' folks.  mod_mem_cache tries to be all things to all masters 
but ends up falling short by all accounts.

I've got two more patches lying around that I may dust off if I get any free time :-)  The first introduces a 
config directive that moves the mod_mem_cache handler to the traditional handler phase (to pick up full 
access/auth checking and full filter capabilities).  The second patch (I call it my 'go fast' patch) 
experiments with prebuilding/caching the response headers and bypassing output filtering (in concept not 
unlike your mod_url_cache, I expect).

Bill


Re: Some Oprofile results

2004-07-07 Thread Brian Akins
Bill Stoddard wrote:
I'd certainly be interested in knowing how much faster mod_url_cache 
is as compared to mod_cache/mod_mem_cache. The only way I can see to 
-dramatically- improve the performance of mod_cache/mod_mem_cache is 
to completely bypass i/o filters. I have some patches to mod_mem_cache 
laying around that give perhaps a 10% performance improvement by 
prebuilding headers and aligning them on cache boundaries, but 10% is 
not really that great.

Like I said, it can fill up two Gig pipes on a dual Opteron with ~25% 
idle time. mod_cache can't fill up one.

If you run in quick_handler, the only way you get filters (other than 
core_output) is if you add them yourself.  Guess if I did or not :)
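
For illustration, a minimal skeleton of that arrangement (hypothetical
names; not the actual mod_url_cache source) might look like:

#include "httpd.h"
#include "http_config.h"
#include "http_protocol.h"
#include "http_request.h"

/* The quick_handler hook runs before the normal filter stack is
 * populated, so a cache hit served here passes through little more
 * than core_output_filter, unless the module adds filters itself. */
static int url_cache_quick_handler(request_rec *r, int lookup)
{
    if (lookup) {
        return DECLINED;    /* subrequest URI lookup; nothing to serve */
    }

    /* ... look up the URL in the cache; on a hit, set the cached
     * headers on r, pass the body straight down r->output_filters,
     * and return OK ... */

    return DECLINED;        /* miss: fall through to normal processing */
}

static void url_cache_register_hooks(apr_pool_t *p)
{
    ap_hook_quick_handler(url_cache_quick_handler, NULL, NULL,
                          APR_HOOK_FIRST);
}

module AP_MODULE_DECLARE_DATA url_cache_module = {
    STANDARD20_MODULE_STUFF,
    NULL,                       /* per-directory config creator */
    NULL,                       /* per-directory config merger */
    NULL,                       /* per-server config creator */
    NULL,                       /* per-server config merger */
    NULL,                       /* command table */
    url_cache_register_hooks
};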

If the headers are stored and "thawed" in an efficient manner (both 
processor and memory) they don't affect the performance much.  Also, 
sometimes mmap is slower than just plain old read.
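
For illustration, one way to "freeze" headers into a flat block at store
time (a hypothetical sketch, not the actual mod_url_cache code) might be:

#include <string.h>
#include "apr_pools.h"
#include "apr_strings.h"
#include "apr_tables.h"

/* Flatten an apr_table_t of response headers into one
 * "Name: value\r\n...\r\n" block once, at store time, so a cache hit
 * emits the headers with a single write instead of per-header
 * formatting on every request. */
static char *freeze_headers(apr_pool_t *p, const apr_table_t *headers,
                            apr_size_t *len_out)
{
    const apr_array_header_t *arr = apr_table_elts(headers);
    const apr_table_entry_t *elts = (const apr_table_entry_t *)arr->elts;
    apr_size_t len = 2;                  /* trailing blank line: "\r\n" */
    char *buf, *cur;
    int i;

    for (i = 0; i < arr->nelts; i++) {   /* first pass: total length */
        len += strlen(elts[i].key) + 2 + strlen(elts[i].val) + 2;
    }

    cur = buf = apr_palloc(p, len + 1);
    for (i = 0; i < arr->nelts; i++) {   /* second pass: fill buffer */
        cur += apr_snprintf(cur, len + 1 - (cur - buf), "%s: %s\r\n",
                            elts[i].key, elts[i].val);
    }
    memcpy(cur, "\r\n", 3);              /* blank line plus NUL */

    *len_out = len;
    return buf;
}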

Also, using a filesystem and sendfile is much faster than serving from 
RAM, plus multiple processes can "share" the same cache.  mod_cache 
doesn't do that.
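
For illustration, a cache hit served as a file bucket (again a hypothetical
sketch with invented names, assuming APR_SENDFILE_ENABLED as provided by
APR 0.9) lets core_output_filter hand the body to sendfile:

#include "httpd.h"
#include "http_protocol.h"
#include "apr_buckets.h"
#include "apr_file_io.h"

/* Serve a cache hit straight off the filesystem.  A file bucket lets
 * core_output_filter use sendfile(2) where available, avoiding a copy
 * through userspace, and the kernel page cache backing the file is
 * shared by every httpd process. */
static int serve_cached_file(request_rec *r, const char *path,
                             apr_off_t len)
{
    apr_file_t *fd;
    apr_bucket_brigade *bb;

    if (apr_file_open(&fd, path, APR_READ | APR_SENDFILE_ENABLED,
                      APR_OS_DEFAULT, r->pool) != APR_SUCCESS) {
        return DECLINED;                 /* treat as a cache miss */
    }

    bb = apr_brigade_create(r->pool, r->connection->bucket_alloc);
    APR_BRIGADE_INSERT_TAIL(bb,
        apr_bucket_file_create(fd, 0, (apr_size_t)len, r->pool,
                               r->connection->bucket_alloc));
    APR_BRIGADE_INSERT_TAIL(bb,
        apr_bucket_eos_create(r->connection->bucket_alloc));

    return ap_pass_brigade(r->output_filters, bb) == APR_SUCCESS
               ? OK : HTTP_INTERNAL_SERVER_ERROR;
}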

All of these are based on my observations of our environment. 

--
Brian Akins
Senior Systems Engineer
CNN Internet Technologies


Re: Some Oprofile results

2004-07-07 Thread Bill Stoddard
Brian Akins wrote:
Jean-Jacques Clar wrote:
Would you please share with us some info on that
optimized caching module.

I'll say what I can.  I'm hoping I can GPL it.

Is mod_url_cache a modified copy of the current ones?

No.  It borrows some ideas from various other modules, including 
mod_cache and mod_file_cache.

Why are the current ones not fulfilling your needs?

In our situation, we know a lot about the data that is being cached, so 
we can be more aggressive with caching.  mod_cache cannot cache URLs that 
end in /, so it would not help us.  Also, mod_cache (either disk or mem) 
was too slow for us.  mod_url_cache (I couldn't come up with a better 
name) can fill up two Gigabit interfaces on our Opterons and still have 
~25% CPU idle time.
I'd certainly be interested in knowing how much faster mod_url_cache is as compared to 
mod_cache/mod_mem_cache. The only way I can see to -dramatically- improve the performance of 
mod_cache/mod_mem_cache is to completely bypass i/o filters. I have some patches to mod_mem_cache laying 
around that give perhaps a 10% performance improvement by prebuilding headers and aligning them on cache 
boundaries, but 10% is not really that great.

Bill


Re: Some Oprofile results

2004-07-07 Thread Brian Akins
Jean-Jacques Clar wrote:
Would you please share with us some info on that
optimized caching module.

I'll say what I can.  I'm hoping I can GPL it.

Is mod_url_cache a modified copy of the current ones?
No.  It borrows some ideas from various other modules, including 
mod_cache and mod_file_cache. 

Why are the current ones not fulfilling your needs?

In our situation, we know a lot about the data that is being cached, so 
we can be more aggressive with caching.  mod_cache cannot cache URLs that 
end in /, so it would not help us.  Also, mod_cache (either disk or mem) 
was too slow for us.  mod_url_cache (I couldn't come up with a better 
name) can fill up two Gigabit interfaces on our Opterons and still have 
~25% CPU idle time.

--
Brian Akins
Senior Systems Engineer
CNN Internet Technologies


Re: Some Oprofile results

2004-07-07 Thread Brian Akins
Bill Stoddard wrote:
I submitted a patch to rework ap_rgetline/core_input_filter a year or 
so ago that provided some performance improvements. If you are 
inclined, it might be interesting to dig that patch out and see how it 
profiles.

Sure.  Re-post it and/or send it to me.
--
Brian Akins
Senior Systems Engineer
CNN Internet Technologies


Re: Some Oprofile results

2004-07-06 Thread Jean-Jacques Clar


>I'm writing an optimized caching module.
Would you please share with us some info on that 
optimized caching module.
Is mod_url_cache a modified copy of the current ones?
Why are the current ones not fulfilling your needs?
 
Thanks,
JJ


Re: Some Oprofile results

2004-07-06 Thread Bill Stoddard
Brian Akins wrote:
I'm writing an optimized caching module.  I've been using 2.0.50.  Here 
are the "top 50" from oprofile ( http://oprofile.sourceforge.net/ ):
I submitted a patch to rework ap_rgetline/core_input_filter a year or so ago that provided some performance 
improvements. If you are inclined, it might be interesting to dig that patch out and see how it profiles.

Bill


Some Oprofile results

2004-07-06 Thread Brian Akins
I'm writing an optimized caching module.  I've been using 2.0.50.  Here 
are the "top 50" from oprofile ( http://oprofile.sourceforge.net/ ):

Linux cnnsquid2 2.6.7-cnn.1smp #1 SMP Wed Jun 16 13:41:14 EDT 2004 
x86_64 x86_64 x86_64 GNU/Linux

Using leader mpm.  I patched apr_atomics so that atomics work on x86_64.
The server is serving ~27k requests per second:
Are there optimizations to apr_palloc in 2.1?  This seems to be a good 
place to optimize.

CPU: AMD64 processors, speed 2193.21 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a 
unit mask of 0x00 (No unit mask) count 10
samples  %        image name             symbol name
309339   6.5157   libc-2.3.3.so          __GI_memcpy
253711   5.3440   libc-2.3.3.so          __GI_strlen
135260   2.8490   libc-2.3.3.so          __GI___strcasecmp
112265   2.3647   libapr-0.so.0.9.5      apr_palloc
89160    1.8780   libc-2.3.3.so          __GI_memset
85604    1.8031   httpd                  core_input_filter
84395    1.7776   libaprutil-0.so.0.9.5  apr_brigade_create
83813    1.7654   libapr-0.so.0.9.5      apr_table_setn
82846    1.7450   httpd                  match
82736    1.7427   libapr-0.so.0.9.5      apr_table_get
81288    1.7122   httpd                  pcre_exec
81011    1.7064   libaprutil-0.so.0.9.5  apr_brigade_writev
71233    1.5004   httpd                  net_time_filter
70346    1.4817   httpd                  core_output_filter
68436    1.4415   mod_setenvif.so        match_headers
65654    1.3829   libapr-0.so.0.9.5      apr_vformatter
61061    1.2861   libc-2.3.3.so          __GI_memchr
58134    1.2245   httpd                  ap_rgetline_core
57236    1.2056   httpd                  ap_getword_white
57038    1.2014   libaprutil-0.so.0.9.5  apr_brigade_split_line
55767    1.1746   libapr-0.so.0.9.5      apr_pool_cleanup_register
52988    1.1161   mod_log_config.so      config_log_transaction
48604    1.0238   httpd                  ap_escape_logitem
46163    0.9723   libapr-0.so.0.9.5      allocator_alloc
45412    0.9565   libaprutil-0.so.0.9.5  apr_brigade_cleanup
45172    0.9515   libpthread-0.61.so     __pthread_mutex_lock_internal
41772    0.8799   mod_url_cache.so       url_cache_handler
40048    0.8435   libaprutil-0.so.0.9.5  MD5Transform
39382    0.8295   httpd                  ap_read_request
39253    0.8268   libapr-0.so.0.9.5      apr_socket_timeout_set
38626    0.8136   libpthread-0.61.so     __libc_read
38383    0.8085   libaprutil-0.so.0.9.5  apr_uri_parse
38003    0.8005   httpd                  add_any_filter_handle
36614    0.7712   libapr-0.so.0.9.5      apr_poll
36360    0.7659   libaprutil-0.so.0.9.5  apr_bucket_heap_make
35417    0.7460   libapr-0.so.0.9.5      apr_socket_recv
34016    0.7165   mod_url_cache.so       parse_cache_headers
32330    0.6810   libaprutil-0.so.0.9.5  apr_bucket_heap_create
31767    0.6691   httpd                  ap_http_header_filter
31348    0.6603   httpd                  ap_process_http_connection
29833    0.6284   libapr-0.so.0.9.5      apr_table_addn
29288    0.6169   libaprutil-0.so.0.9.5  heap_bucket_read
29273    0.6166   httpd                  anonymous symbol from section .plt
29122    0.6134   libaprutil-0.so.0.9.5  socket_bucket_read
29038    0.6116   libaprutil-0.so.0.9.5  apr_bucket_free
28632    0.6031   httpd                  ap_get_brigade
27897    0.5876   libapr-0.so.0.9.5      apr_pstrcat
27196    0.5728   libaprutil-0.so.0.9.5  heap_bucket_destroy
26658    0.5615   libaprutil-0.so.0.9.5  file_bucket_read

--
Brian Akins
Senior Systems Engineer
CNN Internet Technologies