Re: Some Oprofile results
Here's some of the "gprof" output from Oprofile:

  %    cumulative    self               self    total
 time    samples    samples     calls  T1/call T1/call  name
 7.27   154306.00  154306.00                            core_input_filter
 5.85   278523.00  124217.00                            core_output_filter
 5.78   401320.00  122797.00                            net_time_filter
 5.64   521049.00  119729.00                            ap_rgetline_core
 4.97   626628.00  105579.00                            pcre_exec
 4.37   719425.00   92797.00                            match
 3.87   801602.00   82177.00                            ap_getword_white
 3.41   873893.00   72291.00                            _init
 3.09   939532.00   65639.00                            ap_read_request
 2.86  1000314.00   60782.00                            add_any_filter_handle
 2.81  1060025.00   59711.00                            ap_escape_logitem
 2.27  1108215.00   48190.00                            ap_process_http_connection
 2.21  1155055.00   46840.00                            ap_http_header_filter
 2.15  1200695.00   45640.00                            ap_get_brigade
 1.80  1239009.00   38314.00                            ap_byterange_filter
 1.70  1275003.00   35994.00                            uniq_field_values
 1.61  1309195.00   34192.00                            ap_update_child_status_from_indexes
 1.52  1341515.00   32320.00                            worker_thread
 1.52  1373810.00   32295.00                            ap_get_mime_headers_core
 1.35  1402563.00   28753.00                            regexec
 1.35  1431145.00   28582.00                            ap_pass_brigade
 1.25  1457724.00   26579.00                            ap_run_post_read_request
 1.21  1483482.00   25758.00                            ap_process_request
 1.16  1508164.00   24682.00                            ap_find_token
 1.16  1532765.00   24601.00                            fix_hostname
 1.14  1556887.00   24122.00                            ap_http_filter
 1.09  1580049.00   23162.00                            ap_recent_rfc822_date
 1.07  1602729.00   22680.00                            ap_run_create_request
 1.03  1624644.00   21915.00                            ap_update_child_status
 0.98  1645365.00   20721.00                            read_request_line
 0.97  1666036.00   20671.00                            core_create_req
 0.96  1686333.00   20297.00                            ap_content_length_filter
 0.85  1704396.00   18063.00                            ap_make_method_list
 0.73  1719824.00   15428.00                            ap_set_keepalive
 0.71  1734882.00   15058.00                            ap_set_content_type
 0.70  1749708.00   14826.00                            worker_stack_wait
 0.70  1764497.00   14789.00                            ap_get_remote_host
 0.69  1779171.00   14674.00                            worker_stack_awaken_next
 0.68  1793672.00   14501.00                            ap_save_brigade
 0.64  1807320.00   13648.00                            perform_idle_server_maintenance
 0.60  1820156.00   12836.00                            ap_discard_request_body
 0.59  1832632.00   12476.00                            send_all_header_fields
 0.58  1844914.00   12282.00                            ap_method_number_of
 0.56  1856835.00   11921.00                            ap_run_quick_handler
 0.55  1868577.00   11742.00                            cached_explode
 0.54  1880120.00   11543.00                            lookup_builtin_method
 0.54  1891652.00   11532.00                            ap_add_input_filter_handle
 0.54  1903038.00   11386.00                            ap_get_server_name
 0.51  1913943.00   10905.00                            ap_update_vhost_from_headers

Here's some of the annotated source for core_input_filter:

                :static int core_input_filter(ap_filter_t *f, apr_bucket_brigade *b,
                :                             ap_input_mode_t mode, apr_read_type_e block,
                :                             apr_off_t readbytes)
 42610  0.5149  :{ /* core_input_filter total: 154306 1.8647 */
                :    /* ### This is bad. */
 16586  0.2004  :    BRIGADE_NORMALIZE(ctx->b);
 65812  0.7953  :    if (APR_STATUS_IS_EAGAIN(rv)) {
                :        rv = APR_SUCCESS;
                :    }

--
Brian Akins
Senior Systems Engineer
CNN Internet Technologies
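[For reference, the BRIGADE_NORMALIZE step flagged "### This is bad" walks the entire input brigade on every call, discarding zero-length data buckets. A minimal sketch of that behavior, assuming the 2.0-era bucket API; this is not the literal httpd macro:]

    #include "apr_buckets.h"

    /* Sketch of what BRIGADE_NORMALIZE does in core_input_filter:
     * delete zero-length data buckets so later code never sees empty
     * buckets.  The full O(n) brigade walk on every filter invocation
     * is part of why core_input_filter tops the flat profile above. */
    static void brigade_normalize_sketch(apr_bucket_brigade *b)
    {
        apr_bucket *e = APR_BRIGADE_FIRST(b);

        while (e != APR_BRIGADE_SENTINEL(b)) {
            apr_bucket *next = APR_BUCKET_NEXT(e);  /* save before deleting */
            if (e->length == 0 && !APR_BUCKET_IS_METADATA(e)) {
                apr_bucket_delete(e);               /* unlink and destroy */
            }
            e = next;
        }
    }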
Re: Some Oprofile results
Brian Akins wrote:
> I'm writing an optimized caching module. I've been using 2.0.50. Here are the "top 50" from oprofile ( http://oprofile.sourceforge.net/ ):
> Linux cnnsquid2 2.6.7-cnn.1smp #1 SMP Wed Jun 16 13:41:14 EDT 2004 x86_64 x86_64 x86_64 GNU/Linux
> Using the leader MPM. I patched apr_atomics so that atomics work on x86_64. The server is serving ~27k requests per second.
> Are there optimizations to apr_palloc in 2.1? This seems to be a good place to optimize.

There's not much left to optimize in apr_palloc itself. In the common case, the execution path consists of a single conditional check and a bit of pointer addition. But reducing the number of calls to apr_palloc could yield an improvement.

Do you happen to have a way to enable call-graph profiling on your test system, to find out what the top call chains are for the top 50 functions? (My understanding is that recent releases of oprofile support this with an optional patch.)

The large amount of time spent in memcpy could also be a good candidate for optimization.

Brian
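[To make "a single conditional check and a bit of pointer addition" concrete, here is a simplified sketch of apr_palloc's fast path. It is not the actual APR source: the structs are pared-down stand-ins for APR's internals, and palloc_slow is a hypothetical helper standing in for the real refill logic.]

    #include <stddef.h>

    /* Pared-down stand-ins for APR's pool internals: a pool points at
     * an 'active' memory node whose unused space is bounded by
     * first_avail and endp (as in apr_memnode_t). */
    typedef struct memnode {
        char *first_avail;     /* start of the node's unused space */
        char *endp;            /* one past the end of the node */
    } memnode;

    typedef struct pool {
        memnode *active;       /* node that allocations are served from */
    } pool;

    #define ALIGN_DEFAULT(s) (((s) + 7) & ~(size_t)7)  /* 8-byte alignment */

    void *palloc_slow(pool *p, size_t size);  /* refill path, elided */

    void *palloc_sketch(pool *p, size_t size)
    {
        memnode *node = p->active;

        size = ALIGN_DEFAULT(size);

        /* Common case: one bounds check plus a pointer bump. */
        if (size <= (size_t)(node->endp - node->first_avail)) {
            void *mem = node->first_avail;
            node->first_avail += size;
            return mem;
        }

        /* Rare case: grab a fresh node from the allocator. */
        return palloc_slow(p, size);
    }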
Re: Some Oprofile results
Brian Akins wrote:
> Bill Stoddard wrote:
>> I'd certainly be interested in knowing how much faster mod_url_cache is as compared to mod_cache/mod_mem_cache. The only way I can see to -dramatically- improve the performance of mod_cache/mod_mem_cache is to completely bypass i/o filters. I have some patches to mod_mem_cache lying around that give perhaps a 10% performance improvement by prebuilding headers and aligning them on cache boundaries, but 10% is not really that great.
>
> Like I said, it can fill up two Gig pipes on a dual Opteron with ~25% idle time. mod_cache can't fill up one. If you run in quick_handler, the only way you get filters (other than core_output) is if you add them yourself. Guess if I did or not :) If the headers are stored and "thawed" in an efficient manner (both processor and memory), they don't affect the performance much. Also, sometimes mmap is slower than just plain old read. Also, using a filesystem and sendfile is much faster than serving from RAM, plus multiple processes can "share" the same cache. mod_cache doesn't do that. All of these are based on my observations of our environment.

Thanks, that's what I wanted to know. I can pretty much guess the implementation (I've done similar stuff before). mod_mem_cache takes a middle-of-the-road approach; performance is sacrificed to provide some ability to filter i/o. However, it also makes sacrifices in functionality (like access/auth checking) by using the quick_handler to serve pages. Too many compromises for the 'i want full functionality' folks and too many compromises for the 'i want best raw performance' folks. mod_mem_cache tries to be all things to all masters but ends up falling short by all accounts.

I've got two more patches lying around that I may dust off if I get any free time :-) The first introduces a config directive that enables moving the mod_mem_cache handler to the traditional handler phase (to pick up full access/auth checking and full filter capabilities). The second patch (I call it my 'go fast patch') does some experiments with prebuilding/caching the response headers and bypassing output filtering (in concept not unlike your mod_url_cache, I expect).

Bill
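[A sketch of the 'go fast' / prebuilt-headers idea in code: serialize the status line and header block once at cache-store time, then hand immortal buckets straight to the connection-level filters, skipping the per-request filter stack. cached_entry_t and serve_cached are invented names for illustration; the brigade calls are the standard 2.0 APIs.]

    #include "httpd.h"
    #include "util_filter.h"
    #include "apr_buckets.h"

    /* Hypothetical cache entry: the full header block is built once
     * when the entry is stored, not on every request. */
    typedef struct {
        const char *hdr_blob;   /* "HTTP/1.1 200 OK\r\n...\r\n\r\n" */
        apr_size_t  hdr_len;
        const char *body;
        apr_size_t  body_len;
    } cached_entry_t;

    static apr_status_t serve_cached(request_rec *r, cached_entry_t *ent)
    {
        apr_bucket_brigade *bb = apr_brigade_create(r->pool,
                                    r->connection->bucket_alloc);
        apr_bucket *b;

        /* Prebuilt headers as an immortal bucket: no apr_pstrcat or
         * apr_vformatter work per request.  (Immortal buckets assume
         * the cached blobs outlive the request.) */
        b = apr_bucket_immortal_create(ent->hdr_blob, ent->hdr_len,
                                       bb->bucket_alloc);
        APR_BRIGADE_INSERT_TAIL(bb, b);

        b = apr_bucket_immortal_create(ent->body, ent->body_len,
                                       bb->bucket_alloc);
        APR_BRIGADE_INSERT_TAIL(bb, b);

        b = apr_bucket_eos_create(bb->bucket_alloc);
        APR_BRIGADE_INSERT_TAIL(bb, b);

        /* Pass to the connection-level filters directly, bypassing the
         * request-level filter stack (the 'go fast' part). */
        return ap_pass_brigade(r->connection->output_filters, bb);
    }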
Re: Some Oprofile results
Bill Stoddard wrote:
> I'd certainly be interested in knowing how much faster mod_url_cache is as compared to mod_cache/mod_mem_cache. The only way I can see to -dramatically- improve the performance of mod_cache/mod_mem_cache is to completely bypass i/o filters. I have some patches to mod_mem_cache lying around that give perhaps a 10% performance improvement by prebuilding headers and aligning them on cache boundaries, but 10% is not really that great.

Like I said, it can fill up two Gig pipes on a dual Opteron with ~25% idle time. mod_cache can't fill up one. If you run in quick_handler, the only way you get filters (other than core_output) is if you add them yourself. Guess if I did or not :) If the headers are stored and "thawed" in an efficient manner (both processor and memory), they don't affect the performance much. Also, sometimes mmap is slower than just plain old read. Also, using a filesystem and sendfile is much faster than serving from RAM, plus multiple processes can "share" the same cache. mod_cache doesn't do that. All of these are based on my observations of our environment.

--
Brian Akins
Senior Systems Engineer
CNN Internet Technologies
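[A minimal sketch of the quick_handler wiring being described, using the httpd 2.0 hook API. url_cache_quick_handler and the cache lookup are hypothetical names; the hook signature is the real one.]

    #include "httpd.h"
    #include "http_config.h"

    /* The quick handler runs before URI translation, access/auth
     * checks, and normal filter insertion, so only core_output_filter
     * is in the chain unless the module inserts filters itself. */
    static int url_cache_quick_handler(request_rec *r, int lookup_uri)
    {
        if (lookup_uri) {
            return DECLINED;   /* sub-request lookup, nothing to serve */
        }

        /* ... look up r->unparsed_uri in the cache; on a hit, send the
         * prebuilt response and return OK ... (lookup elided) */

        return DECLINED;       /* miss: fall through to normal phases */
    }

    static void url_cache_register_hooks(apr_pool_t *p)
    {
        /* Run early so cache hits skip the rest of request processing. */
        ap_hook_quick_handler(url_cache_quick_handler, NULL, NULL,
                              APR_HOOK_FIRST);
    }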
Re: Some Oprofile results
Brian Akins wrote:
> Jean-Jacques Clar wrote:
>> Would you please share with us some info on that optimized caching module?
>
> I'll say what I can. I'm hoping I can GPL it.
>
>> Is mod_url_cache a modified copy of the current ones?
>
> No. It borrows some ideas from various other modules, including mod_cache and mod_file_cache.
>
>> Why are the current ones not fulfilling your needs?
>
> In our situation, we know a lot about the data that is being cached, so we can be more aggressive with caching. mod_cache cannot cache URLs that end in /, so it would not help us. Also, mod_cache (either disk or mem) was too slow for us. mod_url_cache (I couldn't come up with a better name) can fill up two Gigabit interfaces on our Opterons and still have ~25% CPU idle time.

I'd certainly be interested in knowing how much faster mod_url_cache is as compared to mod_cache/mod_mem_cache. The only way I can see to -dramatically- improve the performance of mod_cache/mod_mem_cache is to completely bypass i/o filters. I have some patches to mod_mem_cache lying around that give perhaps a 10% performance improvement by prebuilding headers and aligning them on cache boundaries, but 10% is not really that great.

Bill
Re: Some Oprofile results
Jean-Jacques Clar wrote:
> Would you please share with us some info on that optimized caching module?

I'll say what I can. I'm hoping I can GPL it.

> Is mod_url_cache a modified copy of the current ones?

No. It borrows some ideas from various other modules, including mod_cache and mod_file_cache.

> Why are the current ones not fulfilling your needs?

In our situation, we know a lot about the data that is being cached, so we can be more aggressive with caching. mod_cache cannot cache URLs that end in /, so it would not help us. Also, mod_cache (either disk or mem) was too slow for us. mod_url_cache (I couldn't come up with a better name) can fill up two Gigabit interfaces on our Opterons and still have ~25% CPU idle time.

--
Brian Akins
Senior Systems Engineer
CNN Internet Technologies
Re: Some Oprofile results
Bill Stoddard wrote:
> I submitted a patch to rework ap_rgetline/core_input_filter a year or so ago that provided some performance improvements. If you are inclined, it might be interesting to dig that patch out and see how it profiles.

Sure. Repost it and/or send it to me.

--
Brian Akins
Senior Systems Engineer
CNN Internet Technologies
Re: Some Oprofile results
> I'm writing an optimized caching module.

Would you please share with us some info on that optimized caching module? Is mod_url_cache a modified copy of the current ones? Why are the current ones not fulfilling your needs?

Thanks,
JJ
Re: Some Oprofile results
Brian Akins wrote:
> I'm writing an optimized caching module. I've been using 2.0.50. Here are the "top 50" from oprofile ( http://oprofile.sourceforge.net/ ):

I submitted a patch to rework ap_rgetline/core_input_filter a year or so ago that provided some performance improvements. If you are inclined, it might be interesting to dig that patch out and see how it profiles.

Bill
Some Oprofile results
I'm writing an optimized caching module. I've been using 2.0.50. Here are the "top 50" from oprofile ( http://oprofile.sourceforge.net/ ):

Linux cnnsquid2 2.6.7-cnn.1smp #1 SMP Wed Jun 16 13:41:14 EDT 2004 x86_64 x86_64 x86_64 GNU/Linux

Using the leader MPM. I patched apr_atomics so that atomics work on x86_64. The server is serving ~27k requests per second.

Are there optimizations to apr_palloc in 2.1? This seems to be a good place to optimize.

CPU: AMD64 processors, speed 2193.21 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 10

samples  %       image name             symbol name
309339   6.5157  libc-2.3.3.so          __GI_memcpy
253711   5.3440  libc-2.3.3.so          __GI_strlen
135260   2.8490  libc-2.3.3.so          __GI___strcasecmp
112265   2.3647  libapr-0.so.0.9.5      apr_palloc
89160    1.8780  libc-2.3.3.so          __GI_memset
85604    1.8031  httpd                  core_input_filter
84395    1.7776  libaprutil-0.so.0.9.5  apr_brigade_create
83813    1.7654  libapr-0.so.0.9.5      apr_table_setn
82846    1.7450  httpd                  match
82736    1.7427  libapr-0.so.0.9.5      apr_table_get
81288    1.7122  httpd                  pcre_exec
81011    1.7064  libaprutil-0.so.0.9.5  apr_brigade_writev
71233    1.5004  httpd                  net_time_filter
70346    1.4817  httpd                  core_output_filter
68436    1.4415  mod_setenvif.so        match_headers
65654    1.3829  libapr-0.so.0.9.5      apr_vformatter
61061    1.2861  libc-2.3.3.so          __GI_memchr
58134    1.2245  httpd                  ap_rgetline_core
57236    1.2056  httpd                  ap_getword_white
57038    1.2014  libaprutil-0.so.0.9.5  apr_brigade_split_line
55767    1.1746  libapr-0.so.0.9.5      apr_pool_cleanup_register
52988    1.1161  mod_log_config.so      config_log_transaction
48604    1.0238  httpd                  ap_escape_logitem
46163    0.9723  libapr-0.so.0.9.5      allocator_alloc
45412    0.9565  libaprutil-0.so.0.9.5  apr_brigade_cleanup
45172    0.9515  libpthread-0.61.so     __pthread_mutex_lock_internal
41772    0.8799  mod_url_cache.so       url_cache_handler
40048    0.8435  libaprutil-0.so.0.9.5  MD5Transform
39382    0.8295  httpd                  ap_read_request
39253    0.8268  libapr-0.so.0.9.5      apr_socket_timeout_set
38626    0.8136  libpthread-0.61.so     __libc_read
38383    0.8085  libaprutil-0.so.0.9.5  apr_uri_parse
38003    0.8005  httpd                  add_any_filter_handle
36614    0.7712  libapr-0.so.0.9.5      apr_poll
36360    0.7659  libaprutil-0.so.0.9.5  apr_bucket_heap_make
35417    0.7460  libapr-0.so.0.9.5      apr_socket_recv
34016    0.7165  mod_url_cache.so       parse_cache_headers
32330    0.6810  libaprutil-0.so.0.9.5  apr_bucket_heap_create
31767    0.6691  httpd                  ap_http_header_filter
31348    0.6603  httpd                  ap_process_http_connection
29833    0.6284  libapr-0.so.0.9.5      apr_table_addn
29288    0.6169  libaprutil-0.so.0.9.5  heap_bucket_read
29273    0.6166  httpd                  anonymous symbol from section .plt
29122    0.6134  libaprutil-0.so.0.9.5  socket_bucket_read
29038    0.6116  libaprutil-0.so.0.9.5  apr_bucket_free
28632    0.6031  httpd                  ap_get_brigade
27897    0.5876  libapr-0.so.0.9.5      apr_pstrcat
27196    0.5728  libaprutil-0.so.0.9.5  heap_bucket_destroy
26658    0.5615  libaprutil-0.so.0.9.5  file_bucket_read

--
Brian Akins
Senior Systems Engineer
CNN Internet Technologies
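[On the apr_atomics note: a sketch of the kind of x86-64 compare-and-swap such a patch would add. This is illustrative, not the actual patch; 'lock cmpxchg' behaves the same for 32-bit operands on x86 and x86-64.]

    #include "apr.h"

    /* Illustrative x86-64 compare-and-swap of the sort an apr_atomic
     * patch would need; not the actual patch.  Returns the old value
     * of *mem, which equals 'cmp' if and only if the swap happened. */
    static APR_INLINE apr_uint32_t atomic_cas32_sketch(volatile apr_uint32_t *mem,
                                                       apr_uint32_t with,
                                                       apr_uint32_t cmp)
    {
        apr_uint32_t prev;
        __asm__ __volatile__("lock; cmpxchgl %2, %1"
                             : "=a" (prev), "+m" (*mem)  /* old value in EAX */
                             : "r" (with), "0" (cmp)
                             : "memory");
        return prev;
    }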