serf was Re: mod_cache wishlist

2005-08-24 Thread Justin Erenkrantz

--On August 24, 2005 4:21:38 PM +0100 Nick Kew <[EMAIL PROTECTED]> wrote:


So, is it time we introduced a general-purpose apr_http_client?
I'd be prepared to offer my code as a starting point, but I'd rather
not take the driving seat for further development and documentation.


We already went down this road with serf: it eventually got booted by 
the APR PMC, and the HTTP Server PMC refused to consider adding an HTTP 
client library.  ("We don't do clients.  Go away.")


Serf, which uses APR, lives on here:

SVN: 

serf-dev lists: 

Serf does async, SSL, deflate (gzip *and* deflate!), pipelining, etc.  It's 
licensed under ALv2.  For those coming from httpd-land, a lot of it will 
make sense.  It uses apr_pollset, so it gets KEvent/epoll for free.


FWIW, we created the serf-dev mailing list only after Greg and I got tired 
of trading private emails; alas, neither of us has had mailing-list 
discussions since we created it.  -- justin


Re: mod_cache wishlist

2005-08-24 Thread Brian Akins

Paul Querna wrote:



FWIW, I will be looking at adding support for EPoll and/or KQueue to the
curl_multi_* interface sometime soonish for work.



On the curl-dev list, I suggested just using libevent 
(http://www.monkey.org/~provos/libevent/), because it already 
encapsulates all that.


--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies


Re: mod_cache wishlist

2005-08-24 Thread Paul Querna
Brian Akins wrote:
> Nick Kew wrote:
> 
>> Alternatively, maybe someone could post an executive summary of the
>> problems and benefits of standardising on libcurl?
> 
> 
> I'm pretty familiar with libcurl.  Great library.  Here are some issues
> I have had:
> 
> - asynchronous uses select only.

FWIW, I will be looking at adding support for EPoll and/or KQueue to the
curl_multi_* interface sometime soonish for work.

-Paul


Re: mod_cache wishlist

2005-08-24 Thread Parin Shah
On 8/24/05, Colm MacCarthaigh <[EMAIL PROTECTED]> wrote:
> On Wed, Aug 24, 2005 at 09:18:54AM -0500, Parin Shah wrote:
> > > > I have fixed that memory leak problem. also added script to include
> > > > libcurl whenever this module is included.
> > >
> > > I hope that it doesn't mean that libcurl is going to be a permanent
> > > solution, when subrequests (with minor changes) could serve the same
> > > purpose.
> >
> > Certainly not, We would have mod-c-requester which uses sub-requests
> > and not libcurl eventually. but my initial reaction after going
> > through the subrequest code was that it may require significant
> > refactoring.
> 
> Will it work if you mark the subrequests as proxy requests? IE, the same
> approach as mod_rewrite and the P flag.
> 
We had considered proxy requests before, but many of us felt it's not a
good idea to use them for re-requests. The main reason was that a
proxy request creates a new connection to the main server.

I have not checked the mod_rewrite code yet, so I will go through it;
if it seems to solve our problem, that would be great.

Thanks for your input.

> It introduces a dependency on mod_proxy, but curl introduces a
> dependency on libcurl, so that's not so bad.

You are right. I believe that shouldn't be a problem.


Re: mod_cache wishlist

2005-08-24 Thread Brian Akins

Nick Kew wrote:


Alternatively, maybe someone could post an executive summary of the
problems and benefits of standardising on libcurl?


I'm pretty familiar with libcurl.  Great library.  Here are some issues 
I have had:


- asynchronous uses select only.
- random crashes with openssl, normal bug stuff.
- POST'ing is somewhat of a mystery.  It works, but you have to tinker 
with it a lot.



--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies


Re: mod_cache wishlist

2005-08-24 Thread Nick Kew

Colm MacCarthaigh wrote:

On Wed, Aug 24, 2005 at 09:18:54AM -0500, Parin Shah wrote:


I have fixed that memory leak problem. also added script to include
libcurl whenever this module is included.


I hope that it doesn't mean that libcurl is going to be a permanent
solution, when subrequests (with minor changes) could serve the same
purpose.


Certainly not. We would eventually have mod-c-requester, which uses
sub-requests and not libcurl. But my initial reaction after going
through the subrequest code was that it may require significant
refactoring.



Will it work if you mark the subrequests as proxy requests? I.e., the same
approach as mod_rewrite and the P flag.

It introduces a dependency on mod_proxy, but curl introduces a
dependency on libcurl, so that's not so bad.


I've been through this loop a few times, and not really found a
satisfactory solution.  Most recently for mod_publisher, which includes
what could perhaps become an apr_http_client module.  But that's not
a path I want to embark on in isolation, when we have proxy and
subrequest code already there.

When I wrote that, I did look at the existing code, but reusing it
seemed more trouble than it was worth (this was before last year's
major refactoring of the proxy, FWIW).  AFAICT none of the existing
modules exports an API for it.

My code is basically HTTP/1.1 with keepalives and connection pooling,
but no asynchronous operation.  It's also DIY-with-fudges, with bugs
such as not supporting HTTP 3xx responses that aren't redirects.
I think libcurl offers much richer capabilities, doesn't it?

So, is it time we introduced a general-purpose apr_http_client?
I'd be prepared to offer my code as a starting point, but I'd rather
not take the driving seat for further development and documentation.

Alternatively, maybe someone could post an executive summary of the
problems and benefits of standardising on libcurl?

--
Nick Kew
#ifndef APX_HTTP
#define APX_HTTP

#include <stddef.h>

/* APR headers inferred from the types used below; the original
 * include names were stripped by the list archiver. */
#include <apr_pools.h>
#include <apr_tables.h>
#include <apr_hash.h>
#include <apr_uri.h>
#include <apr_network_io.h>
#include <apr_buckets.h>
#include <apr_errno.h>

typedef struct apx_http_entity apx_http_entity ;
typedef struct apx_http_request apx_http_request ;
typedef struct apx_http_response apx_http_response ;
typedef struct apx_http_connection apx_http_connection ;
typedef struct apx_http_connection_pool apx_http_connection_pool ;

struct apx_http_entity {
  const char* type ;
  size_t length ;
  void* data ;
} ;
typedef enum {
	HTTP_NONE ,
	HTTP_READY ,
	HTTP_SENDING_FIXED_DATA ,
	HTTP_SENDING_CHUNKED_DATA ,
	HTTP_REQUEST_SENT ,
	HTTP_READING_DATA ,
	HTTP_READING_FIXED_DATA ,
	HTTP_READING_CHUNKED_DATA ,
	HTTP_ERROR
} apx_conn_state_t ;

#define BUFLEN 4096
struct apx_http_connection {
  const char* host ;
  int port ;
  apx_conn_state_t state ;
  int is_proxy ;
  apr_socket_t* sock ;
  unsigned int timeout ;
  size_t length ;
  size_t bytes ;
  size_t header_bytes ;
  size_t clen ;
  char buf[BUFLEN] ;
  char savebuf[16] ;
} ;
struct apx_http_connection_pool {
  const char* proxy_host ;
  int proxy_port ;
  union {
apx_http_connection* proxy_conn ;
apr_hash_t* connections ;
  } conn ;
} ;
struct apx_http_request {
  const char* method ;
  apr_uri_t uri ;
  apr_table_t* headers ;
  int sent ;
  const char* errmsg ;
  apx_http_entity* contents ;
} ;
struct apx_http_response {
  int status ;
  const char* reason ;
  apr_table_t* headers ;
};

int apx_uri_resolve_relative(apr_pool_t *p,
	const apr_uri_t *base, apr_uri_t *uptr);

int apx_uri_parse_relative(apr_pool_t *p,
	const apr_uri_t *base, const char* uri, apr_uri_t* uptr);

const char* apx_http_new_request(apr_pool_t* pool, apx_http_request* req,
const char* method, const char* url, const apr_uri_t* base) ;
apx_http_connection* apx_http_make_connection(apr_pool_t* pool,
apx_http_connection_pool* connpool, apx_http_request* req) ;
void apx_http_connection_set_timeout(apx_http_connection* conn, unsigned int timeout) ;
void apx_http_send(apx_http_connection* conn, const char* str) ;
const char* apx_http_send_request(apr_pool_t* pool, apx_http_request* req,
apx_http_connection* conn) ;
const char* apx_http_get_response(apr_pool_t* pool, apx_http_connection* conn,
apx_http_response* resp) ;
apx_http_connection_pool* apx_http_new_connection_pool(apr_pool_t* pool,
	const char* host, int port) ;
apr_status_t apx_http_close_connection(void* conn) ;
apr_bucket* apx_http_get_data(apr_pool_t* pool, apr_bucket_alloc_t* alloc,
	apx_http_connection* conn) ;
const char* apx_http_do(apr_pool_t* pool, apx_http_connection_pool* cpool,
const char* method, const char* url, const apr_uri_t* base,
apx_http_connection** connp, apx_http_response* resp,
apr_bucket_alloc_t* alloc, int depth) ;

#endif


Re: mod_cache wishlist

2005-08-24 Thread Colm MacCarthaigh
On Wed, Aug 24, 2005 at 09:18:54AM -0500, Parin Shah wrote:
> > > I have fixed that memory leak problem. also added script to include
> > > libcurl whenever this module is included.
> > 
> > I hope that it doesn't mean that libcurl is going to be a permanent
> > solution, when subrequests (with minor changes) could serve the same
> > purpose.
> 
> Certainly not, We would have mod-c-requester which uses sub-requests
> and not libcurl eventually. but my initial reaction after going
> through the subrequest code was that it may require significant
> refactoring.

Will it work if you mark the subrequests as proxy requests? I.e., the same
approach as mod_rewrite and the P flag.

It introduces a dependency on mod_proxy, but curl introduces a
dependency on libcurl, so that's not so bad.
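A hypothetical sketch of what this suggestion looks like in practice, using mod_rewrite's real [P] flag to hand the request to mod_proxy (the path and backend address are placeholders):

```apache
# Placeholder example: any request under /refresh/ is internally
# proxied, the same way mod_rewrite's [P] flag works today.
RewriteEngine On
RewriteRule ^/refresh/(.*)$ http://127.0.0.1/$1 [P]
```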

-- 
Colm MacCárthaigh    Public Key: [EMAIL PROTECTED]


Re: mod_cache wishlist

2005-08-24 Thread Parin Shah
> > I have fixed that memory leak problem. also added script to include
> > libcurl whenever this module is included.
> 
> I hope that it doesn't mean that libcurl is going to be a permanent
> solution, when subrequests (with minor changes) could serve the same
> purpose.

Certainly not. We would eventually have mod-c-requester, which uses
sub-requests and not libcurl. But my initial reaction after going
through the subrequest code was that it may require significant
refactoring.

> 
> BTW: if subrequests are refactored, isn't it better to move them to
> APR/APR-UTIL?
> 

Not too sure about this. Let's wait for other people's opinions.

> In any case, thanks for the great contribution!
> --
> Eli Marmor
> [EMAIL PROTECTED]
> Netmask (El-Mar) Internet Technologies Ltd.
> __
> Tel.:   +972-9-766-1020  8 Yad-Harutzim St.
> Fax.:   +972-9-766-1314  P.O.B. 7004
> Mobile: +972-50-5237338  Kfar-Saba 44641, Israel
>


Re: mod_cache wishlist

2005-08-23 Thread Eli Marmor
Parin Shah wrote:

> I have fixed that memory leak problem. also added script to include
> libcurl whenever this module is included.

I hope that it doesn't mean that libcurl is going to be a permanent
solution, when subrequests (with minor changes) could serve the same
purpose.

BTW: if subrequests are refactored, isn't it better to move them to
APR/APR-UTIL?

In any case, thanks for the great contribution!
-- 
Eli Marmor
[EMAIL PROTECTED]
Netmask (El-Mar) Internet Technologies Ltd.
__
Tel.:   +972-9-766-1020  8 Yad-Harutzim St.
Fax.:   +972-9-766-1314  P.O.B. 7004
Mobile: +972-50-5237338  Kfar-Saba 44641, Israel


Re: mod_cache wishlist

2005-08-23 Thread Parin Shah
Hi,

I have fixed that memory leak problem, and also added a script to include
libcurl whenever this module is included.

http://utdallas.edu/~parinshah/mod-c-requester.0.3.tar.gz

Thanks,
Parin.

On 8/23/05, Parin Shah <[EMAIL PROTECTED]> wrote:
> > > ohh, I thought I was taking care of it. I mean, code frees the memory
> > > when no longer needed except during the shutdown of server. anyway I
> > > will go through the code again to check that. Also feel free to point
> > > out the code which is causing memory leak problem.
> >
> > I'll look through it as well.  Big thing I noticed was in regards to curl.
> >
> > for every call to curl_easy_init() you need a call to curl_easy_cleanup()
> >
> > Also, you must call curl_slist_free_all() to free the list
> >
> quite right. I will go ahead and fix this.
> 
> > Also, you must call curl_slist_free_all() to free the list
> >
> >
> > If you want to use libcurl, you may want to use a reslist of curl
> > handles.  curl can do all the keepalive stuff and you would avoid the
> > overhead of constantly creating deleting curls.  Just call
> > curl_easy_reset before giving it back to reslist.
> >
> Good point. This would improve performance for sure. Thanks for the 
> suggestion.
> 
> -Parin.
>




Re: mod_cache wishlist

2005-08-23 Thread Parin Shah
> > ohh, I thought I was taking care of it. I mean, code frees the memory
> > when no longer needed except during the shutdown of server. anyway I
> > will go through the code again to check that. Also feel free to point
> > out the code which is causing memory leak problem.
> 
> I'll look through it as well.  Big thing I noticed was in regards to curl.
> 
> for every call to curl_easy_init() you need a call to curl_easy_cleanup()
> 
Quite right. I will go ahead and fix this.

> Also, you must call curl_slist_free_all() to free the list
> 
> 
> If you want to use libcurl, you may want to use a reslist of curl
> handles.  curl can do all the keepalive stuff and you would avoid the
> overhead of constantly creating deleting curls.  Just call
> curl_easy_reset before giving it back to reslist.
> 
Good point. This would improve performance for sure. Thanks for the suggestion.

-Parin.


Re: mod_cache wishlist

2005-08-23 Thread Brian Akins

Joshua Slive wrote:

It looks like you want an extreme level of flexibility for making 
caching decisions based on characteristics of the request.  Why not 
piggy-back on mod_rewrite, which already has an absurdly complex 
matching capability.


As in
RewriteCond %{REMOTE_ADDR} =127.0.0.1
RewriteRule /path.* - [cacheenable=disk]



Good point. If the per-dir stuff gets folded in, this is probably a far 
more flexible and elegant way.



I didn't want to use SetEnvIf, because there's no concept of 'and' or 
'last'.  mod_rewrite solves both of those.



Of course, in my endless search for flexibility, it would be cool if it 
were easy to add more %{VARIABLES} to mod_rewrite...



Who is currently working on the per-dir mod_cache stuff?  I am willing 
to help.




--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies


Re: mod_cache wishlist

2005-08-23 Thread Joshua Slive


Brian Akins wrote:


Some pseudo match configs and code:


It looks like you want an extreme level of flexibility for making 
caching decisions based on characteristics of the request.  Why not 
piggy-back on mod_rewrite, which already has an absurdly complex 
matching capability.


As in
RewriteCond %{REMOTE_ADDR} =127.0.0.1
RewriteRule /path.* - [cacheenable=disk]

(Of course, the simpler directives would be kept for people who don't 
need the added flexibility.)


Joshua.


Re: mod_cache wishlist

2005-08-23 Thread Brian Akins


Some pseudo match configs and code:

Just examples, maybe not useful or doable.



#only cache things in /stuff when request comes from localhost
CacheEnable disk client=127.0.0.1 path=/stuff

#disable caching for a special host
CacheDisable client=10.1.1.10

#don't cache any ssl stuff
CacheDisable protocol=https

#be evil and cache POSTs
CacheEnable disk method=POST

#cache everything that ends in .gif
CacheEnable disk path=\.gif$

#or maybe regex=\.gif$


#Only cache php from test hosts
CacheEnable mem path=\.php$ client=10.0.0.0/255.0.0.0


#also have not's
CacheEnable disk path!=\.gif$

# A twist on Colm's suggestion
#don't cache if env NOCACHE is set
CacheDisable env=NOCACHE


modules could register init and match functions:

typedef const char *ap_cache_match_init(apr_pool_t *p, server_rec *s, 
const char *arg, void **ptr);


typedef int ap_cache_match_func(request_rec *r, const void *arg);


APR_DECLARE_OPTIONAL_FN(apr_status_t, ap_cache_register_match,
(const char *name, ap_cache_match_init *init,
 ap_cache_match_func *func));

Example match functions:

static const char *regex_init(apr_pool_t *p, server_rec *s, const char *arg, 
void **ptr) {


if((*ptr = ap_pregcomp(p, arg, REG_EXTENDED)) == NULL) {
return apr_psprintf(p,
"regex compilation failed for %s",
arg);
}
return NULL;
}

static int regex_func(request_rec *r, const void *arg) {
regmatch_t m[MAX_REG_MATCH];

return (!regexec((const regex_t *)arg, r->uri, MAX_REG_MATCH, m, 0));
}


/* then in pre_config, maybe register it */
static APR_OPTIONAL_FN_TYPE(ap_cache_register_match) *pfn_register;

pfn_register = APR_RETRIEVE_OPTIONAL_FN(ap_cache_register_match);

pfn_register("regex", regex_init, regex_func);
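The regex match function above boils down to standard POSIX regcomp()/regexec(); a self-contained sketch of the same matching logic, with uri_matches as an illustrative stand-in for the httpd-specific pieces (request_rec, ap_pregcomp):

```c
#include <regex.h>

/* Mirrors httpd's AP_MAX_REG_MATCH */
#define MAX_REG_MATCH 10

/* Compile a pattern and test one URI against it.
 * Returns 1 on match, 0 on no match or compile error. */
static int uri_matches(const char *pattern, const char *uri)
{
    regex_t re;
    regmatch_t m[MAX_REG_MATCH];
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;
    rc = regexec(&re, uri, MAX_REG_MATCH, m, 0);
    regfree(&re);
    return rc == 0;
}
```

In the real module the pattern would be compiled once in the init hook (as regex_init does above) rather than per request.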


--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies


Re: mod_cache wishlist

2005-08-23 Thread Bill Stoddard

Brian Akins wrote:

Bill Stoddard wrote:

I've not looked at your code so I can't make specific recommendations. 
Just remember, memory allocated with any of the apr_pool functions is 
freed only when that pool is reclaimed (end of a request for a request 
pool, shutdown of the server for pconf, etc.). mod_mem_cache uses the 
'cache_hash' functions, a 'de-poolized' apr_hash, for this exact reason.




Or, you could use lots of sub-pools...


I didn't use subpools in mod_mem_cache to minimize the storage overhead, but your point is well taken.  There 
are other ways to fix the problem, and subpools may be a better tradeoff.


Bill



Re: mod_cache wishlist

2005-08-23 Thread Brian Akins

Bill Stoddard wrote:

I've not looked at your code so I can't make specific recommendations. 
Just remember, memory allocated with any of the apr_pool functions is 
freed only when that pool is reclaimed (end of a request for a request 
pool, shutdown of the server for pconf, etc.). mod_mem_cache uses the 
'cache_hash' functions, a 'de-poolized' apr_hash, for this exact reason.



Or, you could use lots of sub-pools...




--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies


Re: mod_cache wishlist

2005-08-23 Thread Bill Stoddard

Parin Shah wrote:

Cool. Very good start.  Leaks memory like a sieve, but good start.



Ohh, I thought I was taking care of it. I mean, the code frees the memory
when it's no longer needed, except during server shutdown. Anyway, I
will go through the code again to check that. Also, feel free to point
out the code that is causing the memory leak.


I've not looked at your code so I can't make specific recommendations. Just remember, memory allocated with any 
of the apr_pool functions is freed only when that pool is reclaimed (end of a request for a request pool, 
shutdown of the server for pconf, etc.). mod_mem_cache uses the 'cache_hash' functions, a 'de-poolized' 
apr_hash, for this exact reason.


Bill



Re: mod_cache wishlist

2005-08-23 Thread Brian Akins

Colm MacCarthaigh wrote:


per-dir does not help in quick_handler.



No, but it is useful at the save stage.


True.  This would probably be fine.  I would like to see more flexible 
URL matching, a la squid.  Perhaps a way for modules to register 
their own matching functions.


Example.

CacheDisable 127.0.0.1

and some module registers a "match function" that operates on client IP.






I personally prefer the former, it's just more like the rest of httpd.



I must say that I agree with that.




Would SetEnvIf, plus something like;

CacheNoStoreEnvVar  NOCACHE

do the job? I.e., have a catch-all;

CacheEnable provider /

but then if the environment variable specified is found at store time,
don't store it to the cache?



Hmmm.. that could work..  It would fit into the current httpd way of 
doing things.


On a different note, it would be great to have a larger number of 
containers (Directory, Location, Files, etc.) and an easy way to declare 
more.


I'd like to be able to do things like:

<Foo bar>
...
</Foo>

<Foo baz>
...
</Foo>

etc...





The concept of a manager and so on, I'd definitely use. 



If the interface was in hooks/option functions, it would be very flexible.


--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies


Re: mod_cache wishlist

2005-08-23 Thread Brian Akins

Parin Shah wrote:


Ohh, I thought I was taking care of it. I mean, the code frees the memory
when it's no longer needed, except during server shutdown. Anyway, I
will go through the code again to check that. Also, feel free to point
out the code that is causing the memory leak.


I'll look through it as well.  Big thing I noticed was in regards to curl.

for every call to curl_easy_init() you need a call to curl_easy_cleanup()

Also, you must call curl_slist_free_all() to free the list


If you want to use libcurl, you may want to use a reslist of curl 
handles.  curl can do all the keepalive stuff and you would avoid the 
overhead of constantly creating deleting curls.  Just call 
curl_easy_reset before giving it back to reslist.




--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies


Re: mod_cache wishlist

2005-08-23 Thread Parin Shah
> 
> Cool. Very good start.  Leaks memory like a sieve, but good start.
> 
Ohh, I thought I was taking care of it. I mean, the code frees the memory
when it's no longer needed, except during server shutdown. Anyway, I
will go through the code again to check that. Also, feel free to point
out the code that is causing the memory leak.


> It would be cool if we could find a way to use Apache's subrequest stuff
> rather than curl.  One less dependency you know.  I've also had issues
> with libcurl and ssl randomly coreing.
> 
I would also prefer using subrequests instead of libcurl, but when I
reviewed the make_sub_request code, I couldn't find a way to use it, as
mod-c-requester doesn't have a connection and request available.
I believe we might need some refactoring to make sub-requests work in
such scenarios. So for now I am using libcurl and will continue
investigating other ways.

Thanks,
Parin.


Re: mod_cache wishlist

2005-08-23 Thread Colm MacCarthaigh
On Tue, Aug 23, 2005 at 10:18:37AM -0400, Brian Akins wrote:
> >I've been looking at this, and it's possibly the Syntax that put me off,
> >but it looks painful on the admin, and probably on the server too.
> >There's nothing in those examples that can't be achieved by making the
> >non-CacheEnable cache directives per-dir, which I'd personally prefer
> >(and have working). 
> 
> per-dir does not help in quick_handler.

No, but it is useful at the save stage.

> Which is clearer?
> 
> CacheEnable disk /stuff
> 
>  CacheExpiresDefault 600
>  CacheExpiresMaximum 2600
>  CacheExpiresMinimum 300
> 
> or
> 
> CacheRule provider=disk enable=on path=/stuff default=600 max=2600
> min=300
> 
> Maybe a "philosophical" question.  Perhaps per-dir would solve this
> issue.

I personally prefer the former, it's just more like the rest of httpd.

> >But it's years since I've run squid in production, what kind of
> >complex rules do people use?
> 
> usually there is a catch all rule at the bottom, with lots of
> exceptions above it.  In squid the match can be against a variety of
> things not just the path, like protocol, host, client ip, arbitrary
> header, etc.

Would SetEnvIf, plus something like;

CacheNoStoreEnvVar  NOCACHE

do the job? I.e., have a catch-all;

CacheEnable provider /

but then if the environment variable specified is found at store time,
don't store it to the cache?
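A hypothetical sketch of the proposal (CacheNoStoreEnvVar is the directive being proposed here, not an existing one; SetEnvIf and CacheEnable are real directives):

```apache
# Catch-all cache, but skip storing anything SetEnvIf has flagged.
# CacheNoStoreEnvVar is the proposed directive, not an existing one.
CacheEnable disk /
SetEnvIf Request_URI "\.php$" NOCACHE
CacheNoStoreEnvVar NOCACHE
```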

> with dbd, for example, you could say something like "expire all
> objects whose path begins with /stuff or has .jpg in the path".  The
> overhead is not as high as you would think, you only touch the
> database when somethign is stored or you do a purge/expire.

The concept of a manager and so on, I'd definitely use. 

-- 
Colm MacCárthaigh    Public Key: [EMAIL PROTECTED]


Re: mod_cache wishlist

2005-08-23 Thread Brian Akins

Parin Shah wrote:



I am already working on it. I have also posted an initial version of this module.


http://utdallas.edu/~parinshah/mod-c-requester.0.2.tar.gz

-Parin.


Cool. Very good start.  Leaks memory like a sieve, but good start.

It would be cool if we could find a way to use Apache's subrequest stuff 
rather than curl.  One less dependency, you know.  I've also had issues 
with libcurl and SSL randomly dumping core.




--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies


Re: mod_cache wishlist

2005-08-23 Thread Brian Akins

Colm MacCarthaigh wrote:



To a large extent mod_cache_requester (which from inspection seems to be
much further along than I thought) will solve this problem :)


True.  I still think we need deterministic temp files so that several 
threads are not simultaneously trying to cache the same object.



I've been looking at this, and it's possibly the Syntax that put me off,
but it looks painful on the admin, and probably on the server too.
There's nothing in those examples that can't be achieved by making the
non-CacheEnable cache directives per-dir, which I'd personally prefer
(and have working). 



per-dir does not help in quick_handler.  While the examples I gave (min, 
max, etc.) are written to the cache file, other config options may need 
to be known before we try to serve it.


BTW, the syntax is squid-like.  With great flexibility comes complexity. 
But I can see your point, as we would be using per-dir merges.


Which is clearer?

CacheEnable disk /stuff


CacheExpiresDefault 600
CacheExpiresMaximum 2600
CacheExpiresMinimum 300


or

CacheRule provider=disk enable=on path=/stuff default=600 max=2600 min=300


Maybe a "philosophical" question.  Perhaps per-dir would solve this issue.



But it's years since I've run squid in production, what kind of complex
rules do people use?


Usually there is a catch-all rule at the bottom, with lots of exceptions 
above it.  In squid the match can be against a variety of things, not 
just the path: protocol, host, client IP, arbitrary header, etc.





That's what I had in-mind with htcacheadmin, but a module version does
make a lot of sense, purging a memory cache for example. Though I'm not
sure of the need for a dbd, that's a lot of overhead when an admin
could just specify a URL they want to expire manually.


With dbd, for example, you could say something like "expire all objects 
whose path begins with /stuff or has .jpg in the path".  The overhead is 
not as high as you would think; you only touch the database when 
something is stored or you do a purge/expire.


This, of course, by use of hooks, would be a per-module implementation: 
mod_cache_manager_mysql, mod_cache_manager_yaml, mod_cache_manager_whatever.



--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies


Re: mod_cache wishlist

2005-08-23 Thread Parin Shah
> >
> > Content definitely should not be served from the cache after it has
> > expired imo. However I think an approach like;
> >
> > if((now + interval) > expired) {
> > if(!stat(tmpfile)) {
> >   update_cache_from_backend();
> >   }
> > }
> >
> > ie "revalidate the cache content after N-seconds before it is due to be
> > expired" would have the same effect, but avoid serving stale content.
> 
> To a large extent mod_cache_requester (which from inspection seems to be
> much further along than I thought) will solve this problem :)
> 
I am already working on it. I have also posted an initial version of this module.

http://utdallas.edu/~parinshah/mod-c-requester.0.2.tar.gz

-Parin.


Re: mod_cache wishlist

2005-08-23 Thread Colm MacCarthaigh
On Tue, Aug 23, 2005 at 08:42:48AM -0400, Brian Akins wrote:
> Deterministic temp files to avoid "thundering herd":
> 
> http://marc.theaimsgroup.com/?l=apache-httpd-dev&m=112430743432417&w=2
> 
> Especially Colm's comments:
> 
> Content definitely should not be served from the cache after it has
> expired imo. However I think an approach like;
> 
> if((now + interval) > expired) {
> if(!stat(tmpfile)) {
>   update_cache_from_backend();
>   }
> }
> 
> ie "revalidate the cache content after N-seconds before it is due to be
> expired" would have the same effect, but avoid serving stale content.

To a large extent mod_cache_requester (which from inspection seems to be
much further along than I thought) will solve this problem :)

> CacheVaryOverride:
> 
> http://marc.theaimsgroup.com/?l=apache-httpd-dev&m=112430527822904&w=2
> 
> This necessary in the case of reverse proxies, not that useful for 
> normal proxies.

It is useful though.

> Replace current CacheEnable/Disable with a more squid like approach:
> 
> http://marc.theaimsgroup.com/?l=apache-httpd-dev&m=112447805318053&w=2

I've been looking at this, and it's possibly the Syntax that put me off,
but it looks painful on the admin, and probably on the server too.
There's nothing in those examples that can't be achieved by making the
non-CacheEnable cache directives per-dir, which I'd personally prefer
(and have working). 

But it's years since I've run squid in production, what kind of complex
rules do people use?

> Add a hook that gets called after store_body so that other modules can 
> track cache usage.  The other modules could use the mythical 
> ap_cache_query() I described above to update cache usage in a database, 
> file, dbm, shared memory, etc.  A great use would be if this information 
> was in a database (using mod_dbd, perhaps) so that an admin (or script) 
> could selectively expire and purge cache entries.  This could also lead 
> to a much more efficient version of htcacheclean that did not have to 
> crawl the directory tree.

That's what I had in-mind with htcacheadmin, but a module version does
make a lot of sense, purging a memory cache for example. Though I'm not
sure of the need for a dbd, that's a lot of overhead when an admin
could just specify a URL they want to expire manually.

-- 
Colm MacCárthaigh    Public Key: [EMAIL PROTECTED]


mod_cache wishlist

2005-08-23 Thread Brian Akins
With the talk of "direction" of Apache, I thought I, as an end user and 
developer, would offer my "wish list" for mod_cache.  We have been using 
squid for various things, but are now mostly using Apache plus our own 
custom cache module.  Our module has grown to support most of the "cool" 
features of squid we liked, and many other features.  Unfortunately, I 
can't just post the code :(


I would like to standardize on the "stock" Apache mod_cache for various 
reasons (bug fixes, mind share, etc.)  However, we now have services 
that depend upon the features of our cache module.  I think some of the 
ideas we have implemented are useful to others as well.


I am willing to code if there is hope of some of it being committed.  We 
would really like to use stock mod_cache and are willing to submit 
patches so that it can meet our requirements.


Finally, the stock mod_cache is about 15% slower than our module for 
files around the 10k range.  With larger files, it easily fills a gig 
interface.  However, with very small files (<4k), it slows down 
considerably; I am working to find the source of the slowdown.


Anyway, here's my list:

My wish list for mod_cache.c

---
Deterministic temp files to avoid "thundering herd":

http://marc.theaimsgroup.com/?l=apache-httpd-dev&m=112430743432417&w=2

Especially Colm's comments:

Content definitely should not be served from the cache after it has
expired imo. However I think an approach like;

if ((now + interval) > expired) {
    if (!stat(tmpfile)) {
        update_cache_from_backend();
    }
}

i.e. "revalidate the cache content N seconds before it is due to
expire" would have the same effect, but avoid serving stale content.
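One reading of that check, as a pure function (the names are illustrative, and the stat(tmpfile) test is abstracted into a flag so the logic stands alone):

```c
#include <time.h>

/* Decide whether this worker should refresh a cache entry ahead of
 * its expiry: start revalidating `interval` seconds before expiry,
 * but stand down if another worker's temp file shows a refresh is
 * already in progress (the thundering-herd guard). */
static int should_revalidate(time_t now, time_t expires,
                             time_t interval, int refresh_in_progress)
{
    if (now + interval <= expires)
        return 0;    /* not yet inside the revalidation window */
    if (refresh_in_progress)
        return 0;    /* another worker is already on it */
    return 1;        /* our job: update_cache_from_backend() */
}
```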

---

CacheVaryOverride:

http://marc.theaimsgroup.com/?l=apache-httpd-dev&m=112430527822904&w=2

This is necessary in the case of reverse proxies, though not that useful 
for normal proxies.


---

Replace current CacheEnable/Disable with a more squid like approach:

http://marc.theaimsgroup.com/?l=apache-httpd-dev&m=112447805318053&w=2

---

Configurable cache statuses.  I know, for example, that I want to cache 
404s and 302s in my reverse proxy setup.


---

Ability to query cache objects from other modules.  For example,

apr_size_t len;

ap_cache_query(r, CACHE_SIZE, &len);

would give me the size of the cached object (headers + data), and 
something like:


char **file;
ap_cache_query(r, CACHE_HEADERS_FILE, &file);

would give me the name of the on-disk headers file for 
mod_disk_cache, but NULL for mod_mem_cache.


Interface could be similar to libcurl's curl_easy_getinfo(): 
http://curl.haxx.se/libcurl/c/curl_easy_getinfo.html


---

Add a hook that gets called after store_body so that other modules can 
track cache usage.  The other modules could use the mythical 
ap_cache_query() I described above to update cache usage in a database, 
file, dbm, shared memory, etc.  A great use would be if this information 
was in a database (using mod_dbd, perhaps) so that an admin (or script) 
could selectively expire and purge cache entries.  This could also lead 
to a much more efficient version of htcacheclean that did not have to 
crawl the directory tree.


---

Another idea from squid:

Pre-make all cache directories in mod_disk_cache (post_config or an 
auxiliary script).  This eliminates some overhead and simplifies the code. 
Use directories like 0/0, 0/1, 0/2, ... z/y, z/z.
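A hypothetical sketch of that pre-making step outside httpd (the character set and two-level depth are illustrative; a real version would honor mod_disk_cache's CacheDirLevels/CacheDirLength settings):

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Pre-create a two-level cache directory tree, e.g. base/0/0 ... base/z/y,
 * so the cache never has to mkdir on the request path. */
static int premake_cache_dirs(const char *base, const char *charset)
{
    char path[1024];
    size_t i, j, n = strlen(charset);

    (void)mkdir(base, 0755);    /* base may already exist */
    for (i = 0; i < n; i++) {
        snprintf(path, sizeof(path), "%s/%c", base, charset[i]);
        if (mkdir(path, 0755) != 0)
            return -1;
        for (j = 0; j < n; j++) {
            snprintf(path, sizeof(path), "%s/%c/%c",
                     base, charset[i], charset[j]);
            if (mkdir(path, 0755) != 0)
                return -1;
        }
    }
    return 0;
}
```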


Also, have Vary just regenerate a new key.  The "make a .vary directory" 
approach is ugly.


---

Faster!  Some patches that gave us speed improvements:

only cache req_hdrs if Vary is present:

http://marc.theaimsgroup.com/?l=apache-httpd-dev&m=112447694826332&w=2


speed up read_table:

http://marc.theaimsgroup.com/?l=apache-httpd-dev&m=112437134119057&w=2

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies