Re: mod-cache-requestor plan

2005-07-22 Thread Parin Shah
This would definitely relieve mod-cache from checking the status of the
page on every request. But then we would not be able to keep track of the
popularity of the pages.

But yes, this is a good observation. If we could come up with a
mechanism that keeps track of page popularity (number of requests and
last access time) without mod-cache's involvement, then that would be a
better approach.

-Parin.


On 7/22/05, Sergio Leonardi <[EMAIL PROTECTED]> wrote:
> The basic approach is ok for me, I just make a note.
> I think that mod_cache should put each cached page in the queue at the time
> its entry in the cache is created (or when its expire time has been
> changed), setting the proper regeneration time in the queue (e.g.
> regeneration time = page expire time - time spent for last page generation).
> 
> In such a way there's no need to lookup for what's expiring, just sleep
> until something needs to be regenerated.
> Bye
> 
> Sergio
> 
> -Original Message-
> From: Parin Shah [mailto:[EMAIL PROTECTED]
> Sent: Friday, 22 July 2005 8:02
> To: dev@httpd.apache.org
> Subject: Re: mod-cache-requestor plan
> 
> Thanks Ian, Graham and Sergio for your help.
> 
> for past couple of days I am trying to figure out how our
> mod-cache-requester should spawn thread (or set of threads).
> Currently, I am considering following option. please let me know what
> you think about this approach.
> 
> - mod-cache-requester would be a sub-module in mod-cache as Graham had
> suggested once.
> 
> - it would look similar to mod-mem-cache. it would have provider
> (mod-cache-requester-provider, for lack of any better word for now)
> registered.
> 
> - mod-cache (cache_url_handler to be precise)  will do lookup for this
> provider and will use this provider's methods to push any page which
> is soon-to-be-expired in the priority queue.
> 
> - in the post config of the mod-cache-requester our pqueue would be
> initialized along with mutexes and other stuff.
> 
> - now, we would create new thread (or set of threads) in the post
> config which would basically contain an infinite loop. it (or they)
> will keep checking pqueue and would make sub requests accordingly.
> 
> Does this make sense?
> 
> If this approach is correct then I have some questions regarding
> thread vs process implementation. I would start discussing it once we
> have main architecture in place.
> 
> Thanks,
> Parin.
> 
> On 7/20/05, Graham Leggett <[EMAIL PROTECTED]> wrote:
> > Parin Shah wrote:
> >
> > > 2. how mod-cache-requester can generate the sub request just to reload
> > > the content in the cache.
> >
> > Look inside mod_include - it uses subrequests to be able to embed pages
> > within other pages.
> >
> > Regards,
> > Graham
> > --
> >
> 
>


RE: mod-cache-requestor plan

2005-07-22 Thread Sergio Leonardi
The basic approach is OK for me, I'll just make one note.
I think that mod_cache should put each cached page in the queue at the time
its entry in the cache is created (or when its expiry time has been
changed), setting the proper regeneration time in the queue (e.g.
regeneration time = page expiry time - time spent on the last page
generation).

In such a way there's no need to look up what's expiring; just sleep
until something needs to be regenerated.
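That scheduling rule can be sketched as a tiny helper (hypothetical code, not part of mod_cache; the names are invented for illustration):

```c
#include <time.h>

/* Hypothetical helper, not mod_cache API: schedule a page's refresh
 * early enough that regeneration can finish before the entry expires,
 * per "regeneration time = expiry time - last generation time". */
static time_t regeneration_time(time_t expires, time_t last_gen_secs)
{
    return expires - last_gen_secs;
}
```

The queue worker would then simply sleep until the head entry's regeneration time is due, with no expiry scan needed.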
Bye

Sergio

-Original Message-
From: Parin Shah [mailto:[EMAIL PROTECTED] 
Sent: Friday, 22 July 2005 8:02
To: dev@httpd.apache.org
Subject: Re: mod-cache-requestor plan

Thanks Ian, Graham and Sergio for your help. 

For the past couple of days I have been trying to figure out how our
mod-cache-requester should spawn a thread (or set of threads).
Currently, I am considering the following option; please let me know
what you think about this approach.

- mod-cache-requester would be a sub-module of mod-cache, as Graham had
suggested once.

- it would look similar to mod-mem-cache: it would have a provider
(mod-cache-requester-provider, for lack of a better word for now)
registered.

- mod-cache (cache_url_handler, to be precise) will look up this
provider and use the provider's methods to push any soon-to-expire page
into the priority queue.

- in the post_config of mod-cache-requester, our pqueue would be
initialized along with mutexes and other stuff.

- now, we would create a new thread (or set of threads) in post_config
which would basically contain an infinite loop. it (or they) will keep
checking the pqueue and make sub-requests accordingly.

Does this make sense?

If this approach is correct then I have some questions regarding the
thread vs process implementation. I will start discussing it once we
have the main architecture in place.

Thanks,
Parin.

On 7/20/05, Graham Leggett <[EMAIL PROTECTED]> wrote:
> Parin Shah wrote:
> 
> > 2. how mod-cache-requester can generate the sub request just to reload
> > the content in the cache.
> 
> Look inside mod_include - it uses subrequests to be able to embed pages
> within other pages.
> 
> Regards,
> Graham
> --
>



Re: mod-cache-requestor plan

2005-07-21 Thread Parin Shah
Thanks Ian, Graham and Sergio for your help. 

For the past couple of days I have been trying to figure out how our
mod-cache-requester should spawn a thread (or set of threads).
Currently, I am considering the following option; please let me know
what you think about this approach.

- mod-cache-requester would be a sub-module of mod-cache, as Graham had
suggested once.

- it would look similar to mod-mem-cache: it would have a provider
(mod-cache-requester-provider, for lack of a better word for now)
registered.

- mod-cache (cache_url_handler, to be precise) will look up this
provider and use the provider's methods to push any soon-to-expire page
into the priority queue.

- in the post_config of mod-cache-requester, our pqueue would be
initialized along with mutexes and other stuff.

- now, we would create a new thread (or set of threads) in post_config
which would basically contain an infinite loop. it (or they) will keep
checking the pqueue and make sub-requests accordingly.

Does this make sense?

If this approach is correct then I have some questions regarding the
thread vs process implementation. I will start discussing it once we
have the main architecture in place.
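The post_config step in the plan above can be sketched as follows, using POSIX threads purely for illustration - the real module would use apr_thread_create() and APR mutexes, and every name here (requester_loop, start_requester, the stand-in pqueue) is invented:

```c
#include <pthread.h>

static pthread_mutex_t pq_lock = PTHREAD_MUTEX_INITIALIZER;
static int pq_size = 0;              /* stand-in for the shared pqueue */
static int running = 1;

static void *requester_loop(void *arg)
{
    (void)arg;
    while (running) {                /* the "infinite loop" in the plan */
        pthread_mutex_lock(&pq_lock);
        if (pq_size > 0) {
            pq_size--;               /* pop a soon-to-expire entry...   */
            /* ...and issue a sub-request here to refresh it.           */
        }
        pthread_mutex_unlock(&pq_lock);
        running = 0;                 /* real code would sleep and loop  */
    }
    return NULL;
}

/* Would be called once from post_config, after the pqueue and mutexes
 * have been initialized. */
static int start_requester(void)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, requester_loop, NULL) != 0)
        return -1;
    pthread_join(tid, NULL);         /* real code would leave it running */
    return 0;
}
```

A real worker would sleep between passes (or block on a condition variable) instead of exiting after one pass, and would stay alive for the server's lifetime.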

Thanks,
Parin.

On 7/20/05, Graham Leggett <[EMAIL PROTECTED]> wrote:
> Parin Shah wrote:
> 
> > 2. how mod-cache-requester can generate the sub request just to reload
> > the content in the cache.
> 
> Look inside mod_include - it uses subrequests to be able to embed pages
> within other pages.
> 
> Regards,
> Graham
> --
>


Re: mod-cache-requestor plan

2005-07-20 Thread Graham Leggett

Parin Shah wrote:

> 2. how mod-cache-requester can generate the sub request just to reload
> the content in the cache.

Look inside mod_include - it uses subrequests to be able to embed pages
within other pages.


Regards,
Graham
--


RE: mod-cache-requestor plan

2005-07-20 Thread Sergio Leonardi
Hi, I can only comment on point #2.
In my opinion mod-cache-requester should pass in the regeneration request
what a normal user would pass to the system (e.g. cookies, header variables
and so on), because a portion of these data can be relevant in order to
generate the page correctly.
mod-cache currently keeps track of the HTTP response (header and body); I
think it is best to keep track of the HTTP request too, in order to re-run
a copy of it to regenerate the page.
Does it make sense?
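A minimal sketch of the request state this implies - the struct and field names are invented, not anything from mod_cache:

```c
#include <string.h>

/* Invented struct: the slice of the original request worth replaying
 * when regenerating a page - cookies and variant-relevant headers. */
typedef struct saved_request {
    const char *uri;             /* cache key                      */
    const char *cookie;          /* Cookie: header, if any         */
    const char *accept_language; /* affects which variant is built */
} saved_request;

/* Illustrative check only: would a replayed request regenerate the
 * same variant as the stored one? */
static int same_variant(const saved_request *a, const saved_request *b)
{
    return strcmp(a->uri, b->uri) == 0
        && strcmp(a->accept_language, b->accept_language) == 0;
}
```

The point of keeping the request alongside the response is that the refresh sub-request can carry the same cookies and headers, so the regenerated page matches the cached variant.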

Sergio

-Original Message-
From: Parin Shah [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 20 July 2005 8:34
To: dev@httpd.apache.org
Subject: Re: mod-cache-requestor plan

Hi All,

We are now almost at consensus about this new mod-cache-requester
module's mechanism, and now I believe it's a good time to start
implementing the module.

But before I can do that, I need some help from you guys.

- I am now comfortable with mod-cache, mod-mem-cache, cache_storage.c,
cache_util.c, etc.

- But I am still not too sure how to implement a couple of things.

1. How to start the new thread/process for mod-cache-requester when the
server starts. Any similar piece of code would help me a lot.

2. How mod-cache-requester can generate the sub-request just to reload
the content in the cache.

3. In the current scheme, whenever mod-cache-requester pulls the first
entry from the pqueue (the 'refresh' queue) it re-requests it to reload.
Now, by the time this re-request is done, the page might actually have
expired and been removed from the cache. In such a case, should
mod-cache reload it or wait for the next legitimate request?

Your thoughts on any/all on these issues would be really helpful.

Thanks
Parin.

On 7/19/05, Ian Holsman <[EMAIL PROTECTED]> wrote:
> Parin Shah wrote:
> >>you should be using a mix of
> >>
> >># requests
> >>last access time
> >>cost of reproducing the request.
> >>
> >
> >
> > Just to double check, we would insert entry into the 'refresh queue'
> > only if the page is requested and the page is soon-to-be-expired. once
> > it is in the queue we would use above parameters to calculate the
> > priority. Is this correct? or let me know If I have mistaken it.
> >
> yep.
> thats the idea.
> refresh the most-popular pages first.
> 
> >
> >>see memcache_gdsf_algorithm() in mod_mem_cache.c for an implementation
> >>of this, which assumes 'length' of request is related to the cost of
> >>reproducing the request.
> >>
> >>the priority queue implementation is sitting in mod_mem_cache, and could
> >>be used to implement the 'refresh' queue I would think.
> >>
> >
> > I feel comfortable with mod-cache and mod-mem-cache code now. but we
> > also need to start new thread/process for mod-cache-requester when
> > server starts. I am not too sure how we could implement it. any
> > pointers to the similar piece of code would be really helpful to me.
> >
> I don't have any code which does this to share with you (others might
> know of some).
> 
> 
> > Thanks,
> > Parin.
> >
> --Ian
> 
>



Re: mod-cache-requestor plan

2005-07-19 Thread Parin Shah
Hi All,

We are now almost at consensus about this new mod-cache-requester
module's mechanism, and now I believe it's a good time to start
implementing the module.

But before I can do that, I need some help from you guys.

- I am now comfortable with mod-cache, mod-mem-cache, cache_storage.c,
cache_util.c, etc.

- But I am still not too sure how to implement a couple of things.

1. How to start the new thread/process for mod-cache-requester when the
server starts. Any similar piece of code would help me a lot.

2. How mod-cache-requester can generate the sub-request just to reload
the content in the cache.

3. In the current scheme, whenever mod-cache-requester pulls the first
entry from the pqueue (the 'refresh' queue) it re-requests it to reload.
Now, by the time this re-request is done, the page might actually have
expired and been removed from the cache. In such a case, should
mod-cache reload it or wait for the next legitimate request?

Your thoughts on any/all on these issues would be really helpful.

Thanks
Parin.

On 7/19/05, Ian Holsman <[EMAIL PROTECTED]> wrote:
> Parin Shah wrote:
> >>you should be using a mix of
> >>
> >># requests
> >>last access time
> >>cost of reproducing the request.
> >>
> >
> >
> > Just to double check, we would insert entry into the 'refresh queue'
> > only if the page is requested and the page is soon-to-be-expired. once
> > it is in the queue we would use above parameters to calculate the
> > priority. Is this correct? or let me know If I have mistaken it.
> >
> yep.
> thats the idea.
> refresh the most-popular pages first.
> 
> >
> >>see memcache_gdsf_algorithm() in mod_mem_cache.c for an implementation
> >>of this, which assumes 'length' of request is related to the cost of
> >>reproducing the request.
> >>
> >>the priority queue implementation is sitting in mod_mem_cache, and could
> >>be used to implement the 'refresh' queue I would think.
> >>
> >
> > I feel comfortable with mod-cache and mod-mem-cache code now. but we
> > also need to start new thread/process for mod-cache-requester when
> > server starts. I am not too sure how we could implement it. any
> > pointers to the similar piece of code would be really helpful to me.
> >
> I don't have any code which does this to share with you (others might
> know of some).
> 
> 
> > Thanks,
> > Parin.
> >
> --Ian
> 
>


Re: mod-cache-requestor plan

2005-07-19 Thread Ian Holsman

Parin Shah wrote:

> > you should be using a mix of
> >
> > # requests
> > last access time
> > cost of reproducing the request.
>
> Just to double check, we would insert entry into the 'refresh queue'
> only if the page is requested and the page is soon-to-be-expired. once
> it is in the queue we would use above parameters to calculate the
> priority. Is this correct? or let me know If I have mistaken it.

yep.
that's the idea.
refresh the most-popular pages first.

> > see memcache_gdsf_algorithm() in mod_mem_cache.c for an implementation
> > of this, which assumes 'length' of request is related to the cost of
> > reproducing the request.
> >
> > the priority queue implementation is sitting in mod_mem_cache, and could
> > be used to implement the 'refresh' queue I would think.
>
> I feel comfortable with mod-cache and mod-mem-cache code now. but we
> also need to start new thread/process for mod-cache-requester when
> server starts. I am not too sure how we could implement it. any
> pointers to the similar piece of code would be really helpful to me.

I don't have any code which does this to share with you (others might
know of some).

> Thanks,
> Parin.

--Ian



Re: mod-cache-requestor plan

2005-07-17 Thread Parin Shah
> you should be using a mix of
> 
> # requests
> last access time
> cost of reproducing the request.
> 

Just to double check: we would insert an entry into the 'refresh queue'
only if the page is requested and the page is soon to expire. Once it is
in the queue, we would use the above parameters to calculate the
priority. Is this correct? Let me know if I have misunderstood it.

> see memcache_gdsf_algorithm() in mod_mem_cache.c for an implementation
> of this, which assumes 'length' of request is related to the cost of
> reproducing the request.
> 
> the priority queue implementation is sitting in mod_mem_cache, and could
> be used to implement the 'refresh' queue I would think.
> 
I feel comfortable with the mod-cache and mod-mem-cache code now, but we
also need to start a new thread/process for mod-cache-requester when the
server starts. I am not too sure how we could implement it; any
pointers to a similar piece of code would be really helpful to me.

Thanks,
Parin.


Re: mod-cache-requestor plan

2005-07-16 Thread Ian Holsman

Parin Shah wrote:

> On 7/15/05, Colm MacCarthaigh <[EMAIL PROTECTED]> wrote:
> > On Fri, Jul 15, 2005 at 01:23:29AM -0500, Parin Shah wrote:
> > > - we need to maintain a counter for url in this case which would
> > > decide the priority of the url. But mainting this counter should be a
> > > low overhead operation, I believe.
> >
> > Is a counter strictly speaking the right approach? Why not a time of
> > last access?
> >
> > I havn't run a statistical analysis but based on my logs the likelyhood
> > of a url being accessed is very highly correlated to how recently it has
> > been accessed before. A truly popular page will always have been
> > accessed recently, a page that is becoming popular (and therefore very
> > likely to get future hits) will have been accessed recently and a page
> > who's popularity is rapidly diminishing will not have been accessed
> > recently.
>
> Last Access Time is definetaly better solution when compared to
> counter mechanism. Would like to know other ppl's opinion too.

you should be using a mix of

# requests
last access time
cost of reproducing the request.

see memcache_gdsf_algorithm() in mod_mem_cache.c for an implementation
of this, which assumes 'length' of request is related to the cost of
reproducing the request.

the priority queue implementation is sitting in mod_mem_cache, and could
be used to implement the 'refresh' queue I would think.

> Thanks,
> Parin.
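One plausible way to combine the three inputs Ian lists into a single score - this is an invented formula for illustration, not what memcache_gdsf_algorithm() actually computes:

```c
/* Invented priority mix: more hits, more recent access, and a higher
 * regeneration cost all push an entry toward the front of the
 * 'refresh' queue. */
static double refresh_priority(long hits, double secs_since_access,
                               double regen_cost)
{
    double recency = 1.0 / (secs_since_access + 1.0); /* newer => larger */
    return (double)hits * recency * regen_cost;
}
```

Any monotonic combination would do; the point is only that hit count, recency, and cost each move an entry up the queue independently.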





Re: mod-cache-requestor plan

2005-07-16 Thread Parin Shah
On 7/16/05, Graham Leggett <[EMAIL PROTECTED]> wrote:
> Parin Shah wrote:
> 
> > - I would prefer the approach where we maintain priority queue to keep
> > track of popularity. But again you guys have more insight and
> > understanding. so whichever approach you guys decide, I am ready to
> > work on it! ;-)
> 
> Beware of scope creep - we can always start with something simple, like
> a straight list of URLs, and then add the priority later, depends on how
> easy or difficult it is to do.

- Good point. We could start with something simple, as you said, and
adding the priority queue should not be difficult once we have the basic
mechanism ready.

Thanks,
Parin.


Re: mod-cache-requestor plan

2005-07-16 Thread Graham Leggett

Parin Shah wrote:

> - I would prefer the approach where we maintain priority queue to keep
> track of popularity. But again you guys have more insight and
> understanding. so whichever approach you guys decide, I am ready to
> work on it! ;-)

Beware of scope creep - we can always start with something simple, like
a straight list of URLs, and then add the priority later, depending on
how easy or difficult it is to do.


Regards,
Graham
--


Re: mod-cache-requestor plan

2005-07-15 Thread Parin Shah
On 7/15/05, Colm MacCarthaigh <[EMAIL PROTECTED]> wrote:
> On Fri, Jul 15, 2005 at 01:23:29AM -0500, Parin Shah wrote:
> > - we need to maintain a counter for url in this case which would
> > decide the priority of the url. But mainting this counter should be a
> > low overhead operation, I believe.
> 
> Is a counter strictly speaking the right approach? Why not a time of
> last access?
> 
> I havn't run a statistical analysis but based on my logs the likelyhood
> of a url being accessed is very highly correlated to how recently it has
> been accessed before. A truly popular page will always have been
> accessed recently, a page that is becoming popular (and therefore very
> likely to get future hits) will have been accessed recently and a page
> who's popularity is rapidly diminishing will not have been accessed
> recently.
> 

Last access time is definitely a better solution compared to the
counter mechanism. I would like to know other people's opinions too.

Thanks,
Parin.


Re: mod-cache-requestor plan

2005-07-15 Thread Colm MacCarthaigh
On Fri, Jul 15, 2005 at 01:23:29AM -0500, Parin Shah wrote:
> - we need to maintain a counter for url in this case which would
> decide the priority of the url. But mainting this counter should be a
> low overhead operation, I believe.

Is a counter strictly speaking the right approach? Why not a time of
last access? 

They give slightly different results, but each is more useful than the
other in certain situations. Before htcacheclean existed, I used find
and simply deleted files in order of oldest atime attribute until I had
enough free space. 

That kind of behaviour was very useful in my situation (although it
involved mounting without the noatime mount option, which I dislike for
other reasons).

I haven't run a statistical analysis, but based on my logs the likelihood
of a URL being accessed is very highly correlated with how recently it has
been accessed before. A truly popular page will always have been
accessed recently, a page that is becoming popular (and therefore very
likely to get future hits) will have been accessed recently, and a page
whose popularity is rapidly diminishing will not have been accessed
recently.

-- 
Colm MacCárthaigh    Public Key: [EMAIL PROTECTED]


Re: mod-cache-requestor plan

2005-07-14 Thread Parin Shah
Thanks all for your thoughts on this issue.

> > The priority re-fetch would make sure the
> > popular pages are always in cache, while others are allowed to die at
> > their expense.
> 
> 
> So every request for an object would update a counter for that url?
> 
- we need to maintain a counter for the url in this case, which would
decide the priority of the url. But maintaining this counter should be a
low-overhead operation, I believe.

> Both approaches have disadvantages.  I guess you just have to choose your
> poison :)
> 
- I would prefer the approach where we maintain a priority queue to keep
track of popularity. But again, you guys have more insight and
understanding, so whichever approach you guys decide, I am ready to
work on it! ;-)

Thanks,
Parin.


Re: mod-cache-requestor plan

2005-07-14 Thread Akins, Brian



On 7/14/05 9:59 AM, "Ian Holsman" <[EMAIL PROTECTED]> wrote:

> 
> that wouldn't keep track of the popularity of the given url, only when
> it is stored. 


Which would be a useful input to something like htcacheclean so that it does
not have to scan directories.

> The priority re-fetch would make sure the
> popular pages are always in cache, while others are allowed to die at
> their expense.


So every request for an object would update a counter for that url?

I still think this would be better handled as an external process with some
"glue" between it and apache (IPC, dbm, shm, etc.).

Both approaches have disadvantages.  I guess you just have to choose your
poison :)


-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies




Re: mod-cache-requestor plan

2005-07-14 Thread Ian Holsman

This was a private message. I will continue this one offline.

Akins, Brian wrote:

> On 7/13/05 6:36 PM, "Ian Holsman" <[EMAIL PROTECTED]> wrote:
> > Hi There.
> >
> > just remember that this project is Parin's SoC project, and he is
> > expected to do the code on it.
>
> sure.  I am expected to do what's best for my employer and the httpd
> project.
>
> > While normally I think it would be great to get a patch, we need parin
> > to do the work on this, otherwise he might get a bit upset when it comes
> > to getting paid.
>
> He can have the patch.  I just want to see it done right.







Re: mod-cache-requestor plan

2005-07-14 Thread Ian Holsman

Akins, Brian wrote:

> On 7/13/05 6:41 PM, "Ian Holsman" <[EMAIL PROTECTED]> wrote:
> > a pool of threads read the queue and start fetching the content, and
> > re-filling the cache with fresh responses.
>
> How is this better than simply having an external cron job to fetch the
> urls?  You have total control of throttling there and it doesn't muck up
> the cache code.
>
> A good idea may be to have a "cache store hook" that gets called after a
> cache object is stored.  In it, another module could keep track of cached
> url's.  This list could be feed to the above cron job.  I know one big web
> site that may do it in a similar way...

that wouldn't keep track of the popularity of the given url, only when
it is stored. I'm guessing the popularity of news stories on CNN
is directly proportional to whether they are linked off one of the doors
or if they have just been published.

other large sites (for example product reviews, or things like
webshots.com) get most of their traffic indirectly via searches, and not
directly from a link on a door for example, and have traffic patterns
more like a Zipf distribution. The priority re-fetch would make sure the
popular pages are always in cache, while others are allowed to die at
their expense.

BTW. I'm not saying it's better, I'm just saying it's different, and
news sites aren't the only large sites in town who need caches.

Regards
Ian







Re: mod-cache-requestor plan

2005-07-14 Thread Akins, Brian



On 7/13/05 6:41 PM, "Ian Holsman" <[EMAIL PROTECTED]> wrote:

> a pool of threads read the queue and start fetching the content, and
> re-filling the cache with fresh responses.
>

How is this better than simply having an external cron job to fetch the
urls?  You have total control of throttling there and it doesn't muck up the
cache code.

A good idea may be to have a "cache store hook" that gets called after a
cache object is stored.  In it, another module could keep track of cached
url's.  This list could be fed to the above cron job.  I know one big web
site that may do it in a similar way...


-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies




Re: mod-cache-requestor plan

2005-07-14 Thread Akins, Brian

On 7/13/05 6:36 PM, "Ian Holsman" <[EMAIL PROTECTED]> wrote:

> Hi There.
> 
> just remember that this project is Parin's SoC project, and he is
> expected to do the code on it.

sure.  I am expected to do what's best for my employer and the httpd
project.


> While normally I think it would be great to get a patch, we need parin
> to do the work on this, otherwise he might get a bit upset when it comes
> to getting paid.


He can have the patch.  I just want to see it done right.



-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies




Re: mod-cache-requestor plan

2005-07-13 Thread Ian Holsman

What my initial idea for this was:


we feed the 'soon to be expired' URLs into a priority queue. (similar to 
mod-mem-cache's)


a pool of threads read the queue and start fetching the content, and 
re-filling the cache with fresh responses.


the benefit of this method would be that we control exactly how hard we
hit the back ends, as well as fetching the important stuff first.


this is slightly different than where this thread is going.

I'm open to both, but I think the method below could still result in 
swamping the backend server when lots of unique URLs get requested.


--Ian

Parin Shah wrote:

> > We have been down this road.  The way one might solve it is to allow
> > mod_cache to be able to reload an object while serving the "old" one.
> >
> > Example:
> >
> > cache /A for 600 seconds
> >
> > after 500 seconds, request /A with special header (or from special client,
> > etc) and cache does not serve from cache, but rather pretends the cache has
> > expired.  do normal refresh stuff.
> >
> > The cache will continue to server /A even though it is refreshing it
>
> As Graham suggested, such mechanism will not refresh the pages those
> are non-popular but expensive to load. which could incur lot of
> overhead. But, other than that, This looks really good solution.
>
> > Also, one of the flaws of mod_disk_cache (at least the version I am looking
> > at) is that it deletes objects before reloading them.  It is better for many
> > reasons to only replace them.  That's the best way to accomplish what I
> > described above.
>
> If we implement it the way you suggested, then this problem would
> automatically be solved.
>
> -Parin.





Re: mod-cache-requestor plan

2005-07-13 Thread Akins, Brian



On 7/13/05 2:43 PM, "Graham Leggett" <[EMAIL PROTECTED]> wrote:

> This was one of the basic design goals of the new cache, but the code
> for it was never written.
> 
> It was logged as a bug against the original v1.3 proxy cache, which
> suffered from thundering herd when cache entries expired.
> 
> At some point soon when my mailbox is a little less full and I have a
> week or two spare, I plan to fix this problem if nobody beats me to it :)


Should only take a couple of hours to get a working patch together.  Maybe
I'll get time this week for mod_disk_cache


-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies




Re: mod-cache-requestor plan

2005-07-13 Thread Graham Leggett

Parin Shah wrote:

> - In this case, what would be the criteria to determine which pages
> should be refreshed and which should be left out. intitially I thought
> that all the pages - those are about to expire and have been requested -
> should be refreshed. but, if we consider keeping non-popular but
> expensive pages in the cache, in that case houw would te mod-c-requester
> would make the decision?

The current backend cache modules use the CacheEnable directive to say
"this backend is valid for this URL space".

In theory mod_cache_requester could use this directive to hook itself
in. The cache is supposed to allow multiple backend modules the option
to handle the request (for example, the mem cache might say "nope, too
big for me" while the disk cache says "yep, can cache that file, let me
handle it"), so requester would just add itself to the list - it would
probably need to be first in the list.

If mod_cache_requester was in there, it could add that URL space to the
list of URL spaces to be freshened on a periodic basis.

You may need an extra cache hook that says "give me all cached URLs
under URL space XXX", not sure if such a thing exists at the moment.

Keep in mind the cache can store variants of the same URL (for example,
the same page in different languages, or one compressed and another not
compressed), which will have to be kept in mind while refreshing.

> - considering that mod-cache-requester would be using some mod-cache's
> hooks to query the elements in the cache, would mod-cache-requester be
> still highly dependent on the platform (process vs threads)?

It would be dependent on whether threads or processes are present (or
maybe both) rather than platform, which is a far simpler case to code for.

Something along the lines of "if (threads or both), create refresher
thread, else if process only, create refresh process keeping in mind
that refreshing mem cache won't work".


Regards,
Graham
--


Re: mod-cache-requestor plan

2005-07-13 Thread Graham Leggett

Akins, Brian wrote:

> This avoids the "thundering herd" to the backend server/database/whatever
> handler.
>
> Trust me, it works :)


This was one of the basic design goals of the new cache, but the code 
for it was never written.


It was logged as a bug against the original v1.3 proxy cache, which 
suffered from thundering herd when cache entries expired.


At some point soon when my mailbox is a little less full and I have a 
week or two spare, I plan to fix this problem if nobody beats me to it :)


Regards,
Graham
--


Re: mod-cache-requestor plan

2005-07-13 Thread Akins, Brian



On 7/12/05 10:27 PM, "Parin Shah" <[EMAIL PROTECTED]> wrote:

>
>> Also, one of the flaws of mod_disk_cache (at least the version I am looking
>> at) is that it deletes objects before reloading them.  It is better for many
>> reasons to only replace them.  That's the best way to accomplish what I
>> described above.
> 
> If we implement it the way you suggested, then this problem would
> automatically be solved.


The basic flow of mod_disk_cache should be something like:

Determine cache key.

Does meta and data exist?
yes -> check for expire, serve it, etc

No: insert filter, etc

In the filter, open a deterministic tmp file, not the "random" ones like
in the current code - something like metafile.tmp. When the file is
opened, try to open it exclusively. That way only one worker is trying
to cache the file.

After caching, rename from the tmp files to the real files.

Also, using such a temp file scheme allows you to be "sloppy"
with your expiry times:

Does meta file exist? Yes.
Is meta file "fresh"? No.
Does a tmp file exist? Yes, someone else is "refreshing" it.
Is temp file less than x seconds old? Yes.
Serve "stale" content.

This avoids the "thundering herd" to the backend server/database/whatever
handler.  

Trust me, it works :)
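The exclusive-open-then-rename core of that flow, in plain POSIX calls for illustration (mod_disk_cache itself would go through the APR file APIs; the function and file names here are invented):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Try to become the one worker refreshing this entry: O_EXCL makes
 * creation fail if another worker already holds the tmp file. */
static int begin_refresh(const char *tmp_path)
{
    return open(tmp_path, O_CREAT | O_EXCL | O_WRONLY, 0644);
}

/* Atomically replace the live cache file: readers see either the old
 * or the new entry, never a deleted-then-missing window. */
static int commit_refresh(int fd, const char *tmp_path, const char *path)
{
    if (close(fd) != 0)
        return -1;
    return rename(tmp_path, path);
}
```

open() with O_EXCL fails when the tmp file already exists, which is exactly the "someone else is refreshing it, serve stale" signal described above; rename() then swaps the new entry in without ever deleting the old one first.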



-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies




Re: mod-cache-requestor plan

2005-07-12 Thread Parin Shah
> We have been down this road.  The way one might solve it is to allow
> mod_cache to be able to reload an object while serving the "old" one.
> 
> Example:
> 
> cache /A for 600 seconds
> 
> after 500 seconds, request /A with special header (or from special client,
> etc) and cache does not serve from cache, but rather pretends the cache has
> expired.  do normal refresh stuff.
> 
> The cache will continue to server /A even though it is refreshing it
> 

As Graham suggested, such a mechanism will not refresh the pages that
are non-popular but expensive to load, which could incur a lot of
overhead. But other than that, this looks like a really good solution.

> 
> Also, one of the flaws of mod_disk_cache (at least the version I am looking
> at) is that it deletes objects before reloading them.  It is better for many
> reasons to only replace them.  That's the best way to accomplish what I
> described above.

If we implement it the way you suggested, then this problem would
automatically be solved.

-Parin.


Re: mod-cache-requestor plan

2005-07-12 Thread Akins, Brian



On 7/11/05 11:48 PM, "Parin Shah" <[EMAIL PROTECTED]> wrote:

> ... should be refreshed. But if we consider keeping non-popular but
> expensive pages in the cache, in that case how would the mod-c-requester
> make the decision?
> 

We have been down this road.  The way one might solve it is to allow
mod_cache to be able to reload an object while serving the "old" one.

Example:

cache /A for 600 seconds

after 500 seconds, request /A with special header (or from special client,
etc) and cache does not serve from cache, but rather pretends the cache has
expired.  do normal refresh stuff.

The cache will continue to serve /A even though it is refreshing it.


Make any sense? This way, a simple cron job can be used to refresh desired
objects.

Also, one of the flaws of mod_disk_cache (at least the version I am looking
at) is that it deletes objects before reloading them.  It is better for many
reasons to only replace them.  That's the best way to accomplish what I
described above.

-- 
Brian Akins
Lead Systems Engineer
CNN Internet Technologies




Re: mod-cache-requestor plan

2005-07-11 Thread Parin Shah
> I believe the basic idea of forwarding multiple requests on the back end
> can be a very good idea, but needs some bounds as Graham suggests.
> ...

It's an interesting thought. But after Graham's opinion, I am not too sure
about the performance improvement versus the overhead incurred by the
threads. If we could gain a significant performance improvement (which would
be the case when the server is lightly loaded), then it is worth going for
multiple sub-requests with some upper bound.

-Parin


Re: mod-cache-requestor plan

2005-07-11 Thread Parin Shah
> - Cache freshness of an URL is checked on each hit to the URL. This runs
> the risk of allowing non-popular (but possibly expensive) URLs to expire
> without the chance to be refreshed.
>
> - Cache freshness is checked in an independent thread, which monitors the
> cached URLs for freshness at predetermined intervals, and updates them
> automatically and independently of the frontend.
>
> Either way, it would be useful for mod_cache_requester to operate
> independently of the cache serving requests, so that "cache freshening"
> doesn't slow down the frontend.
>
> I would vote for the second option - a "cache spider" that keeps it fresh.

In this case, what would be the criteria to determine which pages should be
refreshed and which should be left out? Initially I thought that all the
pages that are about to expire and have been requested should be refreshed.
But if we consider keeping non-popular but expensive pages in the cache, how
would the mod-c-requester make the decision?

> Once mod_cache_requester has decided that a URL needs to be "freshened",
> all it needs to do is to make a subrequest to that URL setting the
> relevant Cache-Control headers to tell it to refresh the cache, and let
> the normal caching mechanism take its course.

Hmm, this seems to be the most elegant solution.

> mod_cache_requester would probably be a submodule of mod_cache, using
> mod_cache provided hooks to query elements in the cache.

Considering that mod-cache-requester would be using some of mod-cache's
hooks to query the elements in the cache, would mod-cache-requester still be
highly dependent on the platform (processes vs threads)?

Thanks a lot for all this valuable information, Graham.

Parin.


Re: mod-cache-requestor plan

2005-07-11 Thread [EMAIL PROTECTED]
Hi all
I basically agree with Graham, with just one observation on multi-threaded 
subrequests.
I believe the basic idea of forwarding multiple requests on the back end can be 
a very good idea, but needs some bounds as Graham suggests.

In my opinion you can define a mod_cache_requester connection pool to the back 
end server, which could be limited in size in order to avoid back end saturation.
Using this approach you could design mod_cache_requester in such a way:
- use a priority queue to keep track of needed cache refresh requests and the 
scheduled time for each refresh (this latter field is the sort key for the 
priority)
- each time an URL is requested for the first time you should cache request 
data (in addition to response header) and fill the priority queue with required 
data
- each mod_cache_requester thread can read from the queue one URL and pass the 
request (stored previously) to the back end

In such a way you can realize an "optimized" requester.
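A minimal sketch of such a priority queue, keyed on the scheduled regeneration time. The entry layout and fixed capacity are invented for illustration; a real module would store the cached request data alongside each entry:

```c
#include <string.h>
#include <time.h>

#define PQ_MAX 256

/* Binary min-heap ordered by scheduled regeneration time: the root is
 * always the next URL due for refresh. */
typedef struct {
    time_t when;        /* when this URL should be re-requested */
    char   url[128];
} pq_entry;

static pq_entry heap[PQ_MAX];
static int pq_len = 0;

int pq_push(time_t when, const char *url)
{
    if (pq_len == PQ_MAX) return -1;
    int i = pq_len++;
    heap[i].when = when;
    strncpy(heap[i].url, url, sizeof heap[i].url - 1);
    heap[i].url[sizeof heap[i].url - 1] = '\0';
    while (i > 0 && heap[(i - 1) / 2].when > heap[i].when) {  /* sift up */
        pq_entry t = heap[i]; heap[i] = heap[(i - 1) / 2]; heap[(i - 1) / 2] = t;
        i = (i - 1) / 2;
    }
    return 0;
}

/* Pop the most urgent entry.  A requester thread would sleep until
 * heap[0].when, pop, and hand the stored request to the back end. */
int pq_pop(pq_entry *out)
{
    if (pq_len == 0) return -1;
    *out = heap[0];
    heap[0] = heap[--pq_len];
    int i = 0;
    for (;;) {                                                /* sift down */
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < pq_len && heap[l].when < heap[m].when) m = l;
        if (r < pq_len && heap[r].when < heap[m].when) m = r;
        if (m == i) break;
        pq_entry t = heap[i]; heap[i] = heap[m]; heap[m] = t;
        i = m;
    }
    return 0;
}
```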
What do you think of it?

   Sergio

> From: "Graham Leggett" 
>
> Parin Shah said:
> 
> > When the page expires from the cache, it is removed from cache and
> > thus next request has to wait until that page is reloaded by the
> > back-end server.
> 
> This is not strictly true - when a page expires from the cache, a
> conditional request is sent to the backend server, and if a fresher
> version is available it is updated, otherwise the existing cache contents
> are left alone. Place was left in the original cache design for serving
> multiple requests of the same non-fresh URL without fetching the backend
> URL many times, but this has not yet been implemented.
> 
> The option to guarantee freshness of the cache is a very useful feature
> though.
> 
> > Here is the overview of how am I planning to implement it.
> >
> > 1. when a page is requested and it exists in the cache, mod_cache
> > checks the expiry time of the page.
> >
> > 2. If (expiry time – current time)  < Some_Constant_Value,
> > then mod-cache notifies mod_cache_requester about this page.
> > This communication between mod_cache and mod_cache_requester should
> > incur least overhead as this would affect current request's response
> > time.
> 
> There are two approaches to this:
> 
> - Cache freshness of an URL is checked on each hit to the URL. This runs
> the risk of allowing non-popular (but possibly expensive) URLs to expire
> without the chance to be refreshed.
> 
> - Cache freshness is checked in an independent thread, which monitors the
> cached URLs for freshness at predetermined intervals, and updates them
> automatically and independently of the frontend.
> 
> Either way, it would be useful for mod_cache_requester to operate
> independently of the cache serving requests, so that "cache freshening"
> doesn't slow down the frontend.
> 
> I would vote for the second option - a "cache spider" that keeps it fresh.
> 
> > 3. mod_cache_requester will re-request the page which is soon-to-expire.
> > Each such request is done through separate thread so that multiple
> > pages could be re-requested simultaneously.
> 
> Once mod_cache_requester has decided that a URL needs to be "freshened",
> all it needs to do is to make a subrequest to that URL setting the
> relevant Cache-Control headers to tell it to refresh the cache, and let
> the normal caching mechanism take its course.
> 
> Putting the subrequests into separate threads isn't necessarily a good
> idea, as you don't want to put a sudden simultaneous load onto the backend
> server, or take up too much processing power of the frontend itself. You
> also probably want to keep things simple.
> 
> > This request would force the server to reload the content of the page
> > into the cache even if it is already there. (this would reset the
> > expiry time of the page and thus it would be able to stay in the cache
> > for longer duration.)
> 
> The cache code should already do this.
> 
> > Please let me know what you think about this module. Also I have some
> > questions  and your help would be really useful.
> >
> > 1.what would be the best way for communication between mod_cache and
> > mod_cache_requester.  I believe that keeping  mod_cache_requester in a
> > separate thread would be the best way.
> 
> mod_cache_requester will need access to the backend caches so that it can
> query freshness. This is done through hooks made available for mod_cache
> to do the same thing.
> 
> Firing off a separate thread/process for mod_cache_requester can be done
> when the server starts up and the module is initialised, however keep in
> mind some of the limitations of threads and processes:
> 
> - If the platform supports threads, then you can monitor the disk cache,
> the memory cache, and the shared memory cache.
> - If the platform supports processes, then you can monitor the disk cache
> and shared memory cache only.
> 
> > 2.How should the mod_cache_requester send the re-request to the main
> > server.
> 
> You fire off a subrequest to an URL, and throw away the data that comes back.

Re: mod-cache-requestor plan

2005-07-11 Thread Graham Leggett
Parin Shah said:

> When the page expires from the cache, it is removed from the cache and
> thus the next request has to wait until that page is reloaded by the
> back-end server.

This is not strictly true - when a page expires from the cache, a
conditional request is sent to the backend server, and if a fresher
version is available it is updated, otherwise the existing cache contents
are left alone. Place was left in the original cache design for serving
multiple requests of the same non-fresh URL without fetching the backend
URL many times, but this has not yet been implemented.

The option to guarantee freshness of the cache is a very useful feature
though.

> Here is the overview of how am I planning to implement it.
>
> 1. when a page is requested and it exists in the cache, mod_cache
> checks the expiry time of the page.
>
> 2. If (expiry time – current time)  < Some_Constant_Value,
> then mod-cache notifies mod_cache_requester about this page.
> This communication between mod_cache and mod_cache_requester should
> incur least overhead as this would affect current request's response
> time.

There are two approaches to this:

- Cache freshness of an URL is checked on each hit to the URL. This runs
the risk of allowing non-popular (but possibly expensive) URLs to expire
without the chance to be refreshed.

- Cache freshness is checked in an independent thread, which monitors the
cached URLs for freshness at predetermined intervals, and updates them
automatically and independently of the frontend.

Either way, it would be useful for mod_cache_requester to operate
independently of the cache serving requests, so that "cache freshening"
doesn't slow down the frontend.

I would vote for the second option - a "cache spider" that keeps it fresh.
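Such a "cache spider" might look like the following sketch. The entry table, threshold, and interval are all hypothetical, and the scan step is kept as a pure function so the refresh policy can be tested without starting a thread (in httpd the thread would be created at module init, e.g. via `pthread_create` on a threaded platform):

```c
#include <pthread.h>
#include <time.h>
#include <unistd.h>

typedef struct {
    time_t expires;        /* expiry time of the cached URL */
    int    needs_refresh;  /* set when the spider schedules a subrequest */
} spider_entry;

/* One scan pass: mark every entry within threshold_secs of expiry.
 * Returns how many entries were newly marked. */
int spider_scan(spider_entry *entries, int n, time_t now, int threshold_secs)
{
    int marked = 0;
    for (int i = 0; i < n; i++) {
        if (entries[i].expires - now < threshold_secs
            && !entries[i].needs_refresh) {
            entries[i].needs_refresh = 1;  /* subrequest would fire here */
            marked++;
        }
    }
    return marked;
}

struct spider_args { spider_entry *entries; int n; int interval, threshold; };

/* Thread body: monitor the cached URLs at a predetermined interval,
 * independently of the request-serving frontend. */
static void *spider_main(void *p)
{
    struct spider_args *a = p;
    for (;;) {
        spider_scan(a->entries, a->n, time(NULL), a->threshold);
        sleep(a->interval);
    }
    return NULL;
}
```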

> 3. mod_cache_requester will re-request the page which is soon-to-expire.
> Each such request is done through separate thread so that multiple
> pages could be re-requested simultaneously.

Once mod_cache_requester has decided that a URL needs to be "freshened",
all it needs to do is to make a subrequest to that URL setting the
relevant Cache-Control headers to tell it to refresh the cache, and let
the normal caching mechanism take its course.

Putting the subrequests into separate threads isn't necessarily a good
idea, as you don't want to put a sudden simultaneous load onto the backend
server, or take up too much processing power of the frontend itself. You
also probably want to keep things simple.

> This request would force the server to reload the content of the page
> into the cache even if it is already there. (this would reset the
> expiry time of the page and thus it would be able to stay in the cache
> for longer duration.)

The cache code should already do this.

> Please let me know what you think about this module. Also I have some
> questions  and your help would be really useful.
>
> 1.what would be the best way for communication between mod_cache and
> mod_cache_requester.  I believe that keeping  mod_cache_requester in a
> separate thread would be the best way.

mod_cache_requester will need access to the backend caches so that it can
query freshness. This is done through hooks made available for mod_cache
to do the same thing.

Firing off a separate thread/process for mod_cache_requester can be done
when the server starts up and the module is initialised, however keep in
mind some of the limitations of threads and processes:

- If the platform supports threads, then you can monitor the disk cache,
the memory cache, and the shared memory cache.
- If the platform supports processes, then you can monitor the disk cache
and shared memory cache only.

> 2.How should the mod_cache_requester send the re-request to the main
> server.

You fire off a subrequest to an URL, and throw away the data that comes back.

For some example code, look at mod_include.

> 3.Other than these questions, any suggestion/correction is welcome.
> Any pointers to the details of related modules( mod-cache,
> communication between mod-cache and backend server) would be helpful
> too.

Keep in mind that mod_cache is a framework, into which sub-modules are
plugged to do the work of the backend caching.

mod_cache_requester would probably be a submodule of mod_cache, using
mod_cache provided hooks to query elements in the cache.

Regards,
Graham
--



Re: mod-cache-requestor plan

2005-07-11 Thread [EMAIL PROTECTED]
Hi Parin
I'm a newbie too and I was preparing a clear_cache command in order to force 
cleaning of some desired path.
Your idea is very good.
My personal considerations are:
1 - you have to avoid multiple requests to the back end: if the page is very 
frequently clicked your mechanism can bring the system to perform many 
regeneration requests to the back end, so maybe the best approach is to use a 
queue to purge duplicate requests
2 - if you already know the expiry time of the pages (you know it when the page 
is inserted into the queue) you could "book" the request for the regeneration 
instead of waiting for a request close to the expiration
3 - it could be useful to perform regeneration just of "highly clicked" URLs, 
introducing a heuristic to define what a "highly clicked" URL is, in a way 
similar to how pages are swapped out of memory in virtual memory management
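A sketch of one such heuristic, scoring each URL by hit count and recency, in the spirit of a page-replacement policy. The score formula, field names, and threshold are arbitrary placeholders, not anything mod_cache actually tracks:

```c
#include <time.h>

typedef struct {
    unsigned long hits;        /* would be incremented on every cache hit */
    time_t        last_access; /* timestamp of the most recent request */
} url_stats;

/* Higher score == more worth refreshing proactively.  Frequently and
 * recently requested URLs score high; idle URLs decay toward zero. */
double popularity_score(const url_stats *s, time_t now)
{
    double idle = (double)(now - s->last_access) + 1.0; /* avoid div by 0 */
    return (double)s->hits / idle;
}

/* The spider would refresh only URLs above some configured threshold. */
int worth_refreshing(const url_stats *s, time_t now, double threshold)
{
    return popularity_score(s, now) >= threshold;
}
```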

Bye

 Sergio

> From: Parin Shah <[EMAIL PROTECTED]>
> Date: Sun, 10 Jul 2005 23:24:10 -0500
> To: dev@httpd.apache.org
> Subject: mod-cache-requestor plan
>
> Hi All,
> 
> I am a newbie. I am going to work on mod-cache and a new module
> mod-cache-requester as a part of Soc program.
> 
> Small description of the module is as follows.
> 
> When the page expires from the cache, it is removed from the cache and
> thus the next request has to wait until that page is reloaded by the
> back-end server. But if we add one more module which re-requests the
> soon-to-expire pages, such pages won't be removed from the
> cache, and that would reduce the response time.
> 
> Here is the overview of how I am planning to implement it.
> 
> 1. when a page is requested and it exists in the cache, mod_cache
> checks the expiry time of the page.
> 
> 2. If (expiry time – current time)  < Some_Constant_Value,
> then mod-cache notifies mod_cache_requester about this page. 
> This communication between mod_cache and mod_cache_requester should
> incur least overhead as this would affect current request's response
> time.
> 
> 3. mod_cache_requester will re-request the page which is soon-to-expire.
> Each such request is done through separate thread so that multiple
> pages could be re-requested simultaneously.
> 
> This request would force the server to reload the content of the page
> into the cache even if it is already there. (this would reset the
> expiry time of the page and thus it would be able to stay in the cache
> for longer duration.)
> 
> Please let me know what you think about this module. Also I have some
> questions  and your help would be really useful.
> 
> 1.what would be the best way for communication between mod_cache and
> mod_cache_requester.  I believe that keeping  mod_cache_requester in a
> separate thread would be the best way.
> 
> 2.How should the mod_cache_requester send the re-request to the main
> server. I believe that sending it as if the request had come from
> some client would be the best way to implement it. But we need to attach
> some special status to this request so that cache_lookup is
> bypassed and the output_filter is not added, as we don't need to stream
> the output.
> 
> 3.Other than these questions, any suggestion/correction is welcome.
> Any pointers to the details of related modules( mod-cache,
> communication between mod-cache and backend server) would be helpful
> too.
> 
> Thanks,
> Parin.
>