Re: mod-cache-requestor plan
This would definitely relieve mod-cache from checking the status of the page every time. But then we would not be able to keep track of the popularity of the pages. Still, this is a good observation. If we could come up with a mechanism to keep track of the popularity of pages (number of requests, and last access time) without mod-cache's interference, then that would be a better approach.

-Parin.

On 7/22/05, Sergio Leonardi <[EMAIL PROTECTED]> wrote:
> The basic approach is ok for me, I'll just make one note.
> I think that mod_cache should put each cached page in the queue at the time
> its entry in the cache is created (or when its expire time has been
> changed), setting the proper regeneration time in the queue (e.g.
> regeneration time = page expire time - time spent for last page generation).
>
> In such a way there's no need to look up what's expiring, just sleep
> until something needs to be regenerated.
> Bye
>
> Sergio
>
> -----Original Message-----
> From: Parin Shah [mailto:[EMAIL PROTECTED]
> Sent: Friday, 22 July 2005 8:02
> To: dev@httpd.apache.org
> Subject: Re: mod-cache-requestor plan
>
> Thanks Ian, Graham and Sergio for your help.
>
> For the past couple of days I have been trying to figure out how our
> mod-cache-requester should spawn a thread (or set of threads).
> Currently, I am considering the following option; please let me know
> what you think about this approach.
>
> - mod-cache-requester would be a sub-module of mod-cache, as Graham
> once suggested.
>
> - It would look similar to mod-mem-cache. It would have a provider
> (mod-cache-requester-provider, for lack of a better word for now)
> registered.
>
> - mod-cache (cache_url_handler, to be precise) will look up this
> provider and use the provider's methods to push any page which is
> soon-to-be-expired into the priority queue.
>
> - In the post config of mod-cache-requester our pqueue would be
> initialized, along with mutexes and other machinery.
>
> - Now, we would create a new thread (or set of threads) in the post
> config which would basically contain an infinite loop. It (or they)
> will keep checking the pqueue and make sub-requests accordingly.
>
> Does this make sense?
>
> If this approach is correct then I have some questions regarding the
> thread vs. process implementation. I will start discussing it once we
> have the main architecture in place.
>
> Thanks,
> Parin.
>
> On 7/20/05, Graham Leggett <[EMAIL PROTECTED]> wrote:
> > Parin Shah wrote:
> >
> > > 2. how mod-cache-requester can generate the sub request just to reload
> > > the content in the cache.
> >
> > Look inside mod_include - it uses subrequests to be able to embed pages
> > within other pages.
> >
> > Regards,
> > Graham
> > --
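Parin's plan above (a mutex-protected priority queue filled by the cache handler and drained by post_config worker threads) can be sketched in plain C. This is a hypothetical stand-in, not httpd code: the real module would use APR mutexes and the mod_mem_cache priority queue, and every name here is invented for illustration.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

#define RQ_MAX 64

/* One soon-to-be-expired page waiting to be refreshed. */
typedef struct {
    time_t regen_time;           /* when the page should be re-requested */
    char   url[256];
} rq_entry;

typedef struct {
    rq_entry items[RQ_MAX];      /* kept sorted by regen_time, earliest first */
    int      count;
} refresh_queue;

/* What cache_url_handler would call via the provider: insert an entry,
 * keeping the array sorted so the head is always the next job due. */
static int rq_push(refresh_queue *q, const char *url, time_t regen_time)
{
    int i;
    if (q->count == RQ_MAX)
        return -1;
    i = q->count++;
    while (i > 0 && q->items[i - 1].regen_time > regen_time) {
        q->items[i] = q->items[i - 1];
        i--;
    }
    q->items[i].regen_time = regen_time;
    strncpy(q->items[i].url, url, sizeof(q->items[i].url) - 1);
    q->items[i].url[sizeof(q->items[i].url) - 1] = '\0';
    return 0;
}

/* One iteration of the worker's infinite loop: if the head entry is
 * due, pop it and return it (the real module would issue a subrequest
 * for the URL here); NULL means nothing is due yet, so sleep. */
static const rq_entry *rq_pop_due(refresh_queue *q, time_t now)
{
    static rq_entry out;
    if (q->count == 0 || q->items[0].regen_time > now)
        return NULL;
    out = q->items[0];
    memmove(&q->items[0], &q->items[1], (--q->count) * sizeof(rq_entry));
    return &out;
}
```

A real worker thread would wrap `rq_pop_due` in a loop under a mutex and sleep until the head entry's `regen_time`; the sorted insert is what makes "just sleep until something needs to be regenerated" possible.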
RE: mod-cache-requestor plan
The basic approach is ok for me, I'll just make one note. I think that mod_cache should put each cached page in the queue at the time its entry in the cache is created (or when its expire time has been changed), setting the proper regeneration time in the queue (e.g. regeneration time = page expire time - time spent for last page generation).

In such a way there's no need to look up what's expiring, just sleep until something needs to be regenerated.

Bye
Sergio

-----Original Message-----
From: Parin Shah [mailto:[EMAIL PROTECTED]
Sent: Friday, 22 July 2005 8:02
To: dev@httpd.apache.org
Subject: Re: mod-cache-requestor plan

Thanks Ian, Graham and Sergio for your help.

For the past couple of days I have been trying to figure out how our mod-cache-requester should spawn a thread (or set of threads). Currently, I am considering the following option; please let me know what you think about this approach.

- mod-cache-requester would be a sub-module of mod-cache, as Graham once suggested.

- It would look similar to mod-mem-cache. It would have a provider (mod-cache-requester-provider, for lack of a better word for now) registered.

- mod-cache (cache_url_handler, to be precise) will look up this provider and use the provider's methods to push any page which is soon-to-be-expired into the priority queue.

- In the post config of mod-cache-requester our pqueue would be initialized, along with mutexes and other machinery.

- Now, we would create a new thread (or set of threads) in the post config which would basically contain an infinite loop. It (or they) will keep checking the pqueue and make sub-requests accordingly.

Does this make sense?

If this approach is correct then I have some questions regarding the thread vs. process implementation. I will start discussing it once we have the main architecture in place.

Thanks,
Parin.

On 7/20/05, Graham Leggett <[EMAIL PROTECTED]> wrote:
> Parin Shah wrote:
>
> > 2. how mod-cache-requester can generate the sub request just to reload
> > the content in the cache.
>
> Look inside mod_include - it uses subrequests to be able to embed pages
> within other pages.
>
> Regards,
> Graham
> --
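Sergio's scheduling rule above (regeneration time = page expire time - time spent for last page generation) is one line of arithmetic. A hedged sketch, with invented field names rather than actual mod_cache structures:

```c
#include <assert.h>
#include <time.h>

/* Illustrative stand-in for the per-entry metadata mod_cache would
 * record; these are not real mod_cache fields. */
typedef struct {
    time_t expire_time;      /* when the cached copy goes stale          */
    time_t last_gen_secs;    /* how long the backend took last time      */
} cache_meta;

/* Schedule the refresh early enough that regeneration finishes just
 * as the cached copy expires, so a fresh copy is always available. */
static time_t regen_time(const cache_meta *m)
{
    return m->expire_time - m->last_gen_secs;
}
```

The worker thread then only needs to sleep until the smallest `regen_time` in the queue, exactly as Sergio describes.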
Re: mod-cache-requestor plan
Thanks Ian, Graham and Sergio for your help.

For the past couple of days I have been trying to figure out how our mod-cache-requester should spawn a thread (or set of threads). Currently, I am considering the following option; please let me know what you think about this approach.

- mod-cache-requester would be a sub-module of mod-cache, as Graham once suggested.

- It would look similar to mod-mem-cache. It would have a provider (mod-cache-requester-provider, for lack of a better word for now) registered.

- mod-cache (cache_url_handler, to be precise) will look up this provider and use the provider's methods to push any page which is soon-to-be-expired into the priority queue.

- In the post config of mod-cache-requester our pqueue would be initialized, along with mutexes and other machinery.

- Now, we would create a new thread (or set of threads) in the post config which would basically contain an infinite loop. It (or they) will keep checking the pqueue and make sub-requests accordingly.

Does this make sense?

If this approach is correct then I have some questions regarding the thread vs. process implementation. I will start discussing it once we have the main architecture in place.

Thanks,
Parin.

On 7/20/05, Graham Leggett <[EMAIL PROTECTED]> wrote:
> Parin Shah wrote:
>
> > 2. how mod-cache-requester can generate the sub request just to reload
> > the content in the cache.
>
> Look inside mod_include - it uses subrequests to be able to embed pages
> within other pages.
>
> Regards,
> Graham
> --
Re: mod-cache-requestor plan
Parin Shah wrote:
> 2. how mod-cache-requester can generate the sub request just to reload
> the content in the cache.

Look inside mod_include - it uses subrequests to be able to embed pages within other pages.

Regards,
Graham
--
RE: mod-cache-requestor plan
Hi,

I can just comment on point #2. In my opinion mod-cache-requester should pass in the regeneration request what a normal user would pass to the system (e.g. cookies, header variables and so on), because a portion of these data can be relevant to generating the page correctly. mod-cache currently keeps track of the HTTP response (header and body); I think it is best to keep track of the HTTP request too, in order to re-run a copy of it to regenerate the page.

Does it make sense?

Sergio

-----Original Message-----
From: Parin Shah [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 20 July 2005 8:34
To: dev@httpd.apache.org
Subject: Re: mod-cache-requestor plan

Hi All,

We are now almost at consensus about this new mod-cache-requester module's mechanism, and I believe it is now a good time to start implementing the module. But before I can do that, I need some help from you guys.

- I am now comfortable with mod-cache, mod-mem-cache, cache_storage.c, cache_util.c etc.

- But I am still not too sure how to implement a couple of things:

1. How to start the new thread/process for mod-cache-requester when the server starts. Any similar piece of code would help me a lot.

2. How mod-cache-requester can generate the sub request just to reload the content in the cache.

3. In the current scheme, whenever mod-cache-requester pulls the first entry from the pqueue (the 'refresh' queue) it re-requests it to reload. Now, by the time this re-request is done, the page might actually have expired and been removed from the cache. In such a case, should mod-cache reload it, or should it wait for the next legitimate request?

Your thoughts on any/all of these issues would be really helpful.

Thanks,
Parin.

On 7/19/05, Ian Holsman <[EMAIL PROTECTED]> wrote:
> Parin Shah wrote:
> >> you should be using a mix of
> >>
> >> # requests
> >> last access time
> >> cost of reproducing the request.
> >
> > Just to double check, we would insert an entry into the 'refresh queue'
> > only if the page is requested and the page is soon-to-be-expired. Once
> > it is in the queue we would use the above parameters to calculate the
> > priority. Is this correct? Let me know if I have misunderstood it.
>
> Yep, that's the idea: refresh the most-popular pages first.
>
> >> see memcache_gdsf_algorithm() in mod_mem_cache.c for an implementation
> >> of this, which assumes 'length' of request is related to the cost of
> >> reproducing the request.
> >>
> >> the priority queue implementation is sitting in mod_mem_cache, and could
> >> be used to implement the 'refresh' queue I would think.
> >
> > I feel comfortable with the mod-cache and mod-mem-cache code now. But we
> > also need to start a new thread/process for mod-cache-requester when
> > the server starts. I am not too sure how we could implement it. Any
> > pointers to a similar piece of code would be really helpful to me.
>
> I don't have any code which does this to share with you (others might
> know of some).
>
> > Thanks,
> > Parin.
>
> --Ian
Re: mod-cache-requestor plan
Hi All,

We are now almost at consensus about this new mod-cache-requester module's mechanism, and I believe it is now a good time to start implementing the module. But before I can do that, I need some help from you guys.

- I am now comfortable with mod-cache, mod-mem-cache, cache_storage.c, cache_util.c etc.

- But I am still not too sure how to implement a couple of things:

1. How to start the new thread/process for mod-cache-requester when the server starts. Any similar piece of code would help me a lot.

2. How mod-cache-requester can generate the sub request just to reload the content in the cache.

3. In the current scheme, whenever mod-cache-requester pulls the first entry from the pqueue (the 'refresh' queue) it re-requests it to reload. Now, by the time this re-request is done, the page might actually have expired and been removed from the cache. In such a case, should mod-cache reload it, or should it wait for the next legitimate request?

Your thoughts on any/all of these issues would be really helpful.

Thanks,
Parin.

On 7/19/05, Ian Holsman <[EMAIL PROTECTED]> wrote:
> Parin Shah wrote:
> >> you should be using a mix of
> >>
> >> # requests
> >> last access time
> >> cost of reproducing the request.
> >
> > Just to double check, we would insert an entry into the 'refresh queue'
> > only if the page is requested and the page is soon-to-be-expired. Once
> > it is in the queue we would use the above parameters to calculate the
> > priority. Is this correct? Let me know if I have misunderstood it.
>
> Yep, that's the idea: refresh the most-popular pages first.
>
> >> see memcache_gdsf_algorithm() in mod_mem_cache.c for an implementation
> >> of this, which assumes 'length' of request is related to the cost of
> >> reproducing the request.
> >>
> >> the priority queue implementation is sitting in mod_mem_cache, and could
> >> be used to implement the 'refresh' queue I would think.
> >
> > I feel comfortable with the mod-cache and mod-mem-cache code now. But we
> > also need to start a new thread/process for mod-cache-requester when
> > the server starts. I am not too sure how we could implement it. Any
> > pointers to a similar piece of code would be really helpful to me.
>
> I don't have any code which does this to share with you (others might
> know of some).
>
> > Thanks,
> > Parin.
>
> --Ian
Re: mod-cache-requestor plan
Parin Shah wrote:
>> you should be using a mix of
>>
>> # requests
>> last access time
>> cost of reproducing the request.
>
> Just to double check, we would insert an entry into the 'refresh queue'
> only if the page is requested and the page is soon-to-be-expired. Once
> it is in the queue we would use the above parameters to calculate the
> priority. Is this correct? Let me know if I have misunderstood it.

Yep, that's the idea: refresh the most-popular pages first.

>> see memcache_gdsf_algorithm() in mod_mem_cache.c for an implementation
>> of this, which assumes 'length' of request is related to the cost of
>> reproducing the request.
>>
>> the priority queue implementation is sitting in mod_mem_cache, and could
>> be used to implement the 'refresh' queue I would think.
>
> I feel comfortable with the mod-cache and mod-mem-cache code now. But we
> also need to start a new thread/process for mod-cache-requester when
> the server starts. I am not too sure how we could implement it. Any
> pointers to a similar piece of code would be really helpful to me.

I don't have any code which does this to share with you (others might know of some).

> Thanks,
> Parin.

--Ian
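Ian's suggested mix of request count, recency, and regeneration cost is roughly the classic GDSF (Greedy-Dual-Size-Frequency) score, K = L + F·C/S. A simplified sketch follows; the actual memcache_gdsf_algorithm() in mod_mem_cache.c differs in detail, and treating the response length as the cost is the same approximation Ian mentions:

```c
#include <assert.h>

/* Simplified GDSF-style refresh priority: an aging "clock" term (which
 * stands in for recency of access), a hit count, and an estimated
 * regeneration cost normalized by size.  Hypothetical helper, not the
 * mod_mem_cache implementation. */
static double refresh_priority(double clock,       /* aging / recency term    */
                               unsigned long hits, /* # requests              */
                               double cost,        /* est. regeneration cost  */
                               double size)        /* bytes cached            */
{
    /* K = L + F * C / S; when nothing better is known, cost can be
     * approximated by the response length, so C/S collapses to ~1 and
     * the score degenerates to clock + hits. */
    return clock + (double)hits * cost / size;
}
```

With this shape, popular pages (high hit count) and expensive pages (high cost per byte) float to the front of the refresh queue, which is exactly the "refresh the most-popular pages first" behaviour being discussed.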
Re: mod-cache-requestor plan
> you should be using a mix of
>
> # requests
> last access time
> cost of reproducing the request.

Just to double check: we would insert an entry into the 'refresh queue' only if the page is requested and the page is soon-to-be-expired. Once it is in the queue we would use the above parameters to calculate the priority. Is this correct? Let me know if I have misunderstood it.

> see memcache_gdsf_algorithm() in mod_mem_cache.c for an implementation
> of this, which assumes 'length' of request is related to the cost of
> reproducing the request.
>
> the priority queue implementation is sitting in mod_mem_cache, and could
> be used to implement the 'refresh' queue I would think.

I feel comfortable with the mod-cache and mod-mem-cache code now, but we also need to start a new thread/process for mod-cache-requester when the server starts. I am not too sure how we could implement it. Any pointers to a similar piece of code would be really helpful to me.

Thanks,
Parin.
Re: mod-cache-requestor plan
Parin Shah wrote:
> On 7/15/05, Colm MacCarthaigh <[EMAIL PROTECTED]> wrote:
>> On Fri, Jul 15, 2005 at 01:23:29AM -0500, Parin Shah wrote:
>>> - we need to maintain a counter for the url in this case which would
>>> decide the priority of the url. But maintaining this counter should
>>> be a low-overhead operation, I believe.
>>
>> Is a counter strictly speaking the right approach? Why not a time of
>> last access?
>>
>> I haven't run a statistical analysis, but based on my logs the
>> likelihood of a url being accessed is very highly correlated to how
>> recently it has been accessed before. A truly popular page will always
>> have been accessed recently, a page that is becoming popular (and
>> therefore very likely to get future hits) will have been accessed
>> recently, and a page whose popularity is rapidly diminishing will not
>> have been accessed recently.
>
> Last access time is definitely a better solution compared to the
> counter mechanism. I would like to know other people's opinions too.

you should be using a mix of

# requests
last access time
cost of reproducing the request.

see memcache_gdsf_algorithm() in mod_mem_cache.c for an implementation of this, which assumes 'length' of request is related to the cost of reproducing the request.

the priority queue implementation is sitting in mod_mem_cache, and could be used to implement the 'refresh' queue I would think.

> Thanks,
> Parin.
Re: mod-cache-requestor plan
On 7/16/05, Graham Leggett <[EMAIL PROTECTED]> wrote:
> Parin Shah wrote:
>
> > - I would prefer the approach where we maintain a priority queue to
> > keep track of popularity. But again, you guys have more insight and
> > understanding, so whichever approach you decide on, I am ready to
> > work on it! ;-)
>
> Beware of scope creep - we can always start with something simple, like
> a straight list of URLs, and then add the priority later, depending on
> how easy or difficult it is to do.

Good point. We could start with something simple as you said, and adding the priority queue should not be difficult once we have the basic mechanism ready.

Thanks,
Parin.
Re: mod-cache-requestor plan
Parin Shah wrote:
> - I would prefer the approach where we maintain a priority queue to
> keep track of popularity. But again, you guys have more insight and
> understanding, so whichever approach you decide on, I am ready to
> work on it! ;-)

Beware of scope creep - we can always start with something simple, like a straight list of URLs, and then add the priority later, depending on how easy or difficult it is to do.

Regards,
Graham
--
Re: mod-cache-requestor plan
On 7/15/05, Colm MacCarthaigh <[EMAIL PROTECTED]> wrote:
> On Fri, Jul 15, 2005 at 01:23:29AM -0500, Parin Shah wrote:
> > - we need to maintain a counter for the url in this case which would
> > decide the priority of the url. But maintaining this counter should
> > be a low-overhead operation, I believe.
>
> Is a counter strictly speaking the right approach? Why not a time of
> last access?
>
> I haven't run a statistical analysis, but based on my logs the
> likelihood of a url being accessed is very highly correlated to how
> recently it has been accessed before. A truly popular page will always
> have been accessed recently, a page that is becoming popular (and
> therefore very likely to get future hits) will have been accessed
> recently, and a page whose popularity is rapidly diminishing will not
> have been accessed recently.

Last access time is definitely a better solution compared to the counter mechanism. I would like to know other people's opinions too.

Thanks,
Parin.
Re: mod-cache-requestor plan
On Fri, Jul 15, 2005 at 01:23:29AM -0500, Parin Shah wrote:
> - we need to maintain a counter for the url in this case which would
> decide the priority of the url. But maintaining this counter should
> be a low-overhead operation, I believe.

Is a counter strictly speaking the right approach? Why not a time of last access? They give slightly different results, but each is more useful than the other in certain situations.

Before htcacheclean existed, I used find and simply deleted files in order of oldest atime attribute until I had enough free space. That kind of behaviour was very useful in my situation (although it involved mounting without the noatime mount option, which I dislike for other reasons).

I haven't run a statistical analysis, but based on my logs the likelihood of a url being accessed is very highly correlated to how recently it has been accessed before. A truly popular page will always have been accessed recently, a page that is becoming popular (and therefore very likely to get future hits) will have been accessed recently, and a page whose popularity is rapidly diminishing will not have been accessed recently.

--
Colm MacCárthaigh
Public Key: [EMAIL PROTECTED]
Re: mod-cache-requestor plan
Thanks all for your thoughts on this issue.

> > The priority re-fetch would make sure the
> > popular pages are always in cache, while others are allowed to die at
> > their expense.
>
> So every request for an object would update a counter for that url?

- We need to maintain a counter for the url in this case, which would decide the priority of the url. But maintaining this counter should be a low-overhead operation, I believe.

> Both approaches have disadvantages. I guess you just have to choose your
> poison :)

- I would prefer the approach where we maintain a priority queue to keep track of popularity. But again, you guys have more insight and understanding, so whichever approach you decide on, I am ready to work on it! ;-)

Thanks,
Parin.
Re: mod-cache-requestor plan
On 7/14/05 9:59 AM, "Ian Holsman" <[EMAIL PROTECTED]> wrote:
> that wouldn't keep track of the popularity of the given url, only when
> it is stored.

Which would be a useful input to something like htcacheclean so that it does not have to scan directories.

> The priority re-fetch would make sure the
> popular pages are always in cache, while others are allowed to die at
> their expense.

So every request for an object would update a counter for that url? I still think this would be better handled as an external process with some "glue" between it and apache (IPC, dbm, shm, etc.).

Both approaches have disadvantages. I guess you just have to choose your poison :)

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Re: mod-cache-requestor plan
This was a private message. I will continue this one offline.

Akins, Brian wrote:
> On 7/13/05 6:36 PM, "Ian Holsman" <[EMAIL PROTECTED]> wrote:
>> Hi There.
>>
>> just remember that this project is Parin's SoC project, and he is
>> expected to do the code on it.
>
> sure. I am expected to do what's best for my employer and the httpd
> project.
>
>> While normally I think it would be great to get a patch, we need Parin
>> to do the work on this, otherwise he might get a bit upset when it
>> comes to getting paid.
>
> He can have the patch. I just want to see it done right.
Re: mod-cache-requestor plan
Akins, Brian wrote:
> On 7/13/05 6:41 PM, "Ian Holsman" <[EMAIL PROTECTED]> wrote:
>> a pool of threads read the queue and start fetching the content, and
>> re-filling the cache with fresh responses.
>
> How is this better than simply having an external cron job to fetch the
> urls? You have total control of throttling there and it doesn't muck up
> the cache code.
>
> A good idea may be to have a "cache store hook" that gets called after
> a cache object is stored. In it, another module could keep track of
> cached url's. This list could be fed to the above cron job. I know one
> big web site that may do it in a similar way...

that wouldn't keep track of the popularity of the given url, only when it is stored.

I'm guessing the popularity of news stories on CNN is directly proportional to whether they are linked off one of the doors or have just been published. Other large sites (for example product reviews, or things like webshots.com) get most of their traffic indirectly via searches, not directly from a link on a door, and have traffic patterns more like a Zipf distribution.

The priority re-fetch would make sure the popular pages are always in cache, while others are allowed to die at their expense.

BTW, I'm not saying it's better, I'm just saying it's different, and news sites aren't the only large sites in town who need caches.

Regards,
Ian
Re: mod-cache-requestor plan
On 7/13/05 6:41 PM, "Ian Holsman" <[EMAIL PROTECTED]> wrote:
> a pool of threads read the queue and start fetching the content, and
> re-filling the cache with fresh responses.

How is this better than simply having an external cron job to fetch the urls? You have total control of throttling there and it doesn't muck up the cache code.

A good idea may be to have a "cache store hook" that gets called after a cache object is stored. In it, another module could keep track of cached url's. This list could be fed to the above cron job. I know one big web site that may do it in a similar way...

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Re: mod-cache-requestor plan
On 7/13/05 6:36 PM, "Ian Holsman" <[EMAIL PROTECTED]> wrote:
> Hi There.
>
> just remember that this project is Parin's SoC project, and he is
> expected to do the code on it.

sure. I am expected to do what's best for my employer and the httpd project.

> While normally I think it would be great to get a patch, we need Parin
> to do the work on this, otherwise he might get a bit upset when it
> comes to getting paid.

He can have the patch. I just want to see it done right.

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Re: mod-cache-requestor plan
What my initial idea for this was: we feed the 'soon to be expired' URLs into a priority queue (similar to mod-mem-cache's). A pool of threads reads the queue and starts fetching the content, re-filling the cache with fresh responses.

The benefit of this method is that we control exactly how hard we hit the back ends, as well as fetching the important stuff first.

This is slightly different from where this thread is going. I'm open to both, but I think the method below could still result in swamping the backend server when lots of unique URLs get requested.

--Ian

Parin Shah wrote:
>> We have been down this road. The way one might solve it is to allow
>> mod_cache to be able to reload an object while serving the "old" one.
>>
>> Example:
>>
>> cache /A for 600 seconds
>>
>> after 500 seconds, request /A with special header (or from special
>> client, etc) and cache does not serve from cache, but rather pretends
>> the cache has expired. do normal refresh stuff.
>>
>> The cache will continue to serve /A even though it is refreshing it
>
> As Graham suggested, such a mechanism will not refresh pages that are
> non-popular but expensive to load, which could incur a lot of overhead.
> But other than that, this looks like a really good solution.
>
>> Also, one of the flaws of mod_disk_cache (at least the version I am
>> looking at) is that it deletes objects before reloading them. It is
>> better for many reasons to only replace them. That's the best way to
>> accomplish what I described above.
>
> If we implement it the way you suggested, then this problem would
> automatically be solved.
>
> -Parin.
Re: mod-cache-requestor plan
On 7/13/05 2:43 PM, "Graham Leggett" <[EMAIL PROTECTED]> wrote:
> This was one of the basic design goals of the new cache, but the code
> for it was never written.
>
> It was logged as a bug against the original v1.3 proxy cache, which
> suffered from thundering herd when cache entries expired.
>
> At some point soon, when my mailbox is a little less full and I have a
> week or two spare, I plan to fix this problem if nobody beats me to it :)

Should only take a couple of hours to get a working patch together. Maybe I'll get time this week for mod_disk_cache.

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
Re: mod-cache-requestor plan
Parin Shah wrote:
> - In this case, what would be the criteria to determine which pages
> should be refreshed and which should be left out? Initially I thought
> that all the pages that are about to expire and have been requested
> should be refreshed. But if we consider keeping non-popular but
> expensive pages in the cache, how would mod-cache-requester make the
> decision in that case?

The current backend cache modules use the CacheEnable directive to say "this backend is valid for this URL space". In theory mod_cache_requester could use this directive to hook itself in.

The cache is supposed to allow multiple backend modules the option to handle the request (for example, the mem cache might say "nope, too big for me" while the disk cache says "yep, can cache that file, let me handle it"), so requester would just add itself to the list - it would probably need to be first in the list.

If mod_cache_requester was in there, it could add that URL space to the list of URL spaces to be freshened on a periodic basis. You may need an extra cache hook that says "give me all cached URLs under URL space XXX"; I'm not sure such a thing exists at the moment.

Keep in mind the cache can store variants of the same URL (for example, the same page in different languages, or one compressed and another not compressed), which will have to be kept in mind while refreshing.

> - Considering that mod-cache-requester would be using some of
> mod-cache's hooks to query the elements in the cache, would
> mod-cache-requester still be highly dependent on the platform
> (process vs. threads)?

It would be dependent on whether threads or processes are present (or maybe both) rather than on the platform, which is a far simpler case to code for. Something along the lines of "if (threads or both), create refresher thread, else if process only, create refresh process, keeping in mind that refreshing the mem cache won't work".

Regards,
Graham
--
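Graham's "each backend module gets the option to handle the request" walk can be sketched with plain function pointers. All types and thresholds here are invented for illustration; the real provider list and its decision logic live in mod_cache:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical provider interface: each backend looks at the entry and
 * accepts or declines it, in the spirit of CacheEnable dispatch. */
typedef int (*cache_accept_fn)(const char *url, size_t len);

static int mem_cache_accept(const char *url, size_t len)
{
    (void)url;
    return len <= 4096;          /* "nope, too big for me" above 4 KB */
}

static int disk_cache_accept(const char *url, size_t len)
{
    (void)url;
    (void)len;
    return 1;                    /* "yep, can cache that file" */
}

/* Walk the providers in registration order; the first one to accept
 * handles the entry.  mod_cache_requester would sit first in this
 * list so it sees every cacheable URL. */
static int pick_provider(const cache_accept_fn *providers, int n,
                         const char *url, size_t len)
{
    int i;
    for (i = 0; i < n; i++)
        if (providers[i](url, len))
            return i;
    return -1;                   /* nobody wants it: not cached */
}
```

The first-in-list placement matters: a provider that only observes (like the requester) must run before providers that claim the entry outright.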
Re: mod-cache-requestor plan
Akins, Brian wrote:
> This avoids the "thundering herd" to the backend server/database/whatever
> handler. Trust me, it works :)

This was one of the basic design goals of the new cache, but the code for it was never written.

It was logged as a bug against the original v1.3 proxy cache, which suffered from thundering herd when cache entries expired.

At some point soon, when my mailbox is a little less full and I have a week or two spare, I plan to fix this problem if nobody beats me to it :)

Regards,
Graham
--
Re: mod-cache-requestor plan
On 7/12/05 10:27 PM, "Parin Shah" <[EMAIL PROTECTED]> wrote:
>> Also, one of the flaws of mod_disk_cache (at least the version I am
>> looking at) is that it deletes objects before reloading them. It is
>> better for many reasons to only replace them. That's the best way to
>> accomplish what I described above.
>
> If we implement it the way you suggested, then this problem would
> automatically be solved.

The basic flow of mod_disk_cache should be something like:

Determine cache key. Do meta and data exist?
Yes -> check for expiry, serve it, etc.
No: insert filter, etc.

In the filter, open a deterministic tmp file, not the "random" ones like in the current code - something like metafile.tmp. When the file is opened, try to open it exclusively. That way only one worker is trying to cache the file. After caching, rename from the tmp to the real files.

Using such a temp file scheme also allows you to be "sloppy" with your expiry times:

Does the meta file exist? Yes.
Is the meta file "fresh"? No.
Does a tmp file exist? Yes, someone else is "refreshing" it.
Is the tmp file less than x seconds old? Yes. Serve "stale" content.

This avoids the "thundering herd" to the backend server/database/whatever handler. Trust me, it works :)

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
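The deterministic-tmp-file flow described above can be sketched with POSIX calls. This is a stand-alone illustration of the exclusive-open-then-rename idea, not mod_disk_cache code; the path layout and return codes are invented:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Refresh one cache file.  Returns 0 on success, 1 if another worker
 * already holds the tmp file (so this worker should back off, or serve
 * the stale copy), -1 on error. */
static int refresh_entry(const char *path, const char *body)
{
    char tmp[512];
    int fd;
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    /* O_CREAT|O_EXCL: only one worker can win the open, so only one
     * worker regenerates this entry - no thundering herd. */
    fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return (errno == EEXIST) ? 1 : -1;

    if (write(fd, body, strlen(body)) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    /* rename() replaces the old copy atomically: readers see either
     * the old complete file or the new one, never a deleted gap. */
    return rename(tmp, path) == 0 ? 0 : -1;
}
```

This is also why replacing beats delete-then-reload: with rename() the old object stays servable for the entire regeneration window.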
Re: mod-cache-requestor plan
> We have been down this road. The way one might solve it is to allow
> mod_cache to be able to reload an object while serving the "old" one.
>
> Example:
>
> cache /A for 600 seconds
>
> after 500 seconds, request /A with special header (or from special
> client, etc) and cache does not serve from cache, but rather pretends
> the cache has expired. do normal refresh stuff.
>
> The cache will continue to serve /A even though it is refreshing it

As Graham suggested, such a mechanism will not refresh pages that are non-popular but expensive to load, which could incur a lot of overhead. But other than that, this looks like a really good solution.

> Also, one of the flaws of mod_disk_cache (at least the version I am
> looking at) is that it deletes objects before reloading them. It is
> better for many reasons to only replace them. That's the best way to
> accomplish what I described above.

If we implement it the way you suggested, then this problem would automatically be solved.

-Parin.
Re: mod-cache-requestor plan
On 7/11/05 11:48 PM, "Parin Shah" <[EMAIL PROTECTED]> wrote:
> ... should be refreshed. but, if we consider keeping non-popular but
> expensive pages in the cache, in that case how would the
> mod-c-requester make the decision?

We have been down this road. The way one might solve it is to allow mod_cache to reload an object while serving the "old" one.

Example:

cache /A for 600 seconds

After 500 seconds, request /A with a special header (or from a special client, etc.) and the cache does not serve from cache, but rather pretends the cache has expired and does the normal refresh stuff.

The cache will continue to serve /A even though it is refreshing it.

Make any sense? This way, a simple cron job can be used to refresh desired objects.

Also, one of the flaws of mod_disk_cache (at least the version I am looking at) is that it deletes objects before reloading them. It is better for many reasons to only replace them. That's the best way to accomplish what I described above.

--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies
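The /A example above reduces to a small predicate: with a 600-second lifetime and a refresh triggered from second 500 on, the old copy keeps being served while the regeneration runs. A hedged sketch; the lead-time parameter is hypothetical, not an actual mod_cache setting:

```c
#include <assert.h>

/* Should a request for this entry trigger an early refresh?  The entry
 * is still served from cache either way; this only decides whether to
 * kick off regeneration in the background.  In Brian's example:
 * ttl = 600, lead = 100, so refreshing starts at age 500. */
static int needs_early_refresh(long age_secs, long ttl_secs, long lead_secs)
{
    return age_secs >= ttl_secs - lead_secs   /* inside the refresh window */
        && age_secs <  ttl_secs;              /* not yet actually expired  */
}
```

Past `ttl_secs` the entry is genuinely expired and the normal miss path applies; the point of the window is that popular pages never reach that state.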
Re: mod-cache-requestor plan
> I believe the basic idea of forwarding multiple requests on the back
> end can be a very good idea, but needs some bounds as Graham suggests.

It's an interesting thought. But after Graham's opinion, I am not too sure about the ratio of performance improvement to the overhead incurred by the threads. If we could gain a significant performance improvement (which would be the case when the server is lightly loaded), then it is worth going for multiple sub-requests with some upper bound.

-Parin
Re: mod-cache-requestor plan
> - Cache freshness of an URL is checked on each hit to the URL. This runs
> the risk of allowing non-popular (but possibly expensive) URLs to expire
> without the chance to be refreshed.
>
> - Cache freshness is checked in an independent thread, which monitors the
> cached URLs for freshness at predetermined intervals, and updates them
> automatically and independently of the frontend.
>
> Either way, it would be useful for mod_cache_requester to operate
> independently of the cache serving requests, so that "cache freshening"
> doesn't slow down the frontend.
>
> I would vote for the second option - a "cache spider" that keeps it fresh.

- In this case, what would be the criteria to determine which pages should be refreshed and which should be left out? Initially I thought that all the pages that are about to expire and have been requested should be refreshed. But if we consider keeping non-popular but expensive pages in the cache, how would the mod-c-requester make that decision?

> Once mod_cache_requester has decided that a URL needs to be "freshened",
> all it needs to do is to make a subrequest to that URL setting the
> relevant Cache-Control headers to tell it to refresh the cache, and let
> the normal caching mechanism take its course.

- Hmm, this seems to be the most elegant solution.

> mod_cache_requester would probably be a submodule of mod_cache, using
> mod_cache provided hooks to query elements in the cache.

- Considering that mod-cache-requester would be using some of mod-cache's hooks to query the elements in the cache, would mod-cache-requester still be highly dependent on the platform (process vs. threads)?

Thanks a lot for all this valuable information, Graham.

Parin.
Re: mod-cache-requestor plan
Hi all

I basically agree with Graham, with just one observation on multi-threaded subrequests. I believe the basic idea of forwarding multiple requests on the back end can be a very good idea, but needs some bounds as Graham suggests. In my opinion you can define a mod_cache_requester connection pool to the back end server, limited in size in order to avoid back end saturation.

Using this approach you could design a mod_cache_requester in such a way:

- use a priority queue to keep track of needed cache refresh requests and the scheduled time for each refresh (this latter data is the sorting key for the priority)
- each time an URL is requested for the first time you should cache the request data (in addition to the response header) and fill the priority queue with the required data
- each mod_cache_requester thread can read one URL from the queue and pass the request (stored previously) to the back end

In such a way you can realize an "optimized" requester. What do you think of it?

Sergio

> Da: "Graham Leggett"
>
> Parin Shah said:
>
> > When the page expires from the cache, it is removed from cache and
> > thus the next request has to wait until that page is reloaded by the
> > back-end server.
>
> This is not strictly true - when a page expires from the cache, a
> conditional request is sent to the backend server, and if a fresher
> version is available it is updated, otherwise the existing cache contents
> are left alone. Place was left in the original cache design for serving
> multiple requests of the same non-fresh URL without fetching the backend
> URL many times, but this has not yet been implemented.
>
> The option to guarantee freshness of the cache is a very useful feature
> though.
>
> > Here is an overview of how I am planning to implement it.
> >
> > 1. when a page is requested and it exists in the cache, mod_cache
> > checks the expiry time of the page.
> >
> > 2.
> > If (expiry time - current time) < Some_Constant_Value,
> > then mod-cache notifies mod_cache_requester about this page.
> > This communication between mod_cache and mod_cache_requester should
> > incur the least overhead, as this would affect the current request's
> > response time.
>
> There are two approaches to this:
>
> - Cache freshness of an URL is checked on each hit to the URL. This runs
> the risk of allowing non-popular (but possibly expensive) URLs to expire
> without the chance to be refreshed.
>
> - Cache freshness is checked in an independent thread, which monitors the
> cached URLs for freshness at predetermined intervals, and updates them
> automatically and independently of the frontend.
>
> Either way, it would be useful for mod_cache_requester to operate
> independently of the cache serving requests, so that "cache freshening"
> doesn't slow down the frontend.
>
> I would vote for the second option - a "cache spider" that keeps it fresh.
>
> > 3. mod_cache_requester will re-request the page which is soon to expire.
> > Each such request is done through a separate thread so that multiple
> > pages could be re-requested simultaneously.
>
> Once mod_cache_requester has decided that a URL needs to be "freshened",
> all it needs to do is to make a subrequest to that URL, setting the
> relevant Cache-Control headers to tell it to refresh the cache, and let
> the normal caching mechanism take its course.
>
> Putting the subrequests into separate threads isn't necessarily a good
> idea, as you don't want to put a sudden simultaneous load onto the backend
> server, or take up too much processing power of the frontend itself. You
> also probably want to keep things simple.
>
> > This request would force the server to reload the content of the page
> > into the cache even if it is already there. (this would reset the
> > expiry time of the page and thus it would be able to stay in the cache
> > for a longer duration.)
>
> The cache code should already do this.
> > Please let me know what you think about this module. Also I have some
> > questions, and your help would be really useful.
> >
> > 1. what would be the best way for communication between mod_cache and
> > mod_cache_requester. I believe that keeping mod_cache_requester in a
> > separate thread would be the best way.
>
> mod_cache_requester will need access to the backend caches so that it can
> query freshness. This is done through hooks made available for mod_cache
> to do the same thing.
>
> Firing off a separate thread/process for mod_cache_requester can be done
> when the server starts up and the module is initialised, however keep in
> mind some of the limitations of threads and processes:
>
> - If the platform supports threads, then you can monitor the disk cache,
> the memory cache, and the shared memory cache.
> - If the platform supports processes, then you can monitor the disk cache
> and shared memory cache only.
>
> > 2. How should the mod_cache_requester send the re-request to the main
> > server.
>
> You fire off a subrequest to an URL, and throw away the data that comes
> back.
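Sergio's priority queue above could be prototyped in plain C as a small binary min-heap keyed by the scheduled regeneration time (e.g. expiry time minus the time the last page generation took, as he proposed earlier). This is only an illustrative sketch: the names (`refresh_queue`, `rq_push`, `rq_pop`) are invented, and a real module would allocate from APR pools and guard the queue with a mutex.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* One pending refresh: the URL and the time at which it should be
 * regenerated. */
typedef struct {
    char url[256];
    time_t regen_at;
} refresh_entry;

typedef struct {
    refresh_entry *items;
    size_t len, cap;
} refresh_queue;

static void rq_init(refresh_queue *q, size_t cap)
{
    q->items = malloc(cap * sizeof(refresh_entry));
    q->len = 0;
    q->cap = cap;
}

/* Sift-up insert: the entry due soonest ends up at items[0].
 * Caller ensures len < cap (and, per Sergio's point 1, would also
 * drop duplicate URLs before inserting). */
static void rq_push(refresh_queue *q, const char *url, time_t regen_at)
{
    size_t i = q->len++;
    strncpy(q->items[i].url, url, sizeof(q->items[i].url) - 1);
    q->items[i].url[sizeof(q->items[i].url) - 1] = '\0';
    q->items[i].regen_at = regen_at;
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (q->items[parent].regen_at <= q->items[i].regen_at)
            break;
        refresh_entry tmp = q->items[parent];
        q->items[parent] = q->items[i];
        q->items[i] = tmp;
        i = parent;
    }
}

/* Pop the entry that is due soonest; a requester thread would sleep
 * until items[0].regen_at and then fire the refresh. */
static refresh_entry rq_pop(refresh_queue *q)
{
    refresh_entry top = q->items[0];
    q->items[0] = q->items[--q->len];
    size_t i = 0;
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, min = i;
        if (l < q->len && q->items[l].regen_at < q->items[min].regen_at)
            min = l;
        if (r < q->len && q->items[r].regen_at < q->items[min].regen_at)
            min = r;
        if (min == i)
            break;
        refresh_entry tmp = q->items[min];
        q->items[min] = q->items[i];
        q->items[i] = tmp;
        i = min;
    }
    return top;
}
```

Because the head of the heap is always the next due entry, this gives exactly the behavior Sergio describes in the follow-up message: no periodic scan for expiring pages is needed, the requester just sleeps until the head entry comes due.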
Re: mod-cache-requestor plan
Parin Shah said:

> When the page expires from the cache, it is removed from cache and
> thus the next request has to wait until that page is reloaded by the
> back-end server.

This is not strictly true - when a page expires from the cache, a conditional request is sent to the backend server, and if a fresher version is available it is updated, otherwise the existing cache contents are left alone. Place was left in the original cache design for serving multiple requests of the same non-fresh URL without fetching the backend URL many times, but this has not yet been implemented.

The option to guarantee freshness of the cache is a very useful feature though.

> Here is an overview of how I am planning to implement it.
>
> 1. when a page is requested and it exists in the cache, mod_cache
> checks the expiry time of the page.
>
> 2. If (expiry time - current time) < Some_Constant_Value,
> then mod-cache notifies mod_cache_requester about this page.
> This communication between mod_cache and mod_cache_requester should
> incur the least overhead, as this would affect the current request's
> response time.

There are two approaches to this:

- Cache freshness of an URL is checked on each hit to the URL. This runs the risk of allowing non-popular (but possibly expensive) URLs to expire without the chance to be refreshed.

- Cache freshness is checked in an independent thread, which monitors the cached URLs for freshness at predetermined intervals, and updates them automatically and independently of the frontend.

Either way, it would be useful for mod_cache_requester to operate independently of the cache serving requests, so that "cache freshening" doesn't slow down the frontend.

I would vote for the second option - a "cache spider" that keeps it fresh.

> 3. mod_cache_requester will re-request the page which is soon to expire.
> Each such request is done through a separate thread so that multiple
> pages could be re-requested simultaneously.
Once mod_cache_requester has decided that a URL needs to be "freshened", all it needs to do is to make a subrequest to that URL, setting the relevant Cache-Control headers to tell it to refresh the cache, and let the normal caching mechanism take its course.

Putting the subrequests into separate threads isn't necessarily a good idea, as you don't want to put a sudden simultaneous load onto the backend server, or take up too much processing power of the frontend itself. You also probably want to keep things simple.

> This request would force the server to reload the content of the page
> into the cache even if it is already there. (this would reset the
> expiry time of the page and thus it would be able to stay in the cache
> for a longer duration.)

The cache code should already do this.

> Please let me know what you think about this module. Also I have some
> questions, and your help would be really useful.
>
> 1. what would be the best way for communication between mod_cache and
> mod_cache_requester. I believe that keeping mod_cache_requester in a
> separate thread would be the best way.

mod_cache_requester will need access to the backend caches so that it can query freshness. This is done through hooks made available for mod_cache to do the same thing.

Firing off a separate thread/process for mod_cache_requester can be done when the server starts up and the module is initialised, however keep in mind some of the limitations of threads and processes:

- If the platform supports threads, then you can monitor the disk cache, the memory cache, and the shared memory cache.
- If the platform supports processes, then you can monitor the disk cache and shared memory cache only.

> 2. How should the mod_cache_requester send the re-request to the main
> server.

You fire off a subrequest to an URL, and throw away the data that comes back. For some example code, look at mod_include.

> 3. Other than these questions, any suggestion/correction is welcome.
> Any pointers to the details of related modules (mod-cache,
> communication between mod-cache and backend server) would be helpful
> too.

Keep in mind that mod_cache is a framework, into which sub-modules are plugged to do the work of the backend caching.

mod_cache_requester would probably be a submodule of mod_cache, using mod_cache provided hooks to query elements in the cache.

Regards,
Graham

--
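Graham's "fire off a subrequest and throw away the data" suggestion might look roughly like the following fragment, using the httpd subrequest API (`ap_sub_req_lookup_uri`, `ap_run_sub_req`, `ap_destroy_sub_req`) the way mod_include does. Treat this as a hedged sketch, not working module code: the function name `freshen_url` is invented, it assumes a live `request_rec *r` to hang the subrequest off, and the exact Cache-Control value a refresher should send was left open in the thread.

```c
#include "httpd.h"
#include "http_request.h"
#include "apr_tables.h"

/* Hypothetical sketch: freshen one cached URL via a subrequest.
 * Only the side effect (the cache entry being replaced) matters;
 * the response body is discarded. */
static apr_status_t freshen_url(const char *url, request_rec *r)
{
    /* Build a subrequest for the cached URL, as mod_include does. */
    request_rec *rr = ap_sub_req_lookup_uri(url, r, NULL);
    if (rr == NULL || rr->status != HTTP_OK) {
        if (rr)
            ap_destroy_sub_req(rr);
        return APR_EGENERAL;
    }

    /* Ask the caching layer to bypass the stored copy, so the
     * backend is hit and the cache entry is replaced. */
    apr_table_setn(rr->headers_in, "Cache-Control", "no-cache");

    /* Run the subrequest and throw away whatever comes back. */
    int status = ap_run_sub_req(rr);
    ap_destroy_sub_req(rr);

    return (status == OK) ? APR_SUCCESS : APR_EGENERAL;
}
```

This fits Graham's framing: the refresher does not touch the cache directly, it just issues a request with the right headers and lets the normal mod_cache machinery do the update.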
Re: mod-cache-requestor plan
Hi Parin

I'm a newbie too, and I was preparing a clear_cache command in order to force cleaning of some desired path. Your idea is very good. My personal considerations are:

1 - you have to avoid multiple requests to the back end: if the page is very frequently clicked, your mechanism can bring the system to perform many regeneration requests to the back end, so maybe the best is to use a queue that purges duplicated requests

2 - if you already know the expiry time of the pages (you know it when the page is inserted into the queue), you could "book" the request for the regeneration instead of waiting for a request close to the expiration

3 - it could be useful to perform regeneration just of "highly clicked" URLs, introducing a heuristic to define what's a "highly clicked" URL, in a way similar to how pages are swapped out of memory in virtual memory management

Bye

Sergio

> Da: Parin Shah <[EMAIL PROTECTED]>
> Data: Sun, 10 Jul 2005 23:24:10 -0500
> A: dev@httpd.apache.org
> Oggetto: mod-cache-requestor plan
>
> Hi All,
>
> I am a newbie. I am going to work on mod-cache and a new module,
> mod-cache-requester, as a part of the SoC program.
>
> A small description of the module is as follows.
>
> When a page expires from the cache, it is removed from the cache and
> thus the next request has to wait until that page is reloaded by the
> back-end server. But if we add one more module which re-requests the
> soon-to-expire pages, such pages won't be removed from the cache,
> which would reduce the response time.
>
> Here is an overview of how I am planning to implement it.
>
> 1. when a page is requested and it exists in the cache, mod_cache
> checks the expiry time of the page.
>
> 2. If (expiry time - current time) < Some_Constant_Value,
> then mod-cache notifies mod_cache_requester about this page.
> This communication between mod_cache and mod_cache_requester should
> incur the least overhead, as this would affect the current request's
> response time.
>
> 3.
> mod_cache_requester will re-request the page which is soon to expire.
> Each such request is done through a separate thread so that multiple
> pages could be re-requested simultaneously.
>
> This request would force the server to reload the content of the page
> into the cache even if it is already there. (this would reset the
> expiry time of the page and thus it would be able to stay in the cache
> for a longer duration.)
>
> Please let me know what you think about this module. Also I have some
> questions, and your help would be really useful.
>
> 1. what would be the best way for communication between mod_cache and
> mod_cache_requester. I believe that keeping mod_cache_requester in a
> separate thread would be the best way.
>
> 2. How should the mod_cache_requester send the re-request to the main
> server. I believe that sending it as if the request had come from some
> client would be the best way to implement it. But we need to attach
> some special status to this request so that cache_lookup is bypassed
> and the output_filter is not added, as we don't need to stream the
> output.
>
> 3. Other than these questions, any suggestion/correction is welcome.
> Any pointers to the details of related modules (mod-cache,
> communication between mod-cache and backend server) would be helpful
> too.
>
> Thanks,
> Parin.