Caching search results
I've got a mod_perl application that's using swish-e. A query from swish may return hundreds of results, but I only display them 20 at a time. There's currently no session control on this application, so when the client asks for the next page (or jumps to page number 12, for example), I have to run the original query again and then extract just the results for the page the client wants to see. There are some basic design problems there, I know.

Anyway, I'd like to avoid the repeated queries in mod_perl, of course. So, in the short term, I was thinking about caching search results (which are just sorted lists of file names) in a simple file-system db -- that is, (carefully) building file names out of the queries and writing them to some directory tree. Then I'd use cron to purge LRU files every so often. I think this approach will work fine in place of a dbm or rdbms approach. So I'm asking for some advice:

- Is there a better way to do this?
- There was some discussion in the past about performance and how many files to put in each directory. Are there some commonly accepted numbers for this?
- For file names, does it make sense to use an MD5 hash of the query string? It would be nice to get an even distribution of files in each directory.
- Can someone offer any help with the locking issues? I was hoping to avoid shared locking during reading -- but maybe I'm worrying too much about the time it takes to ask for a shared lock when reading. I could wait a second for the shared lock, and if I don't get it I'll run the query again. But it seems that if one process creates the file and begins to write without LOCK_EX and then gets blocked, other processes might not see the entire file when reading. Would it be better to avoid the locks and instead use a temp file when creating and then do an (atomic?) rename?

Thanks very much,

Bill Moseley mailto:[EMAIL PROTECTED]
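[Editor's note: the temp-file-plus-atomic-rename scheme Bill asks about at the end is sound and avoids read-side locking entirely. A minimal sketch, in Python for illustration (the application itself is Perl); the cache directory and function names are made up for the example:

```python
import hashlib
import os
import tempfile

# Hypothetical cache location for the sketch.
CACHE_ROOT = os.path.join(tempfile.gettempdir(), "search-cache")

def cache_path(query):
    """Hash the query so file names are filesystem-safe and evenly
    distributed; the first two hex digits bucket entries into up to
    256 subdirectories."""
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    return os.path.join(CACHE_ROOT, digest[:2], digest)

def store_results(query, filenames):
    """Write results to a temp file in the same directory, then rename
    into place.  rename() is atomic on POSIX within one filesystem, so
    readers see either the old complete file or the new complete file,
    never a partial write -- no shared locks needed when reading."""
    path = cache_path(query)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(filenames))
    os.rename(tmp, path)

def load_results(query):
    """Return the cached list of file names, or None on a cache miss."""
    try:
        with open(cache_path(query)) as f:
            return f.read().split("\n")
    except FileNotFoundError:
        return None
```

A cron job can then purge old entries by modification (or access) time without coordinating with the writers.]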
Re: Caching search results
Bill Moseley wrote: Anyway, I'd like to avoid the repeated queries in mod_perl, of course. So, in the short term, I was thinking about caching search results (which are just sorted lists of file names) using a simple file-system db [...] Then I'd use cron to purge LRU files every so often. Always start with CPAN. Try Tie::FileLRUCache or File::Cache for starters. A dbm would be fine too, but more trouble to purge old entries from. - Perrin
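[Editor's note: the cron-driven LRU purge that both modules automate is itself only a few lines. A sketch in Python for illustration, using access time as the recency signal (names and the age threshold are made up):

```python
import os
import time

def purge_cache(root, max_age=3600):
    """Delete cache files that have not been read for max_age seconds.

    Walks the whole cache tree and uses st_atime as the LRU signal;
    intended to run periodically from cron, as in the original post."""
    cutoff = time.time() - max_age
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_atime < cutoff:
                os.remove(path)
```

Note that atime updates can be disabled at the filesystem level (e.g. `noatime`), in which case mtime is the safer field to test.]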
Re: Caching search results
At 10:10 AM 1/8/01 -0800, you wrote: Bill Moseley wrote: [...caching search results using a simple file-system db, purged from cron...] Always start with CPAN. Try Tie::FileLRUCache or File::Cache for starters. A dbm would be fine too, but more trouble to purge old entries from. An RDBMS is not much more trouble to purge, if you have a time-of-last-update field. And if you're ever going to access your cache from multiple servers, you definitely don't want to deal with locking issues for DBM and filesystem based solutions ;=( -Simon - Simon Rosenthal ([EMAIL PROTECTED]) Web Systems Architect Northern Light Technology One Athenaeum Street, Suite 1700, Cambridge, MA 02142 Phone: (617) 621-5296 URL: http://www.northernlight.com "Northern Light - Just what you've been searching for"
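[Editor's note: with a time-of-last-update column, the purge Simon describes collapses to a single DELETE statement, and the database handles concurrent access from multiple servers. A sketch using Python's sqlite3 for illustration (table and column names are made up):

```python
import sqlite3
import time

# Hypothetical cache table: query key, serialized results, last update.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cache (query TEXT PRIMARY KEY, results TEXT, updated REAL)")

def put(query, results):
    """Insert or refresh an entry, stamping it with the current time."""
    conn.execute(
        "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
        (query, results, time.time()))

def purge(max_age):
    """Drop every entry older than max_age seconds -- one statement,
    no file locking, and it works across multiple web servers."""
    conn.execute(
        "DELETE FROM cache WHERE updated < ?", (time.time() - max_age,))
```

The same DELETE works in MySQL or any other RDBMS; only the timestamp arithmetic changes.]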
Re: Caching search results
On Mon, 8 Jan 2001, Perrin Harkins wrote: Bill Moseley wrote: [...caching search results using a simple file-system db, purged from cron...] Always start with CPAN. Try Tie::FileLRUCache or File::Cache for starters. A dbm would be fine too, but more trouble to purge old entries from. You could always have a second dbm file that keeps track of TTL issues for your data keys, so purging would simply be a series of delete calls. Granted, you would have another DBM file to maintain. -- Sander van Zoest [[EMAIL PROTECTED]] Covalent Technologies, Inc. http://www.covalent.net/ (415) 536-5218 http://www.vanzoest.com/sander/
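[Editor's note: a sketch of the two-dbm scheme Sander suggests, in Python for illustration -- one dbm holds the data, a companion dbm maps each key to its expiry time, and purging is just paired deletes (function names are made up):

```python
import dbm
import time

def put(data, ttl, key, value, lifetime):
    """Store a value and record its expiry time in the companion dbm."""
    data[key] = value
    ttl[key] = str(time.time() + lifetime)

def purge(data, ttl):
    """Delete every expired key from both dbm files.

    Note the space freed by the deletes may not be returned to the
    filesystem -- see Perrin's follow-up about trimming dbm files."""
    now = time.time()
    for key in list(ttl.keys()):
        if float(ttl[key]) < now:
            del data[key]
            del ttl[key]
```

The downside, as Sander notes, is keeping the two files in sync: a crash between the two stores leaves a key with data but no TTL, or vice versa.]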
Re: Caching search results
At 02:02 PM 1/8/01 -0800, Sander van Zoest wrote: On Mon, 8 Jan 2001, Simon Rosenthal wrote: an RDBMS is not much more trouble to purge, if you have a time-of-last-update field. And if you're ever going to access your cache from multiple servers, you definitely don't want to deal with locking issues for DBM and filesystem based solutions ;=( RDBMS does bring replication and backup issues. The DBM and FS solutions definitely have their advantages. It would not be too difficult to write a serialized daemon that makes requests over the net to a DBM file. What in your experience makes you pick the overhead of an RDBMS for a simple cache over DBM or FS solutions? We cache user session state (basically using Apache::Session) in a small (maybe 500K records) mysql database, which is accessed by multiple web servers. We made an explicit decision NOT to replicate or back up this database - it's very dynamic, and the only user-visible consequence of a loss of the database would be an unexpected login screen - we felt this was a tradeoff we could live with. We have a hot spare mysql instance which can be brought into service immediately if required. I couldn't see writing a daemon as you suggested offering us any benefits under those circumstances, given that RDBMS access is built into Apache::Session. I would not be as cavalier as this if we were doing anything more than using the RDBMS as a fast cache. With decent hardware (which we have - Sun Enterprise servers with nice fast disks and enough memory) the typical record retrieval time is around 10ms, which even if slow compared to a local FS access is plenty fast enough in the context of the processing we do for dynamic pages. Hope this answers your question. -Simon - Simon Rosenthal ([EMAIL PROTECTED]) Web Systems Architect Northern Light Technology One Athenaeum Street, Suite 1700, Cambridge, MA 02142 Phone: (617) 621-5296 URL: http://www.northernlight.com "Northern Light - Just what you've been searching for"
Re: Caching search results
On Mon, 8 Jan 2001, Simon Rosenthal wrote: an RDBMS is not much more trouble to purge, if you have a time-of-last-update field. And if you're ever going to access your cache from multiple servers, you definitely don't want to deal with locking issues for DBM and filesystem based solutions ;=( RDBMS does bring replication and backup issues. The DBM and FS solutions definitely have their advantages. It would not be too difficult to write a serialized daemon that makes requests over the net to a DBM file. What in your experience makes you pick the overhead of an RDBMS for a simple cache over DBM or FS solutions? -- Sander van Zoest [[EMAIL PROTECTED]] Covalent Technologies, Inc. http://www.covalent.net/ (415) 536-5218 http://www.vanzoest.com/sander/
Re: Caching search results
On Mon, Jan 08, 2001 at 10:10:25AM -0800, Perrin Harkins wrote: Always start with CPAN. Try Tie::FileLRUCache or File::Cache for starters. A dbm would be fine too, but more trouble to purge old entries from. If you find that File::Cache works for you, then you may also want to check out the simplified and improved version in the Avacet code, which additionally offers a unified service model for mod_perl applications. Services are available for templates (either Embperl or Template Toolkit), XML-based configuration, object caching, connecting to the Avacet application engine, standardized error handling, dynamically dispatching requests to modules, and many other things. -DeWitt
Re: Caching search results
On Mon, 8 Jan 2001, Sander van Zoest wrote: starters. A dbm would be fine too, but more trouble to purge old entries from. You could always have a second dbm file that can keep track of TTL issues of your data keys, so it would simply be a series of delete calls. Granted you would have another DBM file to maintain. I find it kind of painful to trim dbm files, because most implementations don't relinquish disk space when you delete entries. You end up having to actually make a new dbm file with the "good" contents copied over to it in order to slim it down. - Perrin
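[Editor's note: the copy-the-good-contents workaround Perrin describes is mechanical. A sketch in Python for illustration, using dbm.dumb since it is always available (real deployments of the era used gdbm; the function name is made up):

```python
import dbm.dumb as dbm
import os

def compact(path):
    """Rebuild the dbm at `path` into `path + ".new"`, reclaiming the
    space left behind by deleted keys.

    Most dbm implementations never shrink the file on delete, so the
    only portable fix is to copy the live entries into a fresh file;
    the caller can then swap the new file into place."""
    old = dbm.open(path, "r")
    new = dbm.open(path + ".new", "n")  # "n" always creates a fresh db
    for key in old.keys():
        new[key] = old[key]
    old.close()
    new.close()
```

gdbm users can skip the copy entirely via its reorganize call, as Sander's follow-up shows.]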
Re: Caching search results
On Mon, 8 Jan 2001, Simon Rosenthal wrote: I couldn't see writing a daemon as you suggested offering us any benefits under those circumstances, given that RDBMS access is built into Apache::Session. No, in your case I do not see a reason behind it either. ;-) Again this shows that it all depends on the requirements and things you are willing to sacrifice. Cheers, -- Sander van Zoest [[EMAIL PROTECTED]] Covalent Technologies, Inc. http://www.covalent.net/ (415) 536-5218 http://www.vanzoest.com/sander/
Re: Caching search results
On Mon, 8 Jan 2001, Perrin Harkins wrote: I find it kind of painful to trim dbm files, because most implementations don't relinquish disk space when you delete entries. You end up having to actually make a new dbm file with the "good" contents copied over to it in order to slim it down. Yeah, this is true. Some DBMs have special routines to fix these issues. If you are using gdbm, for example, you could use the gdbm_reorganize call to clean up those issues. Just some quick pseudo code (I don't have a real example handy):

    use GDBM_File;
    my %hash;
    my $gdbm = tie %hash, 'GDBM_File', 'file.gdbm', GDBM_WRCREAT|GDBM_FAST, 0640
        or die "$!";
    $gdbm->reorganize;

That definitely helps a lot. -- Sander van Zoest [[EMAIL PROTECTED]] Covalent Technologies, Inc. http://www.covalent.net/ (415) 536-5218 http://www.vanzoest.com/sander/
Re: Caching search results
Hi Guys, At the risk of getting shot down in flames again, do you think you could take this off-list, guys? I can't seem to delete the messages as fast as they're coming in... :) 73, Ged.
Re: Caching search results
On Mon, 8 Jan 2001, G.W. Haywood wrote: At the risk of getting shot down in flames again, do you think you could take this off-list guys? I guess this could be moved to the scalable list ([EMAIL PROTECTED]), or in private since this isn't really on the topic of modperl anymore. Cheers, -- Sander van Zoest [[EMAIL PROTECTED]] Covalent Technologies, Inc. http://www.covalent.net/ (415) 536-5218 http://www.vanzoest.com/sander/