Nathan Nobbe,

> what i dont understand is when the data profile of a particular table changes, and the variable in the memcache that represents that table becomes stale, how can the variable be updated efficiently?

This post is rather lengthy, so I will answer your question directly first and then provide the details, so you don't have to read the whole thing unless you want to.

First, let me warn you not to use triggers for this; they will hammer your performance.

Some database systems, like Oracle, write all changes to log files that can be read without running queries against the database. You can have an object in a separate thread in a background process which reads the log files. Or you can put operating-system-level events on the log files and implement event handlers so that your cache managers are notified when a log file changes; this is how I do it for Oracle.

The other option is to poll the database on some period using queries. Polling will only work if you take steps to cut down what has to be looked at, and if the polling query only gets the data which changed and nothing else. For this to be efficient you have to have a way to first get the primary keys of what changed, then run a second query to get only those changes. The query which gets the primary keys has to be simple. I describe what I think is required to write such an intelligent polling system below (there is also a rough sketch of such a polling loop a little further down). I call this intelligent caching and cache forwarding. Please note that even when using cache forwarding, there still must exist, at the point where the cache touches the database, a process which examines the database (polling) looking for changes. The key is to make this polling mechanism very efficient. The cache managers have to run in separate threads in a background process.

There are a couple of concepts that help with managing caches. The way I do it is to use the cache as a buffering mechanism so that application code never goes directly to the RDBMS; it always goes to the cache. I have cache managers which run in a separate thread in a background process and sync the caches with the RDBMS and with other caches. The cache managers add overhead to the entire system, but the end user does not feel this overhead because the end user is never waiting on the RDBMS; the end user always goes directly to the cache.

Building a caching system which ensures that the data is only stale within acceptable limits is not a trivial task. It requires modeling the characteristics of the data at a level that no modeling tool I know of supports. You then have to build processors that can move the data around depending on those characteristics, liveness being the most important one. The characteristics that matter are: liveness, concurrency, distribution, notification, and mutability. Describing these characteristics for each data element makes it possible to build a very efficient polling mechanism, because the poller will not have to look at everything in the database and will not have to move everything everywhere.

Liveness describes how stale the data can be and still be acceptable. Liveness is expressed in milliseconds; a value of 0 means the data has to be realtime. By stale I mean that changes to the cache have not yet been committed to the underlying database, or changes to the underlying database have not yet been reflected in this cache. I call the process of reconciling the cache with the underlying database synchronization.
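To make the two-step poll concrete, here is a rough sketch of a cache-manager polling loop (Java, since my own implementation is JDBC-based). It is only an illustration, not my actual code: the change_log table is a stand-in for the timestamp table I describe further down, the orders columns and the JDBC URL are made up, and a plain in-memory map stands in for the memcached client.

import java.sql.*;
import java.util.*;
import java.util.concurrent.*;

// Rough sketch of a cache-manager polling loop: step 1 finds the primary keys
// that changed since the last poll, step 2 fetches only those rows and
// refreshes the cache. Table/column names and the JDBC URL are made up.
public class OrderCachePoller {
    private final Map<Long, Map<String, Object>> cache = new ConcurrentHashMap<>();
    private Timestamp lastPoll = new Timestamp(0);

    public void start() {
        Executors.newSingleThreadScheduledExecutor()
                 .scheduleAtFixedRate(this::pollOnce, 0, 5, TimeUnit.SECONDS);
    }

    private void pollOnce() {
        try (Connection con = DriverManager.getConnection("jdbc:yourdb://dbhost/app")) {
            // Step 1: a cheap query that returns only the keys of changed rows.
            List<Long> changedKeys = new ArrayList<>();
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT row_pk FROM change_log WHERE table_name = 'orders' AND changed_at > ?")) {
                ps.setTimestamp(1, lastPoll);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) changedKeys.add(rs.getLong(1));
                }
            }
            if (changedKeys.isEmpty()) return;

            // Step 2: fetch only the changed rows and overwrite the stale cache entries.
            StringJoiner in = new StringJoiner(",", "(", ")");
            changedKeys.forEach(k -> in.add(k.toString()));
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT order_id, status, total FROM orders WHERE order_id IN " + in)) {
                while (rs.next()) {
                    Map<String, Object> row = new HashMap<>();
                    row.put("status", rs.getString("status"));
                    row.put("total", rs.getBigDecimal("total"));
                    cache.put(rs.getLong("order_id"), row);
                }
            }
            lastPoll = new Timestamp(System.currentTimeMillis());
        } catch (SQLException e) {
            e.printStackTrace(); // a real cache manager would log and retry
        }
    }
}

The point of the two steps is that the poll only ever touches the small key-lookup table until something actually changes, so it stays cheap even when the cached tables are large.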
Synchronization also describes updating one cache with the changes in another cache. I have seen very few data elements which had a liveness of 0 (meaning the data had to be realtime).

Concurrency describes the likelihood that two or more users are updating this specific data at the same time. Concurrency applies at the table level, the row level, and the column level. I added the column level after seeing many cases where, for a given row, only certain columns were ever updated. Concurrency can be further described by listing the user roles which are likely to be concurrent with each other, or for which users in that role are likely to be concurrent with each other. Let's say that you have a bank account and you are updating your home address online. The likelihood that another user is updating your home address is zero (well, it had better be zero or the bank has a big problem). The likelihood that a bank employee is updating your home address is small, but not zero. And I don't have to worry about all bank employees, because the bank employees authorized to make this change will be playing certain roles. The likelihood that another person who is authorized to use your account is updating your home address is also small, but not zero.

Knowing not just that concurrency is likely, but also which users are likely to be performing actions that are concurrent with each other, opens up some interesting possibilities. If I have marked a table as being concurrent and then defined the roles which are concurrent, I can show the users who have those roles what all the users in their role are doing in the cases where there are potential concurrency conflicts. I can also design my applications so that changes made by one user can be broadcast to another user without needing to broadcast those changes to everyone. Back to the bank example: I only have to worry about concurrency if certain bank employees are online or if a user authorized to use your account is online.

Distribution describes the necessity of pushing the contents of one cache to caches in other tiers of the N-tier architecture. Distribution also describes the tiers, what caches or persistent stores live on each tier, and how the caches and persistent stores relate to each other.

Notification answers the question: if something does change, when will a concurrent user care? The answer is either while committing updates to the underlying database, or while viewing data. I have to use an example to explain this. Say you are on a web page where the results of a query are being displayed. This web page is not realtime, because at no point can you claim that what is on the web page is a 100% reflection of what is in the database. Now, let's say that you update one row of data on the web page. You will care, while committing your changes, whether that same row was updated in the meantime by another user. Given two users A and B: if A runs a query, then B comes along, updates one of the rows A is looking at, and commits the change, and then A updates that same row, A will care about B's update at the time A commits A's changes. In this case what I do is display a page which says "...saving this data will overwrite changes made by B"; I then show the original data, the changes made by B, and the changes made by A, and ask A what to do. The options are: revert to the original, keep B's changes and discard A's changes, or keep A's changes and discard B's changes. (A rough sketch of this kind of commit-time check is below.)
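Here is a rough sketch of that commit-time check. Again, this is only an illustration: the customer table, the home_address column, and the connection URL are made up, and a real implementation would fold this into whatever data-access layer you already have. The idea is to re-read the row inside the update transaction and compare it with the value the user originally loaded; if it differs, you have B's change in hand and can show all three versions to A.

import java.sql.*;
import java.util.Objects;
import java.util.Optional;

// Sketch of a commit-time conflict check: before writing user A's change we
// re-read the row and compare it to the value A originally loaded. If someone
// else (user B) changed it in the meantime, we surface a conflict instead of
// silently overwriting. Table/column names and the URL are illustrative only.
public class AddressCommit {

    public static final class Conflict {
        public final String original, theirs, yours;
        Conflict(String original, String theirs, String yours) {
            this.original = original; this.theirs = theirs; this.yours = yours;
        }
    }

    /** Returns empty if the update was applied, or a Conflict describing the clash. */
    public Optional<Conflict> saveHomeAddress(long customerId,
                                              String originallyLoaded,
                                              String newValue) throws SQLException {
        try (Connection con = DriverManager.getConnection("jdbc:yourdb://dbhost/bank")) {
            con.setAutoCommit(false);

            // Re-read the current value, locking the row while we decide.
            String current;
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT home_address FROM customer WHERE customer_id = ? FOR UPDATE")) {
                ps.setLong(1, customerId);
                try (ResultSet rs = ps.executeQuery()) {
                    if (!rs.next()) throw new SQLException("no such customer: " + customerId);
                    current = rs.getString(1);
                }
            }

            // Someone else changed the row since this user loaded it: report the conflict.
            if (!Objects.equals(current, originallyLoaded)) {
                con.rollback();
                return Optional.of(new Conflict(originallyLoaded, current, newValue));
            }

            try (PreparedStatement ps = con.prepareStatement(
                    "UPDATE customer SET home_address = ? WHERE customer_id = ?")) {
                ps.setString(1, newValue);
                ps.setLong(2, customerId);
                ps.executeUpdate();
            }
            con.commit();
            return Optional.empty();
        }
    }
}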
In this case A cared about B's updates when A was committing A's own changes. A database-level transaction does not help here, because no programmer in their right mind would begin a transaction on a query just because the user MIGHT do an update at some undetermined time in the future. You would be surprised how many people will say that starting a transaction when A executes the first query is the answer to this problem.

The second case, where a user cares about changes to the underlying data while viewing the data, is far more problematic. Let's say that your job is to ship products that sales people are selling on the phone. A salesperson creates an order, you ship the order. In front of you is a screen showing all the orders that need to be shipped. Let's say further that this product is very time sensitive; not shipping a placed order is a big deal, let's say it will cost you your job. In this case you care if the Order table is updated at the time you are viewing the Order data, so the screen you are looking at needs to be refreshed as updates to the underlying Order table occur. In this case the programmer has no choice but to use a push model. Ouch. I have seen this case in logistics management systems and order fulfillment systems, but I think it is the exception rather than the rule. In all my years of building business applications I have very rarely encountered data which truly had to be live. Usually a lag time is acceptable; the length of the acceptable lag depends on the business processes that the data is used in.

Mutability describes the lifecycle of the data. Mutability applies to both tables and rows. Some tables, once populated, never change. Some tables are only changed by inserts; once inserted, a row never changes. Some tables have both inserts and row updates. Mutability describes not just what kind of changes are made to tables and rows, but how often the changes occur and how likely they are to occur. Knowing this information is crucial in building an intelligent caching system.

Once the characteristics of the underlying database's tables are fully defined, it is quite easy to build an intelligent cache system based on an efficient polling mechanism. I use a cache forwarding mechanism. I always have a cache at each tier in the N-tier architecture. There is one cache on the same machine where the RDBMS lives; this cache is always a copy of what is in the database. A cache manager runs queries in a separate thread in a background process looking for changes. The queries are always run against a special timestamp table I created to find out what changed; then, if there are any changes, a second query is run to actually get the changes. This cache manager also updates the RDBMS with changes pushed to this cache. There are also caches at the app server tier and the web server tier. For each cache, a cache manager runs in the background which ensures that all changes to the cache are pushed to all other caches.

I also have a cache in the web browser (not memcached, just javascript arrays) which is populated with data that is related to the user. Once a user logs in and I have the user ID, I prepopulate the cache with all rows related to that user ID. For example, on a banking application I worked on, once the customer logged in I used the customer's social security number to select all rows which were related either directly or indirectly to that social security number. This query returned all the customer info plus all the account info for that customer. (A rough sketch of this kind of prepopulation is below.)
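Here is a rough sketch of that prepopulation step. The schema, the join, and the plain HashMap standing in for the real cache are all invented for illustration; the point is simply that one query at login fills the cache, keyed by "table:primary key", so later lookups never have to touch the database.

import java.sql.*;
import java.util.*;

// Sketch of prepopulating a per-user cache right after login: one query pulls
// everything related to the user, and the rows are stored under "table:pk"
// keys. The schema and the in-memory map are illustrative stand-ins.
public class UserCachePreloader {

    public Map<String, Map<String, Object>> preload(long customerId) throws SQLException {
        Map<String, Map<String, Object>> cache = new HashMap<>();
        String sql =
            "SELECT c.customer_id, c.name, a.account_id, a.balance " +
            "FROM customer c JOIN account a ON a.customer_id = c.customer_id " +
            "WHERE c.customer_id = ?";
        try (Connection con = DriverManager.getConnection("jdbc:yourdb://dbhost/bank");
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, customerId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Customer row (the same values repeat per account; overwriting is harmless).
                    Map<String, Object> cust = new HashMap<>();
                    cust.put("name", rs.getString("name"));
                    cache.put("customer:" + rs.getLong("customer_id"), cust);

                    // One entry per account row.
                    Map<String, Object> acct = new HashMap<>();
                    acct.put("balance", rs.getBigDecimal("balance"));
                    acct.put("customer_id", rs.getLong("customer_id"));
                    cache.put("account:" + rs.getLong("account_id"), acct);
                }
            }
        }
        return cache;
    }
}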
The cache then fulfills queries without needing to go to the database.

Triggers are not a good idea because they will significantly degrade the performance of your database. You can help yourself a lot with the design of your database. There are two rules that I always follow when designing a database: 1) every table has a sequential integer as its primary key, a number that doesn't mean anything to anyone but me; I usually call this column tablename_id. 2) I always have an intersection table named timestamp_table_row_user which has five columns: table name, primary key, username, timestamp, and timestamp_action. Timestamp_action can be one of update, insert, or delete. I also create an archive table for this table and periodically sweep everything over a certain age into the archive table.

Some people prefer to add a timestamp column to existing tables. I don't like this idea, because you need multiple timestamp columns (one each for insert, update, and delete) and you cannot re-create a change history, because the update timestamp column gets overwritten every time. Also, I don't like having to modify every table schema with what is essentially housekeeping metadata. Using an intersection table and custom drivers makes it possible to add this approach to existing databases and existing applications without changing them.

Using the timestamp table makes it very simple to determine what has changed in the database and, more importantly, how it changed. Instead of using triggers to populate the timestamp table, I wrote my own JDBC drivers which take any SQL statement being executed and create a second SQL statement which makes an entry in the timestamp table describing what the original SQL statement did.

Using sequential integers for primary keys helps with identifying new rows added to a table using a simpler mechanism than a timestamp. All you have to do is obtain max(primarykey), store it, then compare that to the next max(primarykey). Getting the new rows is simple: just select * where primarykey > the stored max(primarykey), or you can build a more specific query using an IN clause, as in SELECT * FROM tablename WHERE primarykey IN (key1, key2, key3...), or you can use BETWEEN with the stored max(primarykey) + 1 and the latest max(primarykey). Obtaining updated rows requires a timestamp. (A rough sketch of the timestamp table and the new-row check is below.)
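Here is a rough sketch of these ideas: the five-column intersection table, the kind of companion INSERT the custom driver generates for each statement it sees, and the max(primarykey) check for new rows. Column types, the orders example, and the names are illustrative only; adjust them for your own RDBMS.

import java.sql.*;
import java.util.*;

// Sketch of the two ideas above: an intersection table that records every
// change, and the max(primary key) trick for spotting newly inserted rows.
// DDL, names, and the example table are illustrative, not a real schema.
public class ChangeTracking {

    // Roughly the five-column intersection table described above.
    static final String TIMESTAMP_TABLE_DDL =
        "CREATE TABLE timestamp_table_row_user (" +
        "  table_name       VARCHAR(64) NOT NULL," +
        "  row_pk           BIGINT      NOT NULL," +
        "  username         VARCHAR(64) NOT NULL," +
        "  change_timestamp TIMESTAMP   NOT NULL," +
        "  timestamp_action VARCHAR(6)  NOT NULL" + // 'insert', 'update' or 'delete'
        ")";

    /** The kind of companion statement the custom driver writes for each statement it sees. */
    public void recordChange(Connection con, String table, long pk,
                             String user, String action) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO timestamp_table_row_user " +
                "(table_name, row_pk, username, change_timestamp, timestamp_action) " +
                "VALUES (?, ?, ?, CURRENT_TIMESTAMP, ?)")) {
            ps.setString(1, table);
            ps.setLong(2, pk);
            ps.setString(3, user);
            ps.setString(4, action);
            ps.executeUpdate();
        }
    }

    /** New rows since the last check, using the stored max(primarykey). */
    public List<Long> findNewOrderIds(Connection con, long lastSeenMaxId) throws SQLException {
        List<Long> newIds = new ArrayList<>();
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT order_id FROM orders WHERE order_id > ? ORDER BY order_id")) {
            ps.setLong(1, lastSeenMaxId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) newIds.add(rs.getLong(1));
            }
        }
        return newIds;
    }
}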
I hope this helps, feel free to ask any questions.

Steve Donelow
President
Object Works, Inc.
