Hi, quite a few times I have ran into problems where I would like to read data from a data source too huge to keep in memory. Not only this, but retrieving the data from that source is a bit slow. I wish to work with only small chunks of the data in memory at any time.
Now, sometimes the same data is retrived multiple times calling for an in-RAM caching mechanism. To make it simple, consider only the case where we read data, that is, no writing/updating. First of all, the file cache (and the swap) of the OS is not good enough; the data is spread out on different files and within each file the order of the data is such that we can do better than the file cache. The data structures that I'm targetting are simple vectors (including lists), but also matrices (where each column is one data element; I think it's more optimal to work with columns because of the way the matrices are represented in memory). I've tried to "scribble" down some thoughts about how such a cache algorithm can be designed. Before I get too serious about this, I would like to check if someone else have done a similar thing or know of references to papers/algorithms that implement this. Here are my notes/outline. I would appreciate any kind of feedback. Setup: A large dataset lives on a persistent and slow media. The data is too big to keep in memory forcing us to access the data from the persistent media. The data elements are indexed by i=1,2,...,I. The same data elements might be accessed multiple times. Using a cache we wish to minimize the number of times a random elements is read of the persistent media. Requirement: Thus, the cache, which can hold C elements, have to keep track of what elements it holds. We will use a caching strategy where we assume that the latest read element is most likely element to be read again. Thus, we wish to have a cache that keeps most recent elements and pushes out old elements. Design: The indices of the data elements currently in the cache are stored in vector J of length C, i.e. J = (J_1, ..., J_C), where J_c \in [1,I], c = 1,...,C. The elements in J are unique, that is J_k != J_l for all k != l. Assume D data elements are request to be read. Let these D elements be specified by vector K = (K_1, ..., K_D) where K_d \in [1,I]. Note that some of these elements may be duplicated. API: Say we want to read indices K. Without cache we would do something like: y <- readData(db, K) with cache y <- readDataViaCache(db, K) where 'db' is the "database". The reference to the actual cache is known by the 'db' object. Algorithm: * Create the return structure. Assume a vector for simplicity. x <- vector(length=D) * Identify cache hits, that is what data elements are already in the cache. Pull out that data from the cache and the rest from the main data source. This can be done as follows: c <- match(K, J) # Gets the cache position or NA inCache <- !is.na(c) x[inCache] <- cacheData(J[c[inCache]]) x[!inCache] <- slowData(K[!inCache]) Three cases to worry/think about: Case 1) all(inCache) is TRUE Case 2) any(inCache) is FALSE Case 3) otherwise I leave this for now, but the below can probably optimized depending which case we have. * Update the cache. We have to be careful not to store duplicated elements in the cache. We also want to keep track of how "recent" a certain data element is. One way to do this is to order the cache by age, but since we want to minimize the amount of reshuffling in the cache, it is better to keep track of the age in a separate vector t = (t_1, ..., t_C). The age of the requested data elements are denoted by u = (u_1, ..., u_D) such that u_D = -D, u_{D-1} = -(D-1), ..., u_1 = -1. Thus, the last element requested is the most recent one (because its age value is smaller than any other age values). The age of the elements in the cache are all t_c > 0. This makes it easy to differentiate u_d and t_c. Update the age of the elements in the cache that was read from the cache. Do it according to their age in the requested vector, that is: t[c[inCache]] <- u[inCache] There are E = sum(!inCache) elements that were read from the main data source and that are not in the cache. These elments indexed by L=K[!inCache] have ages v = u[!inCache]. Note that we might have duplicated elements here. Me might want to remove duplicates already here by keeping the most recent duplicated element; nondup <- rev(!duplicated(rev(v))) v <- v[nondup] L <- L[nondup] Q <- which(!inCache)[nondup] # The k indices of these We have to worry about the mapping back later (which I actually don't think is a problem). Next, identify the set of these that are more recent that the ones in the cache t. To do this: w <- sort(c(v, t)) # partial sorting up to C will be enough! wMax <- w[C] # This is max age that "fits" the cache toCache <- (v <= wMax) Q <- Q[toCache] Note that this identifies M=sum(toCache) elements that should be put in the cache. Three cases exists: Case M == 0: The cache don't have to be updated. Just update t and we're done; # TO DO: How should t be updated?!? ... t <- order(t) - (C+1) Case M == C: The complete cache have to be replaced. The values to put in the cache are: cacheData <- x[Q] J <- K[Q] t <- v[toCache] t <- order(t) - (C+1) Case M < C: Only parts of the cache have to be replaced. notToUpdate <- c[inCache] toUpdate <- setdiff(1:C, notToUpdate) cacheData[toUpdate] <- x[Q] J[toUpdate] <- K[Q] t[toUpdate] <- v[toCache] t <- order(t) - (C+1) Cheers /Henrik ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel