On 10/13/09 15:19, James Carlson wrote:
> Lori Alt wrote:
>
>> On 10/13/09 13:36, Nicolas Williams wrote:
>>
>>> Throwing away of cached blocks probably needs to be done synchronously
>>> by both ends, or else the receiver has to at least keep an index of
>>> block checksum to block pointer for all previously seen blocks in the
>>> stream. Synchronizing the caches may require additional records in the
>>> stream. But I agree with you: it should be possible to bound the memory
>>> usage of zfs send dedup.
>>>
>> Yes, the memory usage can be bounded. It was our plan at this time
>> however to regard that as an implementation detail, not part of the
>> interface to be approved by this case.
>>
> It becomes part of the interface if (a) the sender needs to notify the
> recipient of table flushes (as Nico reasonably suggested) or potentially
> (b) it becomes part of the usage considerations for users. There's
> actually a good bit of prior art to draw on here from other stream
> compression schemes.

I missed Nico's suggestion about notification of the recipient for cache
flushes. Actually, there is no need for a cache on the receive side. Or
more exactly, the dataset hierarchy constructed by the receive IS the
cache. The new write-by-reference record in the send stream essentially
sends this information:
* identification of where the data can be found already on the target
  system (i.e. the object set, the object, and the offset and length
  within the object)

* the location where the data is to be written (object set, object,
  and offset).

(A rough, purely illustrative sketch of this record, and of a bounded
send-side table, appears at the end of this message.)

During the receive, all datasets being received are "held" and not
deletable until the receive completes, so the referenced data is
guaranteed to be present. There is no need to maintain an index of
block checksum to block pointer on the receive side. There IS a need to
maintain such an index on the send side, which is where memory
management becomes an issue.

As for the send-side memory management, I agree that we could establish
a public interface by which a caller can constrain the memory to be
used. However, we were thinking that if such an interface turns out to
be necessary, we could define it and add it later, once we gain more
experience with how over-the-wire dedup gets used in practice.

I don't know whether the kinds of on-the-fly compression disabling that
James mentions are relevant for dedup'ing. For example, one of my test
cases is a hierarchy of datasets containing Solaris development
workspaces. You can go for a long time without finding more than a
handful of duplicate blocks, but once you've finished with one
development workspace and started on the next, you start getting lots
of duplicates, because now you're seeing identical copies of the files
you processed in the first dataset. This is just one kind of data, but
in general it's hard to predict at what point in the stream you're
going to start getting dedup'ing bang for your memory-hogging buck.

Lori
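
P.S. Here is the sketch referred to above. It is purely illustrative:
the structure, field names, and table below are made up to show (a) the
information a write-by-reference record carries and (b) one way the
send side could bound the memory used by its checksum-to-block-location
index. None of this is the actual send-stream format or ZFS code. The
point of the bounded table is that losing an entry only costs a missed
dedup opportunity: the duplicate block just gets sent as an ordinary
write, and the receiver never needs to be notified.

/*
 * Hypothetical sketch only: NOT the actual ZFS send-stream structures
 * or routines.  Names and layout are invented for illustration.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Information a write-by-reference record carries. */
typedef struct wbr_record {
	/* where the data can already be found on the target system */
	uint64_t ref_objset, ref_object, ref_offset, ref_length;
	/* where the data is to be written */
	uint64_t dst_objset, dst_object, dst_offset;
} wbr_record_t;

/*
 * Send-side only: a bounded checksum -> block-location table.  If an
 * entry is overwritten, later duplicates of that block are simply sent
 * as ordinary writes; correctness is unaffected and the receiver need
 * not be told, since its "cache" is just the data already written into
 * the held datasets.
 */
#define	WBR_SLOTS	4096		/* hard cap on send-side memory */

typedef struct wbr_entry {
	uint8_t  cksum[32];		/* e.g. SHA-256 of the block */
	int      valid;
	uint64_t objset, object, offset, length;
} wbr_entry_t;

static wbr_entry_t wbr_table[WBR_SLOTS];

/*
 * For one block about to be sent: either fill in a write-by-reference
 * record (duplicate still present in the table; returns 1), or record
 * this copy's location and let the caller emit an ordinary write
 * record (returns 0).
 */
static int
wbr_check_block(const uint8_t cksum[32], uint64_t objset, uint64_t object,
    uint64_t offset, uint64_t length, wbr_record_t *rec)
{
	uint32_t slot;

	memcpy(&slot, cksum, sizeof (slot));	/* trivial hash: low bits */
	wbr_entry_t *e = &wbr_table[slot % WBR_SLOTS];

	if (e->valid && memcmp(e->cksum, cksum, sizeof (e->cksum)) == 0) {
		rec->ref_objset = e->objset;
		rec->ref_object = e->object;
		rec->ref_offset = e->offset;
		rec->ref_length = e->length;
		rec->dst_objset = objset;
		rec->dst_object = object;
		rec->dst_offset = offset;
		return (1);
	}

	/* not seen before (or slot reused): remember this copy */
	memcpy(e->cksum, cksum, sizeof (e->cksum));
	e->valid = 1;
	e->objset = objset;
	e->object = object;
	e->offset = offset;
	e->length = length;
	return (0);
}

int
main(void)
{
	uint8_t cksum[32] = { 0xab };	/* pretend checksum of a block */
	wbr_record_t rec;

	/* first time seen: ordinary write, location remembered */
	printf("dup? %d\n", wbr_check_block(cksum, 1, 10, 0, 131072, &rec));
	/* same block, different destination: write-by-reference */
	printf("dup? %d\n", wbr_check_block(cksum, 2, 20, 4096, 131072, &rec));
	return (0);
}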