RE: The design of the eviction improvement
On Wed, 22 Jul 2015, Allen Samuels wrote:
> I'm very concerned about designing around the assumption that objects are
> ~1MB in size. That's probably a good assumption for block- and
> HDFS-dominated systems, but likely a very poor assumption about many
> object- and file-dominated systems.
>
> If I understand the proposals that have been discussed, each of them
> assumes an in-memory data structure with an entry per object (the exact
> size of the entry varies with the different proposals). Under that
> assumption, I have another concern, which is the lack of graceful
> degradation as the object counts grow and the in-memory data structures
> get larger. Everything seems fine until just a few objects get added, then
> the system starts to page and performance drops dramatically, likely to
> the point where Linux will start killing OSDs.
>
> What's really needed is some way to extend the lists into storage that
> doesn't cause a zillion I/O operations. I have some vague idea that a data
> structure like the LSM mechanism ought to be able to accomplish what we
> want: some amount of the data structure (the most likely to be used) is
> held in DRAM [and backed to storage for restart], and the least likely to
> be used is flushed to storage with some mechanism that allows batched
> updates.

How about this: the basic mapping we want is object -> atime. We keep a simple LRU of the top N objects in memory with the object -> atime values. When an object is accessed, it is moved (or added) to the top of the list. Periodically, or when the LRU size reaches N * (1.x), we flush:

- write the top N items to a compact object that can be quickly loaded
- write the records for the oldest items (N .. N*1.x) to leveldb/rocksdb as simple object -> atime entries

When the agent runs, we just walk across that key range of the db the same way we currently enumerate objects. For each record we use either the stored atime or the value in the in-memory LRU (it'll need to be dual-indexed, by both a list and a hash map), whichever is newer. We can use the same histogram estimation approach we use now to determine whether the object in question is below the flush/evict threshold. The LSM does the work of sorting/compacting the atime info, while we avoid touching it at all for the hottest objects to keep the amount of work it has to do in check.

sage
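A minimal sketch of the LRU-plus-LSM idea above (names are hypothetical; the real implementation would live in the OSD's tiering agent and flush to leveldb/rocksdb, not a Python dict):

```python
from collections import OrderedDict

class AtimeLRU:
    """Top-N LRU of object -> atime. An OrderedDict gives both indexes at
    once: hash-map lookup by object name and recency ordering. Entries
    beyond N are flushed to a (mock) LSM store in one batch."""

    def __init__(self, n, slack=0.25, db=None):
        self.n = n                              # hot entries kept in memory
        self.high_water = int(n * (1 + slack))  # the N * (1.x) trigger
        self.lru = OrderedDict()                # object -> atime, newest last
        self.db = db if db is not None else {}  # stands in for leveldb/rocksdb

    def touch(self, obj, atime):
        # Move-or-add the object to the top (most recent end) of the list.
        self.lru.pop(obj, None)
        self.lru[obj] = atime
        if len(self.lru) > self.high_water:
            self.flush()

    def flush(self):
        # Push the oldest (N .. N*1.x) records to the db in one batch;
        # the LSM then does the sorting/compacting work.
        while len(self.lru) > self.n:
            obj, atime = self.lru.popitem(last=False)  # oldest first
            self.db[obj] = atime

    def effective_atime(self, obj):
        # Agent-side rule: prefer the in-memory value, which is always at
        # least as new as anything flushed earlier.
        return self.lru.get(obj, self.db.get(obj))
```

With N=2 and 50% slack, touching four distinct objects flushes the two oldest to the db while the two hottest never touch storage at all.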
RE: The design of the eviction improvement
I'm very concerned about designing around the assumption that objects are ~1MB in size. That's probably a good assumption for block- and HDFS-dominated systems, but likely a very poor assumption about many object- and file-dominated systems.

If I understand the proposals that have been discussed, each of them assumes an in-memory data structure with an entry per object (the exact size of the entry varies with the different proposals). Under that assumption, I have another concern, which is the lack of graceful degradation as the object counts grow and the in-memory data structures get larger. Everything seems fine until just a few objects get added, then the system starts to page and performance drops dramatically, likely to the point where Linux will start killing OSDs.

What's really needed is some way to extend the lists into storage that doesn't cause a zillion I/O operations. I have some vague idea that a data structure like the LSM mechanism ought to be able to accomplish what we want: some amount of the data structure (the most likely to be used) is held in DRAM [and backed to storage for restart], and the least likely to be used is flushed to storage with some mechanism that allows batched updates.

Allen Samuels
Software Architect, Systems and Software Solutions
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samu...@sandisk.com
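Allen's scaling concern can be made concrete with some back-of-the-envelope arithmetic (illustrative numbers only, using the ~24-bytes-per-entry compact layout discussed elsewhere in the thread):

```python
# Per-object in-memory cost under the compact "hash pair + links" layout:
# 2 u32 hashes + prev/next links of (slab, position) u32 pairs
# = 6 * 4 = 24 bytes per entry.
BYTES_PER_ENTRY = 6 * 4

def list_footprint_mb(total_bytes, avg_object_size):
    """In-memory list size (MB) for a tier holding `total_bytes` of data
    made of objects averaging `avg_object_size` bytes."""
    n_objects = total_bytes // avg_object_size
    return n_objects * BYTES_PER_ENTRY / 2**20

TIB = 2**40
# ~1 MiB objects (RBD/HDFS-like): a modest footprint for a 10 TiB tier.
print(round(list_footprint_mb(10 * TIB, 2**20)))   # -> 240 (MB)
# ~4 KiB objects (small-file workloads): 256x larger for the same tier.
print(round(list_footprint_mb(10 * TIB, 4096)))    # -> 61440 (MB)
```

The same data volume shifts the list from a few hundred MB to tens of GB purely as a function of average object size, which is exactly the cliff Allen is pointing at.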
Re: The design of the eviction improvement
Hi,

----- Allen Samuels <allen.samu...@sandisk.com> wrote:
> I'm very concerned about designing around the assumption that objects are
> ~1MB in size. That's probably a good assumption for block- and
> HDFS-dominated systems, but likely a very poor assumption about many
> object- and file-dominated systems.

++

> Under that assumption, I have another concern, which is the lack of
> graceful degradation as the object counts grow and the in-memory data
> structures get larger. Everything seems fine until just a few objects get
> added, then the system starts to page and performance drops dramatically,
> likely to the point where Linux will start killing OSDs.

I'm not clear why that needs to be the case (though I don't think it matters just now whether I do; I was just letting folks know that we have MQ implementation(s)). What you're describing seems consistent with the model Sage and Greg, at least, are describing.

Matt
RE: The design of the eviction improvement
Don't we need to double-index the data structure? We need it indexed by atime for the purposes of eviction, but we need it indexed by object name for the purposes of updating the list upon a usage.

Allen

-----Original Message-----
From: Sage Weil
Sent: Wednesday, July 22, 2015 11:51 AM
Subject: RE: The design of the eviction improvement

> How about this: the basic mapping we want is object -> atime. We keep a
> simple LRU of the top N objects in memory with the object -> atime values.
> When an object is accessed, it is moved (or added) to the top of the list.
> Periodically, or when the LRU size reaches N * (1.x), we flush [...]
RE: The design of the eviction improvement
On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
> > The part that worries me now is the speed with which we can load and
> > manage such a list. Assuming it is several hundred MB, it'll take a while
> > to load that into memory and set up all the pointers (assuming a
> > conventional linked list structure). Maybe tens of seconds...
>
> I'm thinking of maintaining the lists at the PG level. That's to say, we
> have an active/inactive list for every PG. We can load the lists in
> parallel during rebooting. Also, the ~100 MB lists are split among
> different OSD nodes. Perhaps it does not need so long a time to load them?
>
> > I wonder if instead we should construct some sort of flat model where we
> > load slabs of contiguous memory, tens of MB each, and have the
> > next/previous pointers be a (slab, position) pair. That way we can load
> > it into memory in big chunks, quickly, and be able to operate on it
> > (adjust links) immediately. Another thought: currently we use the
> > hobject_t hash only instead of the full object name. We could continue
> > to do the same, or we could do a hash pair (the hobject_t hash plus a
> > different hash of the rest of the object name) to keep the
> > representation compact. With a model like the above, that could get the
> > object representation down to 2 u32's. A link could be a slab + position
> > (2 more u32's), and if we have prev + next that'd be just 6x4 = 24 bytes
> > per object.
>
> Looks like for an object, the head and the snapshot version have the same
> hobject hash. Thus we have to use the hash pair instead of just the
> hobject hash. But I still have two questions if we use the hash pair to
> represent an object.
>
> 1) Does the hash pair uniquely identify an object? That's to say, is it
> possible for two objects to have the same hash pair?

With two hashes, collisions would be rare but could happen.

> 2) We need a way to get the full object name from the hash pair, so that
> we know what objects to evict. But it seems like we don't have a good way
> to do this?

Ah, yeah--I'm a little stuck in the current hitset view of things. I think we can either embed the full ghobject_t (which means we lose the fixed-size property, and the per-object overhead goes way up... probably from ~24 bytes to more like 80 or 100), or we can enumerate objects starting at the (hobject_t) hash position to find the object. That's somewhat inefficient for FileStore (it'll list a directory of a hundred or so objects, probably, and iterate over them to find the right one), but for NewStore it will be quite fast: NewStore has all objects sorted into keys in rocksdb, so we just start listing at the right offset. Usually we'll get the object right off, unless there are hobject_t hash collisions (already reasonably rare, since it's a 2^32 space for the pool).

Given that, I would lean toward the 2-hash fixed-sized records (of these 2 options)...

sage
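A sketch of the flat slab model (illustrative only; the real structure would be packed u32's in contiguous, mmap-able slabs rather than Python objects, but the link mechanics are the same):

```python
import array

# Each entry is conceptually 6 u32's: (hobject_hash, name_hash,
# prev_slab, prev_pos, next_slab, next_pos) = 24 bytes. Slabs are
# contiguous arrays, so they can be loaded from disk in big chunks and
# the (slab, position) links are valid immediately -- no pointer fix-up
# pass on load.
SLAB_ENTRIES = 4      # tiny for the demo; tens of MB per slab in practice
FIELDS = 6            # u32's per entry
NIL = 0xFFFFFFFF      # sentinel for "no link"

class SlabList:
    def __init__(self):
        self.slabs = []                  # each slab: array('I')
        self.free = []                   # free (slab, pos) slots
        self.head = self.tail = (NIL, NIL)

    def _alloc(self):
        if not self.free:
            s = len(self.slabs)
            self.slabs.append(array.array('I', [0] * SLAB_ENTRIES * FIELDS))
            self.free = [(s, p) for p in reversed(range(SLAB_ENTRIES))]
        return self.free.pop()

    def push_front(self, hobj_hash, name_hash):
        slab, pos = self._alloc()
        base = pos * FIELDS
        self.slabs[slab][base:base + FIELDS] = array.array(
            'I', [hobj_hash, name_hash, NIL, NIL, *self.head])
        if self.head != (NIL, NIL):      # link the old head back to us
            hs, hp = self.head
            self.slabs[hs][hp * FIELDS + 2:hp * FIELDS + 4] = \
                array.array('I', [slab, pos])
        else:
            self.tail = (slab, pos)
        self.head = (slab, pos)
        return slab, pos

    def walk(self):
        # Iterate newest to oldest following (slab, pos) next-links.
        link = self.head
        while link != (NIL, NIL):
            slab, pos = link
            base = pos * FIELDS
            a = self.slabs[slab]
            yield a[base], a[base + 1]   # the two hashes
            link = (a[base + 4], a[base + 5])
```

Adjusting links on access is pure index arithmetic within already-loaded slabs, which is the property Sage is after.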
RE: The design of the eviction improvement
Hi Allen,

-----Original Message-----
From: Allen Samuels [mailto:allen.samu...@sandisk.com]
Sent: Thursday, July 23, 2015 2:41 AM

> I'm very concerned about designing around the assumption that objects are
> ~1MB in size. That's probably a good assumption for block- and
> HDFS-dominated systems, but likely a very poor assumption about many
> object- and file-dominated systems.

This is true. If we have lots of small objects/files, the memory used for the LRU lists could be extremely large.

> What's really needed is some way to extend the lists into storage that
> doesn't cause a zillion I/O operations. I have some vague idea that a data
> structure like the LSM mechanism ought to be able to accomplish what we
> want.

The LSM mechanism could solve the memory consumption problem. But I guess the process of choosing which objects to evict would be complex and inefficient. Also, after evicting some objects, we'd need to update the on-disk file to remove the entries for those objects. This is inefficient, too.
RE: The design of the eviction improvement
-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Thursday, July 23, 2015 2:51 AM

> How about this: the basic mapping we want is object -> atime. We keep a
> simple LRU of the top N objects in memory with the object -> atime values.
> When an object is accessed, it is moved (or added) to the top of the list.
> Periodically, or when the LRU size reaches N * (1.x), we flush:
>
> - write the top N items to a compact object that can be quickly loaded
> - write the records for the oldest items (N .. N*1.x) to leveldb/rocksdb
>   as simple object -> atime entries
>
> When the agent runs, we just walk across that key range of the db the same
> way we currently enumerate objects. For each record we use either the
> stored atime or the value in the in-memory LRU (it'll need to be
> dual-indexed, by both a list and a hash map), whichever is newer. We can
> use the same histogram estimation approach we use now to determine whether
> the object in question is below the flush/evict threshold.

This looks similar to what we do now, except that it keeps an LRU of the object -> atime mapping in RAM instead of using a hitset. The object age calculated from the atime would be more accurate than the current hitset-and-mtime approach.

One comment: I think if we can find a record of an object in the in-memory LRU list, we don't need to query the DB, since the atime in the LRU list is for sure newer than the one in the db (if the db has one at all).

My concern with this approach is whether the evict decision made by the histogram estimation approach is good enough. It only measures 'recency', and it makes the decision based on some threshold, not starting from the oldest. In contrast, most practical algorithms make the decision based on both 'recency' and 'frequency' (accessed once recently vs. accessed twice or more recently). If we believe the histogram estimation approach is good enough, I think we can easily integrate the idea above with 2Q:

1) The in-memory LRU lists are the same as in 2Q; i.e., there are active/inactive lists, and the movements between the two lists are the same as what I stated in the original mail. But we cap the size of the lists: only the top N hottest objects are kept in them.
2) When the size of the lists exceeds N*(1.x), evict the oldest items (N .. N*1.x) to the db store.
3) N could be dynamically adjusted based on the average size of objects in the PG.
4) Evict decisions are made by the histogram estimation approach. Plus, I think we want to evict those objects which are not in the in-memory lists first.
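A rough sketch of the capped 2Q variant in points 1)-4) above (names and the promotion rule are illustrative; the real lists would be maintained per PG):

```python
from collections import OrderedDict

class CappedTwoQ:
    """2Q-style tracking with a bounded memory footprint: a first access
    puts an object on the inactive list, a second access promotes it to
    the active list, and when the combined size passes N*(1.x) the
    coldest entries are demoted to an on-disk store (a dict here),
    inactive list first."""

    def __init__(self, n, slack=0.25):
        self.n = n
        self.high_water = int(n * (1 + slack))
        self.active = OrderedDict()    # obj -> atime, hottest last
        self.inactive = OrderedDict()
        self.db = {}                   # stands in for leveldb/rocksdb

    def touch(self, obj, atime):
        if obj in self.active:         # already hot: just refresh recency
            self.active.pop(obj)
            self.active[obj] = atime
        elif obj in self.inactive:     # second recent access: promote
            self.inactive.pop(obj)
            self.active[obj] = atime
        else:                          # first access: inactive list
            self.inactive[obj] = atime
        if len(self.active) + len(self.inactive) > self.high_water:
            self._demote()

    def _demote(self):
        # Batch-demote the coldest entries to the db until we're back to
        # N, taking from the inactive list before touching the active one.
        while len(self.active) + len(self.inactive) > self.n:
            src = self.inactive if self.inactive else self.active
            obj, atime = src.popitem(last=False)   # oldest first
            self.db[obj] = atime
```

Demoting from the inactive list first also matches the "evict objects not in the in-memory lists first" preference: anything twice-accessed stays in RAM longest.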
RE: The design of the eviction improvement
On Wed, 22 Jul 2015, Allen Samuels wrote: Don't we need to double-index the data structure? We need it indexed by atime for the purposes of eviction, but we need it indexed by object name for the purposes of updating the list upon a usage.

If you use the same approach the agent uses now (iterate over items, evict/trim anything in the bottom end of the observed age distribution) you can get away without the double index. Iterating over the LSM should be quite cheap; I'd be more worried about the cost of the insertions. I'm also not sure the simplistic approach below can be generalized to something like 2Q (and certainly not to something like MQ). Maybe...

On the other hand, I'm not sure it is the end of the world if at the end of the day the memory requirements for a cache-tier OSD are higher and inversely proportional to the object size. We can make the OSD flush/evict more aggressively, as a safety mechanism, if the memory utilization (due to a high object count) gets out of hand. Paying a few extra $$ for RAM isn't the end of the world, I'm guessing, when the performance payoff is significant...

sage
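The "bottom end of the observed age distribution" idea might look like the following. This is a hypothetical sketch, not the actual agent code: `age_cutoff`, `agent_pass`, and the 20% default are made-up names and numbers.

```python
def age_cutoff(sample_atimes, now, evict_fraction):
    """Estimate the atime below which roughly `evict_fraction` of the
    objects fall, from a sample of observed atimes. No atime-sorted
    index is needed: one pass over the records plus a sort of the
    sample."""
    ages = sorted(now - t for t in sample_atimes)    # youngest first
    k = int(len(ages) * (1 - evict_fraction))        # start of the cold tail
    return now - ages[min(k, len(ages) - 1)]

def agent_pass(objects, now, evict_fraction=0.2):
    """One agent sweep: walk the object -> atime records in any order
    and pick those at or past the estimated cutoff."""
    cutoff = age_cutoff(objects.values(), now, evict_fraction)
    return [name for name, atime in objects.items() if atime <= cutoff]
```

Because eviction only needs "is this object in the coldest X%?", iteration order doesn't matter, which is why the atime-sorted second index can be dropped.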
RE: The design of the eviction improvement
Yes, the cost of the insertions with the current scheme is probably prohibitive. Wouldn't it approach the same amount of time as just having atime turned on in the file system? My concern about the memory is mostly that we ensure whatever algorithm is selected degrades gracefully when you get high counts of small objects. I agree that paying $ for RAM that translates into actual performance isn't really a problem. It really boils down to your workload and access pattern.

Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030 | M: +1 408 780 6416 allen.samu...@sandisk.com
Re: The design of the eviction improvement
Thanks for the explanations, Greg.

- Gregory Farnum g...@gregs42.com wrote: On Tue, Jul 21, 2015 at 3:15 PM, Matt W. Benjamin m...@cohortfs.com wrote: Hi, Couple of points. 1) A successor to 2Q is MQ (Li et al). We have an intrusive MQ LRU implementation with 2 levels currently, plus a pinned queue, that addresses stuff like partitioning (sharding), scan resistance, and coordination with lookup tables. We might extend/re-use it. 2) I'm a bit confused by the active/inactive vocabulary, the dimensioning of cache segments (are you proposing to/do we now always cache whole objects?), and the cost of looking for dirty objects; I suspect that it makes sense to amortize the cost of locating segments eligible to be flushed, rather than minimize bookkeeping.

We make caching decisions in terms of whole objects right now, yeah. There's really nothing in the system that's capable of doing segments within an object, and it's not just about tracking a little more metadata about dirty objects — the way we handle snapshots, etc. would have to be reworked if we were allowing partial-object caching. Plus, keep in mind the IO cost of the bookkeeping — it needs to be either consistently persisted to disk or reconstructable from whatever happens to be in the object. That can get expensive really fast. -Greg

For the current semantics/structure of PGs + a specific tier held fixed, that makes sense. For our object addressing currently, we have a greater requirement for partial object caching. (Partly, we did this to achieve periodicity with sequential I/O.) I think broadly, there are large performance tradeoffs here. In AFS and DCE, there is full consistency in materialized caches. Also, caches are dimensioned by chunks. If the cache is materialized in memory, the semantics aren't those of disk. Basically, consistency guarantees are policy. Different snapshot mechanisms, or omitting them, e.g., should logically enable relaxed consistency, modulo policy.
Matt

- Zhiqiang Wang zhiqiang.w...@intel.com wrote:

-Original Message- From: ceph-devel-ow...@vger.kernel.org On Behalf Of Sage Weil Sent: Tuesday, July 21, 2015 6:38 AM To: Wang, Zhiqiang Cc: sj...@redhat.com; ceph-devel@vger.kernel.org Subject: Re: The design of the eviction improvement

On Mon, 20 Jul 2015, Wang, Zhiqiang wrote: Hi all, This is a follow-up of one of the CDS sessions at http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction. We discussed the drawbacks of the current eviction algorithm and several ways to improve it. Seems like the LRU variants are the right way to go. I've come up with some design points after the CDS, and want to discuss them with you. It is an approximate 2Q algorithm, combining some benefits of the clock algorithm, similar to what the Linux kernel does for the page cache.

Unfortunately I missed this last CDS so I'm behind on the discussion. I have a few questions though...

# Design points:

## LRU lists

- Maintain LRU lists at the PG level. The SharedLRU and SimpleLRU implementations in the current code have a max_size, which limits the max number of elements in the list. This mostly looks like an MRU, though the name implies they are LRUs. Since the object size may vary in a PG, it's not possible to calculate the total number of objects which the cache tier can hold ahead of time. We need a new LRU implementation with no limit on the size.

This last sentence seems to me to be the crux of it. Assuming we have an OSD backed by flash storing O(n) objects, we need a way to maintain an LRU of O(n) objects in memory. The current hitset-based approach was taken based on the assumption that this wasn't feasible, or at least we didn't know how to implement such a thing. If it is, or we simply want to stipulate that cache tier OSDs get gobs of RAM to make it possible, then lots of better options become possible...
Let's say you have a 1TB SSD with an average object size of 1MB -- that's 1 million objects. At maybe ~100 bytes of RAM per object for an LRU entry, that's 100MB... so not so unreasonable, perhaps!

I was having the same question before proposing this. I did a similar calculation and thought it would be ok to use this much memory :-)

- Two lists for each PG: active and inactive. Objects are first put into the inactive list when they are accessed, and moved between these two lists based on some criteria. Object flags: active, referenced, unevictable, dirty.
- When an object is accessed: 1) If it's not in either of the lists, it's put on the top of the inactive list. 2) If it's in the inactive list, and the referenced flag is not set, the referenced flag is set, and it's
RE: The design of the eviction improvement
-Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Tuesday, July 21, 2015 9:29 PM To: Wang, Zhiqiang Cc: sj...@redhat.com; ceph-devel@vger.kernel.org Subject: RE: The design of the eviction improvement

On Tue, 21 Jul 2015, Wang, Zhiqiang wrote:
The part that worries me now is the speed with which we can load and manage such a list. Assuming it is several hundred MB, it'll take a while to load that into memory and set up all the pointers (assuming a conventional linked list structure). Maybe tens of seconds...

I'm thinking of maintaining the lists at the PG level. That's to say, we have an active/inactive list for every PG. We can load the lists in parallel during rebooting. Also, the ~100MB of lists is split among different OSD nodes. Perhaps it does not need such a long time to load them?

I wonder if instead we should construct some sort of flat model where we load slabs of contiguous memory, 10's of MB each, and have the next/previous pointers be a (slab, position) pair. That way we can load it into memory in big chunks, quickly, and be able to operate on it (adjust links) immediately.

Another thought: currently we use the hobject_t hash only instead of the full object name. We could continue to do the same, or we could use a hash pair (the hobject_t hash + a different hash of the rest of the object name) to keep the representation compact. With a model like the above, that could get the object representation down to 2 u32's. A link could be a slab + position (2 more u32's), and if we have prev + next that'd be just 6x4 = 24 bytes per object.
Looks like for an object, the head and the snapshot versions have the same hobject hash, so we would have to use the hash pair instead of just the hobject hash. But I still have two questions if we use the hash pair to represent an object. 1) Does the hash pair uniquely identify an object? That's to say, is it possible for two objects to have the same hash pair? 2) We need a way to get the full object name from the hash pair, so that we know which objects to evict. But it seems we don't have a good way to do this?

With fixed-sized slots on the slabs, the slab allocator could be very simple... maybe just a bitmap, a free counter, and any other trivial optimizations to make finding a slab's next free slot nice and quick.
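A minimal sketch of that flat model (Python for illustration; the `SlabList` name, the slot count, and a per-slab free list in place of the suggested bitmap are all assumptions): each entry is six u32's — the two-hash object representation plus (slab, position) prev/next links — so a slab can be dumped to and reloaded from disk as one contiguous chunk with the links still valid.

```python
import array

SLAB_SLOTS = 4096                      # real slabs would be 10's of MB
NIL = (0xFFFFFFFF, 0xFFFFFFFF)         # sentinel (slab, position) link

class SlabList:
    """Doubly linked LRU list stored in flat slabs of fixed-size
    slots; links are (slab, position) pairs, never raw pointers.
    Entry layout per slot: h1 h2 prev_slab prev_pos next_slab next_pos."""

    def __init__(self):
        self.slabs = []                # each slab: array of 6*SLAB_SLOTS u32s
        self.free = []                 # per-slab free slot indexes
        self.head = NIL                # hottest end
        self.tail = NIL                # coldest end

    def _alloc(self):
        for s, free in enumerate(self.free):
            if free:
                return (s, free.pop())
        # no free slot anywhere: grow by one contiguous slab
        self.slabs.append(array.array('I', [0] * (6 * SLAB_SLOTS)))
        self.free.append(list(range(SLAB_SLOTS - 1, 0, -1)))
        return (len(self.slabs) - 1, 0)

    def _field(self, ref, i):
        s, p = ref
        return self.slabs[s][p * 6 + i]

    def _set(self, ref, i, v):
        s, p = ref
        self.slabs[s][p * 6 + i] = v

    def push_front(self, h1, h2):
        """Insert an object (as its hash pair) at the hot end."""
        ref = self._alloc()
        self._set(ref, 0, h1); self._set(ref, 1, h2)
        self._set(ref, 2, NIL[0]); self._set(ref, 3, NIL[1])        # prev
        self._set(ref, 4, self.head[0]); self._set(ref, 5, self.head[1])
        if self.head != NIL:
            self._set(self.head, 2, ref[0]); self._set(self.head, 3, ref[1])
        else:
            self.tail = ref
        self.head = ref
        return ref

    def pop_back(self):
        """Remove and return the hash pair at the cold end."""
        if self.tail == NIL:
            raise IndexError("pop from empty SlabList")
        ref = self.tail
        h = (self._field(ref, 0), self._field(ref, 1))
        prev = (self._field(ref, 2), self._field(ref, 3))
        if prev != NIL:
            self._set(prev, 4, NIL[0]); self._set(prev, 5, NIL[1])
        else:
            self.head = NIL
        self.tail = prev
        self.free[ref[0]].append(ref[1])
        return h
```

Loading is then just reading each slab back into a contiguous buffer; no pointer fix-up pass is needed, matching the "load big chunks and operate immediately" goal.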
Re: The design of the eviction improvement
Hi,

- Zhiqiang Wang zhiqiang.w...@intel.com wrote: Hi Matt,

1) A successor to 2Q is MQ (Li et al). We have an intrusive MQ LRU implementation with 2 levels currently, plus a pinned queue, that addresses stuff like partitioning (sharding), scan resistance, and coordination with lookup tables. We might extend/re-use it.

The MQ algorithm is more complex, and seems like it has more overhead than 2Q. The approximate 2Q algorithm here combines some benefits of the clock algorithm, and works well in the Linux kernel. MQ could be another choice. There are some other candidates like LIRS, ARC, etc., which have been deployed in some practical systems.

MQ has been deployed in practical systems, and is more general.

2) I'm a bit confused by the active/inactive vocabulary, the dimensioning of cache segments (are you proposing to/do we now always cache whole objects?), and the cost of looking for dirty objects; I suspect that it makes sense to amortize the cost of locating segments eligible to be flushed, rather than minimize bookkeeping.

Though currently the caching decisions are made in units of whole objects, as Greg pointed out in another mail, I think we still have something to improve here. I'll come back to this some time later.
Matt
RE: The design of the eviction improvement
- When an object is accessed:
1) If it's not in either of the lists, it's put on the top of the inactive list.
2) If it's in the inactive list and the referenced flag is not set, the referenced flag is set and it's moved to the top of the inactive list.
3) If it's in the inactive list and the referenced flag is set, the referenced flag is cleared, and it's removed from the inactive list and put on top of the active list.
4) If it's in the active list and the referenced flag is not set, the referenced flag is set and it's moved to the top of the active list.
5) If it's in the active list and the referenced flag is set, it's moved to the top of the active list.

- When selecting objects to evict:
1) Objects at the bottom of the inactive list are selected to evict. They are removed from the inactive list.
2) If the number of objects in the inactive list becomes low, some of the objects at the bottom of the active list are moved to the inactive list. Those which have the referenced flag set are given one more chance: they are moved back to the top of the active list with the referenced flag cleared.
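To make the flat model concrete, here is a minimal sketch (illustrative names, not Ceph code) of an LRU kept in parallel flat arrays, where each entry is two u32 hashes and the prev/next links are u32 indices, matching the 24-bytes-per-object layout described above. A real implementation would carve the arrays into fixed-size slabs and encode links as (slab, position) pairs; a single flat array is used here for brevity.

```python
NIL = 0xFFFFFFFF  # null link, fits in a u32

class SlabLRU:
    def __init__(self):
        # parallel flat arrays: 2 u32 hashes + prev/next u32 links = 24 B/entry
        self.hash1 = []   # hobject_t hash
        self.hash2 = []   # hash of the rest of the object name
        self.prev = []
        self.next = []
        self.head = NIL   # most recently used
        self.tail = NIL   # least recently used
        self.index = {}   # (hash1, hash2) -> slot; rebuilt after loading

    def _unlink(self, i):
        p, n = self.prev[i], self.next[i]
        if p != NIL:
            self.next[p] = n
        else:
            self.head = n
        if n != NIL:
            self.prev[n] = p
        else:
            self.tail = p

    def _push_front(self, i):
        self.prev[i] = NIL
        self.next[i] = self.head
        if self.head != NIL:
            self.prev[self.head] = i
        self.head = i
        if self.tail == NIL:
            self.tail = i

    def touch(self, h1, h2):
        """Record an access: add the entry or move it to the front."""
        i = self.index.get((h1, h2))
        if i is None:
            i = len(self.hash1)
            self.hash1.append(h1)
            self.hash2.append(h2)
            self.prev.append(NIL)
            self.next.append(NIL)
            self.index[(h1, h2)] = i
        else:
            self._unlink(i)
        self._push_front(i)

    def coldest(self):
        """The eviction candidate: the entry at the tail."""
        if self.tail == NIL:
            return None
        return (self.hash1[self.tail], self.hash2[self.tail])
```

Loading such a structure from disk is then a bulk read of the flat arrays plus one pass to rebuild the hash index, rather than millions of pointer fixups.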
Re: The design of the eviction improvement
Hi,

A couple of points.

1) A successor to 2Q is MQ (Li et al). We have an intrusive MQ LRU implementation with two levels currently, plus a pinned queue, that addresses stuff like partitioning (sharding), scan resistance, and coordination with lookup tables. We might extend/re-use it.

2) I'm a bit confused by the active/inactive vocabulary, the dimensioning of cache segments (are you proposing to/do we now always cache whole objects?), and the cost of looking for dirty objects; I suspect that it makes sense to amortize the cost of locating segments eligible to be flushed, rather than minimize bookkeeping.

Matt
- When selecting objects to evict:
1) Objects at the bottom of the inactive list are selected to evict. They are removed from the inactive list.
2) If the number of objects in the inactive list becomes low, some of the objects at the bottom of the active list are moved to the inactive list. Those which have the referenced flag set are given one more chance in the active list: they are moved to the top of the active list with the referenced flag cleared. Those which don't have the referenced flag set are moved to the inactive list with the referenced flag set, so that they can be quickly promoted back to the active list when necessary.

## Combine flush with eviction

- When evicting an object, if it's dirty, it's flushed first. After flushing, it's evicted. If not dirty, it's evicted directly.
- This means that we won't have separate activities and won't set different ratios for flush and evict. Is there a need to do so?
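For reference, the access and demotion rules under discussion condense into a small state machine. This is an illustrative sketch only (an OrderedDict stands in for the intrusive lists; the end of the dict is the "top"; all names are hypothetical, not Ceph code):

```python
from collections import OrderedDict

class TwoListCache:
    """Approximate 2Q: active/inactive lists with a clock-style
    referenced bit, per the rules in the proposal."""
    def __init__(self):
        # OrderedDict end = top of list; value = referenced flag
        self.inactive = OrderedDict()
        self.active = OrderedDict()

    def access(self, obj):
        if obj in self.inactive:
            if not self.inactive[obj]:
                # rule 2: set referenced, move to top of inactive
                self.inactive[obj] = True
                self.inactive.move_to_end(obj)
            else:
                # rule 3: clear referenced, promote to top of active
                del self.inactive[obj]
                self.active[obj] = False
        elif obj in self.active:
            # rules 4/5: referenced ends up set, move to top of active
            self.active[obj] = True
            self.active.move_to_end(obj)
        else:
            # rule 1: new entry goes to top of inactive, unreferenced
            self.inactive[obj] = False

    def evict_one(self):
        # eviction rule 1: take the bottom (oldest) of the inactive list
        if not self.inactive:
            return None
        obj = next(iter(self.inactive))
        del self.inactive[obj]
        return obj

    def refill_inactive(self, n):
        # eviction rule 2: demote from the bottom of the active list
        moved = 0
        while moved < n and self.active:
            obj = next(iter(self.active))
            ref = self.active.pop(obj)
            if ref:
                self.active[obj] = False   # one more chance, ref cleared
            else:
                self.inactive[obj] = True  # demoted with ref set
                moved += 1
```

The referenced bit is what gives the scheme its clock flavor: two touches are needed to promote out of the inactive list, which keeps one-shot scans from flooding the active list.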
Re: The design of the eviction improvement
On Tue, Jul 21, 2015 at 3:15 PM, Matt W. Benjamin m...@cohortfs.com wrote:

We make caching decisions in terms of whole objects right now, yeah. There's really nothing in the system that's capable of doing segments within an object, and it's not just about tracking a little more metadata about dirty objects -- the way we handle snapshots, etc. would have to be reworked if we were allowing partial-object caching. Plus, keep in mind the I/O cost of the bookkeeping -- it needs to be either consistently persisted to disk or reconstructable from whatever happens to be in the object. That can get expensive really fast.

-Greg
Re: The design of the eviction improvement
On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
## Combine flush with eviction

- When evicting an object, if it's dirty, it's flushed first. After flushing, it's evicted. If not dirty, it's evicted directly.
- This means that we won't have separate activities and won't set different ratios for flush and evict. Is there a need to do so?
- Number of objects to evict at a time: 'evict_effort' acts as the priority, which is used to calculate the number of objects to evict.
As someone else mentioned in a follow-up, the reason we let the dirty level be set lower than the full level is that it provides headroom so that objects can be quickly evicted (delete, no flush) to make room for new writes or new promotions. That said, we probably can/should streamline the flush so that an evict can immediately follow without waiting for the agent to come around again. (I don't think we do that now?)

sage

## LRU lists snapshotting

- The two lists are snapshotted and persisted periodically.
- Only one copy needs to be saved. The old copy is removed when persisting the lists. The saved lists are used to restore the LRU lists when the OSD reboots.

Any comments/feedback are welcome.
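The streamlining Sage suggests -- evicting immediately on flush completion instead of waiting for the next agent pass -- might be wired up roughly as below. All names are hypothetical; the real agent is asynchronous and the callback would fire on the base tier's ack, which the synchronous FakeStore only stands in for.

```python
class FakeStore:
    """Toy stand-in for the cache tier, for illustration only."""
    def __init__(self, dirty):
        self.dirty = set(dirty)
        self.flushed = []
        self.evicted = []

    def is_dirty(self, obj):
        return obj in self.dirty

    def flush(self, obj, on_complete):
        # write the object back to the base tier, then fire the callback
        self.flushed.append(obj)
        self.dirty.discard(obj)
        on_complete()  # synchronous stand-in for the async flush ack

    def evict(self, obj):
        self.evicted.append(obj)

def agent_process(obj, store):
    """Evict obj; if it is dirty, flush first and evict on completion,
    rather than leaving it for a later agent pass."""
    if store.is_dirty(obj):
        store.flush(obj, on_complete=lambda: store.evict(obj))
    else:
        store.evict(obj)
```

Chaining the evict onto the flush completion removes the window where a flushed-but-not-yet-evicted object occupies cache space for a full agent cycle.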
RE: The design of the eviction improvement
This seems much better than the current mechanism. Do you have an estimate of the memory consumption of the two lists (in terms of bytes per object)?

Allen Samuels
Software Architect, Systems and Software Solutions
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org On Behalf Of Wang, Zhiqiang
Sent: Monday, July 20, 2015 1:47 AM
To: Sage Weil; sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: The design of the eviction improvement
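As a rough answer to the bytes-per-object question, combining the numbers from earlier in the thread (1 million objects in a 1 TB tier, Sage's ~100-byte conventional entry vs. the 24-byte compact slab entry; ballpark figures, not measurements):

```python
# Back-of-envelope memory estimate; entry sizes are the thread's
# estimates, not measured values.
objects = 1_000_000                # ~1 TB tier / ~1 MB average object
conventional = objects * 100       # ~100 B/object: pointers, name, flags
compact = objects * 24             # 2 u32 hashes + 2 (slab,pos) u32 links
print(f"conventional: {conventional / 2**20:.0f} MiB, "
      f"compact: {compact / 2**20:.0f} MiB")
```

So the compact layout drops the footprint from roughly 95 MiB to about 23 MiB per million cached objects, before any per-PG bookkeeping overhead.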
RE: The design of the eviction improvement
Hi Nick,

-Original Message-
From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Monday, July 20, 2015 5:28 PM
To: Wang, Zhiqiang; 'Sage Weil'; sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: RE: The design of the eviction improvement
I really like this idea, but just out of interest, there must be a point where the overhead of managing much larger lists of very cold objects starts to outweigh the gains of having exactly the right objects in each tier. If 90% of your hot I/O is in 10% of the total data, how much extra benefit would you get by tracking all objects vs. just tracking the top 10, 20, 30%... etc., and evicting randomly after that? If these objects are being accessed infrequently, the impact of re-promoting is probably minimal, and if the promotion code can get to a point where it is being a bit more intelligent about which objects are promoted, then this is probably even more so.

The idea is that the lists only hold the objects in the cache tier. Objects which are cold enough are evicted from the cache tier and removed from the lists. Also, the lists are maintained at the PG level.
I guess the lists won't be extremely large? In your example of the 90%/10% access pattern, it may be right that randomly evicting the 90% cold data is good enough. But we need a way to know which objects make up the 10% of hot data. Also, we can't assume the 90%/10% pattern for every workload.

## Combine flush with eviction

- When evicting an object, if it's dirty, it's flushed first. After flushing, it's evicted. If not dirty, it's evicted directly.
- This means that we won't have separate activities and won't set different ratios for flush and evict. Is there a need to do so?

I think you want to flush at a lower limit because flushes can be more expensive than evictions
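Nick's point about flushing at a lower limit is essentially the dirty/full split discussed elsewhere in the thread: start flushing dirty objects early so that, when the tier fills, there is always a pool of clean objects that can be evicted instantly (delete, no I/O). A toy sketch, with illustrative thresholds rather than Ceph's actual config names or defaults:

```python
def agent_mode(used_ratio, dirty_ratio,
               flush_target=0.4, evict_target=0.8):
    """Decide what the tiering agent should do this pass.

    flush_target sits well below evict_target on purpose: flushing is
    the expensive operation (a write to the base tier), so it starts
    early; eviction of already-clean objects is nearly free.
    """
    ops = []
    if dirty_ratio > flush_target:
        ops.append("flush")    # write dirty objects back ahead of need
    if used_ratio > evict_target:
        ops.append("evict")    # clean objects can be dropped immediately
    return ops
```

With a single combined threshold, a burst of writes could force the agent to flush and evict in the critical path; the lower flush threshold buys headroom.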
RE: The design of the eviction improvement
As someone else mentioned in a follow-up, the reason we let the dirty level be set lower than the full level is that it provides headroom so that objects can be quickly evicted (delete, no flush) to make room for new writes or new promotions. That said, we probably can/should streamline the flush so that an evict can immediately follow without waiting for the agent to come around again. (I don't think we do that now?)

I was afraid of having to maintain different lists for flush/evict, so I proposed combining flush with evict. But it seems we don't need to. When reaching