RE: The design of the eviction improvement

2015-07-22 Thread Sage Weil
On Wed, 22 Jul 2015, Allen Samuels wrote:
 I'm very concerned about designing around the assumption that objects 
 are ~1MB in size. That's probably a good assumption for block and HDFS 
 dominated systems, but likely a very poor assumption about many object 
 and file dominated systems.
 
 If I understand the proposals that have been discussed, each of them 
 assumes an in-memory data structure with an entry per object (the exact 
 size of the entry varies with the different proposals).
 
 Under that assumption, I have another concern, which is the lack of 
 graceful degradation as the object counts grow and the in-memory data 
 structures get larger. Everything seems fine until just a few more 
 objects get added; then the system starts to page and performance drops 
 dramatically, likely to the point where Linux will start killing OSDs.
 
 What's really needed is some kind of way to extend the lists into 
 storage in a way that doesn't cause a zillion I/O operations.
 
 I have some vague idea that some data structure like the LSM mechanism 
 ought to be able to accomplish what we want. Some amount of the data 
 structure (the most likely to be used) is held in DRAM [and backed to 
 storage for restart] and the least likely to be used is flushed to 
 storage with some mechanism that allows batched updates.

How about this:

The basic mapping we want is object -> atime.

We keep a simple LRU of the top N objects in memory with the object->atime 
values.  When an object is accessed, it is moved or added to the top 
of the list.

Periodically, or when the LRU size reaches N * (1.x), we flush:

 - write the top N items to a compact object that can be quickly loaded
 - write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb 
in a simple object -> atime fashion

When the agent runs, we just walk across that key range of the db the same 
way we currently enumerate objects.  For each record we use either the 
stored atime or the value in the in-memory LRU (it'll need to be 
dual-indexed by both a list and a hash map), whichever is newer.  We can 
use the same histogram estimation approach we do now to determine if the 
object in question is below the flush/evict threshold.

The LSM does the work of sorting/compacting the atime info, while we avoid 
touching it at all for the hottest objects to keep the amount of work it 
has to do in check.
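
To make the shape of this concrete, here is a minimal sketch of that 
dual-indexed structure and the flush step (hypothetical C++; KVStore is a 
stand-in for the leveldb/rocksdb binding, and none of these names are actual 
Ceph interfaces):

#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>

// Stand-in for leveldb/rocksdb: a sorted key space the LSM would compact.
struct KVStore {
  std::map<std::string, uint64_t> kv;
  void put(const std::string& key, uint64_t atime) { kv[key] = atime; }
};

class AtimeLRU {
  using Entry = std::pair<std::string, uint64_t>;  // object name -> atime
  std::list<Entry> lru;                            // MRU at the front
  std::unordered_map<std::string, std::list<Entry>::iterator> index;
  std::size_t N;   // number of hot entries kept in memory
  double slack;    // flush when the size reaches N * slack (the "1.x")
  KVStore* db;

public:
  AtimeLRU(std::size_t n, double s, KVStore* kv) : N(n), slack(s), db(kv) {}

  // On access: move (or add) the object to the top of the list.
  void touch(const std::string& oid, uint64_t atime) {
    auto it = index.find(oid);
    if (it != index.end())
      lru.erase(it->second);
    lru.emplace_front(oid, atime);
    index[oid] = lru.begin();
    if (lru.size() >= static_cast<std::size_t>(N * slack))
      flush();
  }

  // Write records for the oldest items (N .. N*1.x) to the KV store and
  // drop them from memory; the LSM sorts/compacts them in the background.
  void flush() {
    while (lru.size() > N) {
      const Entry& e = lru.back();
      db->put(e.first, e.second);
      index.erase(e.first);
      lru.pop_back();
    }
  }
};

The list gives eviction order, the hash map gives O(1) lookup on access, and 
only the cold tail ever touches the LSM.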

sage

RE: The design of the eviction improvement

2015-07-22 Thread Allen Samuels
I'm very concerned about designing around the assumption that objects are ~1MB 
in size. That's probably a good assumption for block and HDFS dominated 
systems, but likely a very poor assumption about many object and file dominated 
systems.

If I understand the proposals that have been discussed, each of them assumes 
an in-memory data structure with an entry per object (the exact size of the 
entry varies with the different proposals).

Under that assumption, I have another concern, which is the lack of graceful 
degradation as the object counts grow and the in-memory data structures get 
larger. Everything seems fine until just a few more objects get added; then 
the system starts to page and performance drops dramatically, likely to the 
point where Linux will start killing OSDs.

What's really needed is some kind of way to extend the lists into storage in a 
way that doesn't cause a zillion I/O operations.

I have some vague idea that some data structure like the LSM mechanism ought to 
be able to accomplish what we want. Some amount of the data structure (the most 
likely to be used) is held in DRAM [and backed to storage for restart] and the 
least likely to be used is flushed to storage with some mechanism that allows 
batched updates.

Allen Samuels
Software Architect, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


Re: The design of the eviction improvement

2015-07-22 Thread Matt W. Benjamin
Hi,

- Allen Samuels allen.samu...@sandisk.com wrote:

 I'm very concerned about designing around the assumption that objects
 are ~1MB in size. That's probably a good assumption for block and HDFS
 dominated systems, but likely a very poor assumption about many object
 and file dominated systems.

++

 
 If I understand the proposals that have been discussed, each of them
 assumes an in-memory data structure with an entry per object (the
 exact size of the entry varies with the different proposals).
 
 Under that assumption, I have another concern, which is the lack of
 graceful degradation as the object counts grow and the in-memory data
 structures get larger. Everything seems fine until just a few more
 objects get added; then the system starts to page and performance
 drops dramatically, likely to the point where Linux will start killing
 OSDs.

I'm not clear why that needs to be the case (though I don't think it matters 
just now whether I am -- I was just letting folks know that we have MQ 
implementation(s)), but what you're describing seems consistent with the model 
Sage and Greg, at least, are describing.

Matt

 

RE: The design of the eviction improvement

2015-07-22 Thread Allen Samuels
Don't we need to double-index the data structure?

We need it indexed by atime for the purposes of eviction, but we need it 
indexed by object name for the purposes of updating the list upon access.




RE: The design of the eviction improvement

2015-07-22 Thread Sage Weil
On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
  The part that worries me now is the speed with which we can load and 
  manage such a list.  Assuming it is several hundred MB, it'll take a 
  while to load that into memory and set up all the pointers (assuming a 
  conventional linked list structure).  Maybe tens of seconds...
 
 I'm thinking of maintaining the lists at the PG level. That's to say, we 
 have an active/inactive list for every PG. We can load the lists in 
 parallel during rebooting. Also, the ~100 MB lists are split among 
 different OSD nodes. Perhaps it does not need such a long time to load 
 them?
 
  
  I wonder if instead we should construct some sort of flat model where 
  we load slabs of contiguous memory, 10's of MB each, and have the 
  next/previous pointers be a (slab,position) pair.  That way we can 
  load it into memory in big chunks, quickly, and be able to operate on 
  it (adjust links) immediately.
  
  Another thought: currently we use the hobject_t hash only instead of 
  the full object name.  We could continue to do the same, or we could 
  do a hash pair (hobject_t hash + a different hash of the rest of the 
  object) to keep the representation compact.  With a model like the 
  above, that could get the object representation down to 2 u32's.  A 
  link could be a slab + position (2 more u32's), and if we have prev + 
  next that'd be just 6x4=24 bytes per object.
 
 Looks like for an object, the head and the snapshot version have the 
 same hobject hash. Thus we have to use the hash pair instead of just the 
 hobject hash. But I still have two questions if we use the hash pair to 
 represent an object.

 1) Does the hash pair uniquely identify an object? That's to say, is it 
 possible for two objects to have the same hash pair?

With two hashes, collisions would be rare but could happen.

 2) We need a way to get the full object name from the hash pair, so that 
 we know what objects to evict. But seems like we don't have a good way 
 to do this?

Ah, yeah--I'm a little stuck in the current hitset view of things.  I 
think we can either embed the full ghobject_t (which means we lose the 
fixed-size property, and the per-object overhead goes way up... probably 
from ~24 bytes to more like 80 or 100).  Or, we can enumerate objects 
starting at the (hobject_t) hash position to find the object.  That's 
somewhat inefficient for FileStore (it'll list a directory of a hundred or 
so objects, probably, and iterate over them to find the right one), but 
for NewStore it will be quite fast (NewStore has all objects sorted into 
keys in rocksdb, so we just start listing at the right offset).  Usually 
we'll get the object right off, unless there are hobject_t hash collisions 
(already reasonably rare since it's a 2^32 space for the pool).

Given that, I would lean toward the 2-hash fixed-sized records (of these 2 
options)...
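
To make the sizes concrete, the fixed-size record could look something like 
this (a sketch under the assumptions above, not actual Ceph code):

#include <cstdint>

// The 24-byte fixed-size record described above: two u32 hashes name the
// object (hobject_t hash + a second hash of the rest of the name), and the
// prev/next links are (slab, position) pairs so whole slabs of records can
// be loaded into memory in big chunks.
struct SlabLink {
  uint32_t slab;   // which slab
  uint32_t pos;    // slot within that slab
};

struct LRURecord {
  uint32_t hobject_hash;   // hobject_t hash (2^32 space per pool)
  uint32_t name_hash;      // different hash of the rest of the object name
  SlabLink prev;           // 2 u32's
  SlabLink next;           // 2 u32's
};                         // 6 x 4 = 24 bytes per object

static_assert(sizeof(LRURecord) == 24, "records stay fixed-size");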

sage



RE: The design of the eviction improvement

2015-07-22 Thread Wang, Zhiqiang
Hi Allen,

 -Original Message-
 From: Allen Samuels [mailto:allen.samu...@sandisk.com]
 Sent: Thursday, July 23, 2015 2:41 AM
 To: Sage Weil; Wang, Zhiqiang
 Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
 Subject: RE: The design of the eviction improvement
 
 I'm very concerned about designing around the assumption that objects are
 ~1MB in size. That's probably a good assumption for block and HDFS dominated
 systems, but likely a very poor assumption about many object and file
 dominated systems.

This is true. If we have lots of small objects/files, the memory used for LRU 
lists could be extremely large.

 
  If I understand the proposals that have been discussed, each of them assumes
  an in-memory data structure with an entry per object (the exact size of the
  entry varies with the different proposals).
 
  Under that assumption, I have another concern, which is the lack of graceful
  degradation as the object counts grow and the in-memory data structures get
  larger. Everything seems fine until just a few more objects get added; then
  the system starts to page and performance drops dramatically, likely to the
  point where Linux will start killing OSDs.
 
  What's really needed is some kind of way to extend the lists into storage in
  a way that doesn't cause a zillion I/O operations.
 
 I have some vague idea that some data structure like the LSM mechanism
 ought to be able to accomplish what we want. Some amount of the data
 structure (the most likely to be used) is held in DRAM [and backed to storage
 for restart] and the least likely to be used is flushed to storage with some
 mechanism that allows batched updates.

The LSM mechanism could solve the memory consumption problem. But I guess the 
process of choosing which objects to evict would be complex and inefficient. 
Also, after evicting some objects, we need to update the on-disk file to 
remove the entries for those objects. This is inefficient, too.

 

RE: The design of the eviction improvement

2015-07-22 Thread Wang, Zhiqiang
 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Thursday, July 23, 2015 2:51 AM
 To: Allen Samuels
 Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org
 Subject: RE: The design of the eviction improvement
 
 
 How about this:
 
  The basic mapping we want is object -> atime.
 
  We keep a simple LRU of the top N objects in memory with the object->atime
 values.  When an object is accessed, it is moved or added to the top of the 
 list.
 
 Periodically, or when the LRU size reaches N * (1.x), we flush:
 
  - write the top N items to a compact object that can be quickly loaded
  - write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb in a
 simple object -> atime fashion
 
 When the agent runs, we just walk across that key range of the db the same
 way we currently enumerate objects.  For each record we use either the
 stored atime or the value in the in-memory LRU (it'll need to be dual-indexed 
 by
 both a list and a hash map), whichever is newer.  We can use the same
 histogram estimation approach we do now to determine if the object in
 question is below the flush/evict threshold.

This looks similar to what we do now, except it keeps an LRU of the 
object->atime mapping in RAM, instead of using a hitset. The object age 
calculated using the atime would be more accurate than the current hitset and 
mtime approach.

One comment is that I think if we can find a record of an object in the 
in-memory LRU list, we don't need to query the DB, since the atime in the LRU 
list is for sure newer than that in the db (if the db has one).

My concern with this approach is whether the evict decision made by the 
histogram estimation approach is good enough. It only measures 'recency', and 
it makes the decision based on some threshold, not starting from the oldest. 
In contrast, most practical algorithms make the decision based on both 
'recency' and 'frequency' (accessed once recently vs. accessed twice or more 
recently).

If we believe the histogram estimation approach is good enough, I think we can 
easily integrate the idea above with 2Q.
1) The in-memory LRU lists are the same as 2Q. i.e., there are active/inactive 
lists, and the movements between the two lists are the same as what I stated in 
the original mail. But we have a limit of the size of the lists. Only the top N 
hottest objects are kept in the lists.
2) When the size of the lists exceeds N*(1.x), evict the oldest items (N .. 
N*1.x) to the db store.
3) N could be dynamically adjusted based on the average size of objects in the 
PG.
4) Eviction decisions are made by the histogram estimation approach. Plus I 
think we want to evict those objects which are not in the in-memory lists 
first.
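
If the histogram approach is good enough, the access path for 1) and 2) above 
might look roughly like the following (a sketch only; the names are 
placeholders, and the active-list aging and the real db writeback are elided):

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Bounded approximate-2Q access path.  A real version would keep list
// iterators in the map instead of calling list::remove().
struct PGCacheLists {
  struct State { bool active = false; bool referenced = false; };

  std::list<std::string> active_list, inactive_list;  // MRU at the front
  std::unordered_map<std::string, State> state;
  std::size_t N = 1000000;  // top-N hottest objects kept in memory
  double slack = 1.2;       // the "1.x" overflow factor

  void on_access(const std::string& oid) {
    auto it = state.find(oid);
    if (it == state.end()) {
      // not on either list: put it on top of the inactive list
      inactive_list.push_front(oid);
      state[oid] = State{};
    } else if (!it->second.active) {
      if (!it->second.referenced) {
        // inactive, referenced flag clear: set it, move to the top
        it->second.referenced = true;
        inactive_list.remove(oid);
        inactive_list.push_front(oid);
      } else {
        // inactive, referenced flag set: promote to the active list
        it->second.referenced = false;
        inactive_list.remove(oid);
        active_list.push_front(oid);
        it->second.active = true;
      }
    }
    if (state.size() > static_cast<std::size_t>(N * slack))
      evict_cold_tail();
  }

  // Placeholder for 2): push the cold tail (items N .. N*1.x) out to the
  // db store; here we only drop them from memory.
  void evict_cold_tail() {
    while (state.size() > N && !inactive_list.empty()) {
      state.erase(inactive_list.back());
      inactive_list.pop_back();
    }
  }
};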

 
RE: The design of the eviction improvement

2015-07-22 Thread Sage Weil
On Wed, 22 Jul 2015, Allen Samuels wrote:
 Don't we need to double-index the data structure?
 
 We need it indexed by atime for the purposes of eviction, but we need it 
 indexed by object name for the purposes of updating the list upon access.

If you use the same approach the agent uses now (iterate over items, 
evict/trim anything in the bottom end of the observed age distribution), you 
can get away without the double-index.  Iterating over the LSM should be quite 
cheap.  I'd be more worried about the cost of the insertions.
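
In other words, something like this single-pass walk (the types and helpers 
here are stand-ins sketched to show the shape, not the actual agent code):

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// The in-memory LRU's view of atimes; stand-in for the dual-indexed list.
std::map<std::string, uint64_t> lru_atime;

// Use the in-memory LRU's atime if it is newer than the stored one.
uint64_t effective_atime(const std::string& oid, uint64_t stored_atime) {
  auto it = lru_atime.find(oid);
  return (it != lru_atime.end() && it->second > stored_atime)
             ? it->second : stored_atime;
}

// Walk the db key range in sorted order, the same way the agent currently
// enumerates objects, and evict anything in the cold tail of the observed
// age distribution.
void agent_walk(const std::map<std::string, uint64_t>& db_range,
                uint64_t now, uint64_t age_cutoff) {
  for (const auto& [oid, stored] : db_range) {
    uint64_t at = effective_atime(oid, stored);
    if (now - at > age_cutoff)
      std::cout << "evict/flush " << oid << "\n";  // placeholder action
  }
}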

I'm also not sure the simplistic approach below can be generalized to 
something like 2Q (and certainly not something like MQ).  Maybe...

On the other hand, I'm not sure it is the end of the world if at the end 
of the day the memory requirements for a cache-tier OSD are higher and 
inversely proportional to the object size.  We can make the OSD 
flush/evict more aggressively if the memory utilization (due to a high 
object count) gets out of hand as a safety mechanism.  Paying a few extra 
$$ for RAM isn't the end of the world I'm guessing when the performance 
payoff is significant...

sage

RE: The design of the eviction improvement

2015-07-22 Thread Allen Samuels
Yes, the cost of the insertions with the current scheme is probably 
prohibitive. Wouldn't it approach the same amount of time as just having atime 
turned on in the file system?

My concern about the memory is mostly about ensuring that whatever algorithm 
is selected degrades gracefully when you get high counts of small objects. I 
agree that paying $ for RAM that translates into actual performance isn't 
really a problem. It really boils down to your workload and access pattern.



Re: The design of the eviction improvement

2015-07-21 Thread Matt W. Benjamin
Thanks for the explanations, Greg.

- Gregory Farnum g...@gregs42.com wrote:

 On Tue, Jul 21, 2015 at 3:15 PM, Matt W. Benjamin m...@cohortfs.com
 wrote:
  Hi,
 
  Couple of points.
 
  1) a successor to 2Q is MQ (Li et al).  We have an intrusive MQ LRU
  implementation with 2 levels currently, plus a pinned queue, that addresses
  stuff like partitioning (sharding), scan resistance, and coordination
  w/lookup tables.  We might extend/re-use it.
 
  2) I'm a bit confused by active/inactive vocabulary, dimensioning of cache
  segments (are you proposing to/do we now always cache whole objects?), and
  cost of looking for dirty objects;  I suspect that it makes sense to
  amortize the cost of locating segments eligible to be flushed, rather than
  minimize bookkeeping.
 
 We make caching decisions in terms of whole objects right now, yeah.
 There's really nothing in the system that's capable of doing segments
 within an object, and it's not just about tracking a little more
 metadata about dirty objects — the way we handle snapshots, etc would
 have to be reworked if we were allowing partial-object caching. Plus
 keep in mind the IO cost of the bookkeeping — it needs to be either
 consistently persisted to disk or reconstructable from whatever
 happens to be in the object. That can get expensive really fast.
 -Greg

For current semantics/structure of PGs + specific tier held fixed, makes
sense.  For our object addressing currently, we have a greater requirement
for partial object caching.  (Partly, we did this to achieve periodicity
w/sequential I/O.)  I think broadly, there are large performance
tradeoffs here.  In AFS and DCE, there is full consistency in materialized
caches.  Also, caches are dimensioned by chunks.  If the cache is materialized
in memory, the semantics aren't those of disk.  Basically, consistency
guarantees are policy.  Different snapshot mechanisms, or omtting them, e.g.,
should logically enable relaxed consistency, modulo policy.

Matt


RE: The design of the eviction improvement

2015-07-21 Thread Wang, Zhiqiang

 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Tuesday, July 21, 2015 9:29 PM
 To: Wang, Zhiqiang
 Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
 Subject: RE: The design of the eviction improvement
 
 On Tue, 21 Jul 2015, Wang, Zhiqiang wrote:
   -Original Message-
   From: ceph-devel-ow...@vger.kernel.org
   [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
   Sent: Tuesday, July 21, 2015 6:38 AM
   To: Wang, Zhiqiang
   Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
   Subject: Re: The design of the eviction improvement
  
   On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
Hi all,
   
    This is a follow-up of one of the CDS sessions at
    http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction.
    We discussed the drawbacks of the current eviction algorithm and several
    ways to improve it. Seems like the LRU variants are the right way to go.
    I came up with some design points after the CDS, and want to discuss them
    with you. It is an approximate 2Q algorithm, combining some benefits of
    the clock algorithm, similar to what the Linux kernel does for the page
    cache.
  
   Unfortunately I missed this last CDS so I'm behind on the
   discussion.  I have a few questions though...
  
# Design points:
   
## LRU lists
- Maintain LRU lists at the PG level.
 The SharedLRU and SimpleLRU implementations in the current code
 have a max_size, which limits the max number of elements in the
 list. This mostly looks like an MRU, though the name implies they
 are LRUs. Since the object size may vary in a PG, it's not
 possible to calculate the total number of objects which the cache tier can
 hold ahead of time.
 We need a new LRU implementation with no limit on the size.
  
   This last sentence seems to me to be the crux of it.  Assuming we
   have an OSD backed by flash storing O(n) objects, we need a way to
   maintain an LRU of O(n) objects in memory.  The current hitset-based
   approach was taken based on the assumption that this wasn't
   feasible--or at least we didn't know how to implement such a thing.
   If it is, or we simply want to stipulate that cache tier OSDs get
   gobs of RAM to make it possible, then lots of better options become
   possible...
  
   Let's say you have a 1TB SSD, with an average object size of 1MB --
   that's 1 million objects.  At maybe ~100 bytes per object of RAM for
   an LRU entry, that's 100MB... so not so unreasonable, perhaps!
 
  I was having the same question before proposing this. I did a
  similar calculation and thought it would be OK to use this much
  memory :-)
 
 The part that worries me now is the speed with which we can load and manage
 such a list.  Assuming it is several hundred MB, it'll take a while to load 
 that
 into memory and set up all the pointers (assuming a conventional linked list
 structure).  Maybe tens of seconds...

I'm thinking of maintaining the lists at the PG level. That's to say, we have 
an active/inactive list for every PG. We can load the lists in parallel during 
rebooting. Also, the ~100 MB lists are split among different OSD nodes. Perhaps 
it does not need such a long time to load them?

 
 I wonder if instead we should construct some sort of flat model where we load
 slabs of contiguous memory, 10's of MB each, and have the next/previous
 pointers be a (slab,position) pair.  That way we can load it into memory in 
 big
 chunks, quickly, and be able to operate on it (adjust
 links) immediately.
 
 Another thought: currently we use the hobject_t hash only instead of the full
 object name.  We could continue to do the same, or we could do a hash pair
 (hobject_t hash + a different hash of the rest of the object) to keep the
  representation compact.  With a model like the above, that could get the
 object representation down to 2 u32's.  A link could be a slab + position (2
 more u32's), and if we have prev + next that'd be just 6x4=24 bytes per 
 object.

Looks like for an object, the head and the snapshot version have the same 
hobject hash. Thus we have to use the hash pair instead of just the hobject 
hash. But I still have two questions if we use the hash pair to represent an 
object.
1) Does the hash pair uniquely identify an object? That's to say, is it 
possible for two objects to have the same hash pair?
2) We need a way to get the full object name from the hash pair, so that we 
know what objects to evict. But seems like we don't have a good way to do this?

 
  With fixed-sized slots on the slabs, the slab allocator could be very
  simple... maybe just a bitmap, a free counter, and any other trivial
  optimizations to make finding a slab's next free slot nice and quick.
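
For instance, a minimal version of that bitmap allocator might look like the 
following (sketch only; the slot count is made up):

#include <cstdint>
#include <vector>

// Fixed-size slots, a bitmap, and a free counter.  __builtin_ctzll is the
// GCC/Clang count-trailing-zeros builtin.
struct Slab {
  static constexpr uint32_t SLOTS = 1u << 20;  // ~24 MB of 24-byte records

  std::vector<uint64_t> bitmap;  // one bit per slot, 1 = in use
  uint32_t free_count;

  Slab() : bitmap(SLOTS / 64, 0), free_count(SLOTS) {}

  // Claim the next free slot; returns SLOTS if the slab is full.
  uint32_t alloc() {
    if (free_count == 0)
      return SLOTS;
    for (uint32_t w = 0; w < bitmap.size(); ++w) {
      if (bitmap[w] != ~0ull) {                    // this word has a free bit
        uint32_t b = __builtin_ctzll(~bitmap[w]);  // lowest clear bit
        bitmap[w] |= 1ull << b;
        --free_count;
        return w * 64 + b;
      }
    }
    return SLOTS;  // not reached if free_count is accurate
  }

  void free_slot(uint32_t pos) {
    bitmap[pos / 64] &= ~(1ull << (pos % 64));
    ++free_count;
  }
};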
 
- Two lists for each PG: active and inactive Objects are first put
into the inactive list when they are accessed, and moved between
these two
   lists based on some criteria.
Object flag: active, referenced

Re: The design of the eviction improvement

2015-07-21 Thread Matt W. Benjamin
Hi,

- Zhiqiang Wang zhiqiang.w...@intel.com wrote:

 Hi Matt,
 
  -Original Message-
  From: Matt W. Benjamin [mailto:m...@cohortfs.com]
  Sent: Tuesday, July 21, 2015 10:16 PM
  To: Wang, Zhiqiang
  Cc: sj...@redhat.com; ceph-devel@vger.kernel.org; Sage Weil
  Subject: Re: The design of the eviction improvement
  
  Hi,
  
  Couple of points.
  
  1) a successor to 2Q is MQ (Li et al).  We have an intrusive MQ LRU
  implementation with 2 levels currently, plus a pinned queue, that addresses
  stuff like partitioning (sharding), scan resistance, and coordination
  w/lookup tables.  We might extend/re-use it.
 
 The MQ algorithm is more complex, and seems like it has more overhead
 than 2Q. The approximate 2Q algorithm here combines some benefits of
 the clock algorithm, and works well in the Linux kernel. MQ could be
 another choice. There are some other candidates like LIRS, ARC, etc.,
 which have been deployed in some practical systems.

MQ has been deployed in practical systems, and is more general.

 
  
  2) I'm a bit confused by active/inactive vocabulary, dimensioning of cache
  segments (are you proposing to/do we now always cache whole objects?), and
  cost of looking for dirty objects;  I suspect that it makes sense to
  amortize the cost of locating segments eligible to be flushed, rather than
  minimize bookkeeping.
 
  Though currently the caching decisions are made at the granularity of whole
  objects, as Greg pointed out in another mail, I think we still have
  something to improve here. I'll come back to this some time later.
 
  
  Matt
  
  - Zhiqiang Wang zhiqiang.w...@intel.com wrote:
  
-Original Message-
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage
 Weil
Sent: Tuesday, July 21, 2015 6:38 AM
To: Wang, Zhiqiang
Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: Re: The design of the eviction improvement
   
On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
 Hi all,

  This is a follow-up of one of the CDS sessions at
  http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction.
  We discussed the drawbacks of the current eviction algorithm and several
  ways to improve it. Seems like the LRU variants are the right way to go. I
  came up with some design points after the CDS, and want to discuss them
  with you. It is an approximate 2Q algorithm, combining some benefits of the
  clock algorithm, similar to what the Linux kernel does for the page cache.
   
 Unfortunately I missed this last CDS so I'm behind on the discussion.  I
 have a few questions though...
   
  # Design points:
 
  ## LRU lists
  - Maintain LRU lists at the PG level.
  The SharedLRU and SimpleLRU implementations in the current code have a
  max_size, which limits the max number of elements in the list. This mostly
  looks like an MRU, though the name implies they are LRUs. Since the object
  size may vary in a PG, it's not possible to calculate the total number of
  objects which the cache tier can hold ahead of time. We need a new LRU
  implementation with no limit on the size.
   
 This last sentence seems to me to be the crux of it.  Assuming we have an
 OSD backed by flash storing O(n) objects, we need a way to maintain an LRU
 of O(n) objects in memory.  The current hitset-based approach was taken
 based on the assumption that this wasn't feasible--or at least we didn't
 know how to implement such a thing.  If it is, or we simply want to
 stipulate that cache tier OSDs get gobs of RAM to make it possible, then
 lots of better options become possible...
   
Let's say you have a 1TB SSD, with an average object size of 1MB
 --
   that's
1 million objects.  At maybe ~100bytes per object of RAM for an
 LRU
   entry
that's 100MB... so not so unreasonable, perhaps!
  
   I was having the same question before proposing this. I did the
   similar calculation and thought it would be ok to use this many
 memory
   :-)
  
   
 - Two lists for each PG: active and inactive Objects are
 first
   put
 into the inactive list when they are accessed, and moved
 between
   these two
lists based on some criteria.
 Object flag: active, referenced, unevictable, dirty.
 - When an object is accessed:
 1) If it's not in both of the lists, it's put on the top of
 the
 inactive list
 2) If it's in the inactive list, and the referenced flag is
 not
   set, the referenced
flag is set, and it's moved to the top of the inactive list.
 3) If it's in the inactive list, and the referenced flag is
 set,
   the referenced flag
is cleared, and it's removed from the inactive list, and put on
 top
   of the active
list.
 4) If it's in the active list, and the referenced flag is not
 set,
   the referenced
flag

RE: The design of the eviction improvement

2015-07-21 Thread Wang, Zhiqiang
Hi Matt,

 -Original Message-
 From: Matt W. Benjamin [mailto:m...@cohortfs.com]
 Sent: Tuesday, July 21, 2015 10:16 PM
 To: Wang, Zhiqiang
 Cc: sj...@redhat.com; ceph-devel@vger.kernel.org; Sage Weil
 Subject: Re: The design of the eviction improvement
 
 Hi,
 
 Couple of points.
 
 1) a successor to 2Q is MQ (Li et al).  We have an intrusive MQ LRU
 implementation with 2 levels currently, plus a pinned queue, that addresses
 stuff like partitioning (sharding), scan resistance, and coordination
 w/lookup tables.  We might extend/re-use it.

The MQ algorithm is more complex, and seems like it has more overhead than 2Q. 
The approximate 2Q algorithm here combines some benefits of the clock 
algorithm, and works well in the Linux kernel. MQ could be another choice. 
There are some other candidates like LIRS, ARC, etc., which have been deployed 
in some practical systems.

 
 2) I'm a bit confused by active/inactive vocabulary, dimensioning of cache
 segments (are you proposing to/do we now always cache whole objects?), and
 cost of looking for dirty objects;  I suspect that it makes sense to
 amortize the cost of locating segments eligible to be flushed, rather than
 minimize bookkeeping.

Though currently the caching decisions are made in units of whole objects, as Greg 
pointed out in another mail, I think we still have something to improve here. 
I'll come back to this some time later.

 
 Matt
 
 - Zhiqiang Wang zhiqiang.w...@intel.com wrote:
 
   -Original Message-
   From: ceph-devel-ow...@vger.kernel.org
   [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
   Sent: Tuesday, July 21, 2015 6:38 AM
   To: Wang, Zhiqiang
   Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
   Subject: Re: The design of the eviction improvement
  
   On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
Hi all,
   
    This is a follow-up of one of the CDS session at
    http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction.
    We discussed the drawbacks of the current eviction algorithm and several
    ways to improve it. Seems like the LRU variants are the right way to go.
    I come up with some design points after the CDS, and want to discuss it
    with you. It is an approximate 2Q algorithm, combining some benefits of
    the clock algorithm, similar to what the linux kernel does for the page
    cache.
   
   Unfortunately I missed this last CDS so I'm behind on the discussion.  I
   have a few questions though...
   
    # Design points:
   
    ## LRU lists
    - Maintain LRU lists at the PG level.
    The SharedLRU and SimpleLRU implementation in the current code have a
    max_size, which limits the max number of elements in the list. This
    mostly looks like a MRU, though its name implies they are LRUs. Since
    the object size may vary in a PG, it's not possible to calculate the
    total number of objects which the cache tier can hold ahead of time.
    We need a new LRU implementation with no limit on the size.
   
   This last sentence seems to me to be the crux of it.  Assuming we have an
   OSD backed by flash storing O(n) objects, we need a way to maintain an
   LRU of O(n) objects in memory.  The current hitset-based approach was
   taken based on the assumption that this wasn't feasible--or at least we
   didn't know how to implement such a thing.  If it is, or we simply want
   to stipulate that cache tier OSDs get gobs of RAM to make it possible,
   then lots of better options become possible...
   
   Let's say you have a 1TB SSD, with an average object size of 1MB --
   that's 1 million objects.  At maybe ~100bytes per object of RAM for an
   LRU entry that's 100MB... so not so unreasonable, perhaps!
  
  I was having the same question before proposing this. I did a similar
  calculation and thought it would be ok to use this much memory :-)
  
   
    - Two lists for each PG: active and inactive
    Objects are first put into the inactive list when they are accessed,
    and moved between these two lists based on some criteria.
    Object flag: active, referenced, unevictable, dirty.
    - When an object is accessed:
    1) If it's not in either of the lists, it's put on the top of the
    inactive list
    2) If it's in the inactive list, and the referenced flag is not set,
    the referenced flag is set, and it's moved to the top of the inactive
    list.
    3) If it's in the inactive list, and the referenced flag is set, the
    referenced flag is cleared, and it's removed from the inactive list,
    and put on top of the active list.
    4) If it's in the active list, and the referenced flag is not set, the
    referenced flag is set, and it's moved to the top of the active list.
    5) If it's in the active list, and the referenced flag is set, it's
    moved to the top of the active list.
    - When selecting objects to evict:
    1) Objects at the bottom of the inactive list are selected to evict

RE: The design of the eviction improvement

2015-07-21 Thread Sage Weil
On Tue, 21 Jul 2015, Wang, Zhiqiang wrote:
  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
  Sent: Tuesday, July 21, 2015 6:38 AM
  To: Wang, Zhiqiang
  Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
  Subject: Re: The design of the eviction improvement
  
  On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
   Hi all,
  
   This is a follow-up of one of the CDS session at
  http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tieri
  ng_eviction. We discussed the drawbacks of the current eviction algorithm 
  and
  several ways to improve it. Seems like the LRU variants are the right way to 
  go. I
  come up with some design points after the CDS, and want to discuss it with 
  you.
  It is an approximate 2Q algorithm, combining some benefits of the clock
  algorithm, similar to what the linux kernel does for the page cache.
  
  Unfortunately I missed this last CDS so I'm behind on the discussion.  I 
  have a
  few questions though...
  
   # Design points:
  
   ## LRU lists
   - Maintain LRU lists at the PG level.
   The SharedLRU and SimpleLRU implementation in the current code have a
   max_size, which limits the max number of elements in the list. This
   mostly looks like a MRU, though its name implies they are LRUs. Since
  the object size may vary in a PG, it's not possible to calculate the
   total number of objects which the cache tier can hold ahead of time.
   We need a new LRU implementation with no limit on the size.
  
  This last sentence seems to me to be the crux of it.  Assuming we have an
  OSD backed by flash storing O(n) objects, we need a way to maintain an LRU of
  O(n) objects in memory.  The current hitset-based approach was taken based
  on the assumption that this wasn't feasible--or at least we didn't know how
  to implement such a thing.  If it is, or we simply want to stipulate that
  cache
  tier OSDs get gobs of RAM to make it possible, then lots of better options
  become possible...
  
  Let's say you have a 1TB SSD, with an average object size of 1MB -- that's
  1 million objects.  At maybe ~100bytes per object of RAM for an LRU entry
  that's 100MB... so not so unreasonable, perhaps!
 
 I was having the same question before proposing this. I did a similar 
 calculation and thought it would be ok to use this much memory :-)

The part that worries me now is the speed with which we can load and 
manage such a list.  Assuming it is several hundred MB, it'll take a while 
to load that into memory and set up all the pointers (assuming 
a conventional linked list structure).  Maybe tens of seconds...

I wonder if instead we should construct some sort of flat model where we 
load slabs of contiguous memory, 10's of MB each, and have the 
next/previous pointers be a (slab,position) pair.  That way we can load it 
into memory in big chunks, quickly, and be able to operate on it (adjust 
links) immediately.

Another thought: currently we use the hobject_t hash only instead of the 
full object name.  We could continue to do the same, or we could do a hash 
pair (hobject_t hash + a different hash of the rest of the object) to keep 
the representation compact.  With a model like the above, that could get 
the object representation down to 2 u32's.  A link could be a slab + 
position (2 more u32's), and if we have prev + next that'd be just 6x4=24 
bytes per object.

With fixed-sized slots on the slabs, the slab allocator could be very 
simple... maybe just a bitmap, a free counter, and any other trivial 
optimizations to make finding a slab's next free slot nice and quick.
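
To make this concrete, here is a rough C++ sketch of the 24-byte node and a
bitmap-based slab allocator along these lines. It is illustrative only; the
names are hypothetical and this is not Ceph code:

#include <cstdint>
#include <vector>

// One list entry: two u32 hashes identify the object (hobject_t hash plus
// a hash of the rest of the name), and prev/next links are encoded as
// (slab, position) pairs instead of pointers -- 6 x 4 = 24 bytes.
struct Node {
  uint32_t hobject_hash;
  uint32_t name_hash;
  uint32_t prev_slab, prev_pos;
  uint32_t next_slab, next_pos;
};

// A slab: a contiguous array of fixed-size slots plus a bitmap allocator
// with a free counter and a scan hint, as suggested above.
struct Slab {
  std::vector<Node> slots;
  std::vector<bool> used;      // allocation bitmap
  uint32_t free_count;
  uint32_t scan_hint;          // where to start looking for a free slot

  explicit Slab(uint32_t n)
    : slots(n), used(n, false), free_count(n), scan_hint(0) {}

  // Find the next free slot, scanning the bitmap from the hint.
  bool alloc(uint32_t* pos) {
    if (free_count == 0)
      return false;
    for (uint32_t i = 0; i < used.size(); ++i) {
      uint32_t p = (scan_hint + i) % used.size();
      if (!used[p]) {
        used[p] = true;
        --free_count;
        scan_hint = (p + 1) % used.size();
        *pos = p;
        return true;
      }
    }
    return false;  // not reached while free_count is accurate
  }

  void release(uint32_t pos) {
    used[pos] = false;
    ++free_count;
    scan_hint = pos;  // trivially reuse freed slots first
  }
};

Because the links are (slab, position) pairs rather than raw pointers, a slab
can be written out and read back as one contiguous chunk with no per-node
pointer fixup, which is what makes a fast load plausible.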

   - Two lists for each PG: active and inactive
   Objects are first put into the inactive list when they are accessed,
   and moved between these two lists based on some criteria.
   Object flag: active, referenced, unevictable, dirty.
   - When an object is accessed:
   1) If it's not in either of the lists, it's put on the top of the
   inactive list
   2) If it's in the inactive list, and the referenced flag is not set,
   the referenced flag is set, and it's moved to the top of the inactive
   list.
   3) If it's in the inactive list, and the referenced flag is set, the
   referenced flag is cleared, and it's removed from the inactive list,
   and put on top of the active list.
   4) If it's in the active list, and the referenced flag is not set, the
   referenced flag is set, and it's moved to the top of the active list.
   5) If it's in the active list, and the referenced flag is set, it's
   moved to the top of the active list.
   - When selecting objects to evict:
   1) Objects at the bottom of the inactive list are selected to evict.
   They are removed from the inactive list.
   2) If the number of the objects in the inactive list becomes low, some
   of the objects at the bottom of the active list are moved to the
   inactive list. For those objects which have the referenced flag

Re: The design of the eviction improvement

2015-07-21 Thread Matt W. Benjamin
Hi,

Couple of points.

1) a successor to 2Q is MQ (Li et al).  We have an intrusive MQ LRU 
implementation
with 2 levels currently, plus a pinned queue, that addresses stuff like 
partitioning (sharding), scan resistance, and coordination w/lookup tables.  
We might extend/re-use it.

2) I'm a bit confused by active/inactive vocabulary, dimensioning of cache
segments (are you proposing to/do we now always cache whole objects?), and cost
of looking for dirty objects;  I suspect that it makes sense to amortize the
cost of locating segments eligible to be flushed, rather than minimize
bookkeeping.

Matt

- Zhiqiang Wang zhiqiang.w...@intel.com wrote:

  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
  Sent: Tuesday, July 21, 2015 6:38 AM
  To: Wang, Zhiqiang
  Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
  Subject: Re: The design of the eviction improvement
  
  On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
   Hi all,
  
   This is a follow-up of one of the CDS session at
   http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction.
   We discussed the drawbacks of the current eviction algorithm and several
   ways to improve it. Seems like the LRU variants are the right way to go.
   I come up with some design points after the CDS, and want to discuss it
   with you. It is an approximate 2Q algorithm, combining some benefits of
   the clock algorithm, similar to what the linux kernel does for the page
   cache.
  
  Unfortunately I missed this last CDS so I'm behind on the discussion.  I
  have a few questions though...
  
   # Design points:
  
   ## LRU lists
   - Maintain LRU lists at the PG level.
   The SharedLRU and SimpleLRU implementation in the current code have a
   max_size, which limits the max number of elements in the list. This
   mostly looks like a MRU, though its name implies they are LRUs. Since
   the object size may vary in a PG, it's not possible to calculate the
   total number of objects which the cache tier can hold ahead of time.
   We need a new LRU implementation with no limit on the size.
  
  This last sentence seems to me to be the crux of it.  Assuming we have an
  OSD backed by flash storing O(n) objects, we need a way to maintain an LRU
  of O(n) objects in memory.  The current hitset-based approach was taken
  based on the assumption that this wasn't feasible--or at least we didn't
  know how to implement such a thing.  If it is, or we simply want to
  stipulate that cache tier OSDs get gobs of RAM to make it possible, then
  lots of better options become possible...
  
  Let's say you have a 1TB SSD, with an average object size of 1MB -- that's
  1 million objects.  At maybe ~100bytes per object of RAM for an LRU entry
  that's 100MB... so not so unreasonable, perhaps!
 
 I was having the same question before proposing this. I did a similar
 calculation and thought it would be ok to use this much memory :-)
 
  
   - Two lists for each PG: active and inactive
   Objects are first put into the inactive list when they are accessed,
   and moved between these two lists based on some criteria.
   Object flag: active, referenced, unevictable, dirty.
   - When an object is accessed:
   1) If it's not in either of the lists, it's put on the top of the
   inactive list
   2) If it's in the inactive list, and the referenced flag is not set,
   the referenced flag is set, and it's moved to the top of the inactive
   list.
   3) If it's in the inactive list, and the referenced flag is set, the
   referenced flag is cleared, and it's removed from the inactive list,
   and put on top of the active list.
   4) If it's in the active list, and the referenced flag is not set, the
   referenced flag is set, and it's moved to the top of the active list.
   5) If it's in the active list, and the referenced flag is set, it's
   moved to the top of the active list.
   - When selecting objects to evict:
   1) Objects at the bottom of the inactive list are selected to evict.
   They are removed from the inactive list.
   2) If the number of the objects in the inactive list becomes low, some
   of the objects at the bottom of the active list are moved to the
   inactive list. For those objects which have the referenced flag set,
   they are given one more chance in the active list. They are moved to
   the top of the active list with the referenced flag cleared. For those
   objects which don't have the referenced flag set, they are moved to
   the inactive list, with the referenced flag set, so that they can be
   quickly promoted to the active list when necessary.
  
   ## Combine flush with eviction
   - When evicting an object, if it's dirty, it's flushed first. After
   flushing, it's evicted. If not dirty, it's evicted directly.
   - This means that we won't have separate activities and won't set
   different ratios for flush and evict. Is there a need to do so

Re: The design of the eviction improvement

2015-07-21 Thread Gregory Farnum
On Tue, Jul 21, 2015 at 3:15 PM, Matt W. Benjamin m...@cohortfs.com wrote:
 Hi,

 Couple of points.

 1) a successor to 2Q is MQ (Li et al).  We have an intrusive MQ LRU 
 implementation
 with 2 levels currently, plus a pinned queue, that addresses stuff like 
 partitioning (sharding), scan resistance, and coordination w/lookup tables. 
  We might extend/re-use it.

 2) I'm a bit confused by active/inactive vocabulary, dimensioning of cache
 segments (are you proposing to/do we now always cache whole objects?), and 
 cost
 of looking for dirty objects;  I suspect that it makes sense to amortize the
 cost of locating segments eligible to be flushed, rather than minimize
 bookkeeping.

We make caching decisions in terms of whole objects right now, yeah.
There's really nothing in the system that's capable of doing segments
within an object, and it's not just about tracking a little more
metadata about dirty objects — the way we handle snapshots, etc would
have to be reworked if we were allowing partial-object caching. Plus
keep in mind the IO cost of the bookkeeping — it needs to be either
consistently persisted to disk or reconstructable from whatever
happens to be in the object. That can get expensive really fast.
-Greg


 Matt

 - Zhiqiang Wang zhiqiang.w...@intel.com wrote:

  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
  Sent: Tuesday, July 21, 2015 6:38 AM
  To: Wang, Zhiqiang
  Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
  Subject: Re: The design of the eviction improvement
 
  On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
   Hi all,
  
   This is a follow-up of one of the CDS session at
   http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction.
   We discussed the drawbacks of the current eviction algorithm and several
   ways to improve it. Seems like the LRU variants are the right way to go.
   I come up with some design points after the CDS, and want to discuss it
   with you. It is an approximate 2Q algorithm, combining some benefits of
   the clock algorithm, similar to what the linux kernel does for the page
   cache.
  
  Unfortunately I missed this last CDS so I'm behind on the discussion.  I
  have a few questions though...
  
   # Design points:
  
   ## LRU lists
   - Maintain LRU lists at the PG level.
   The SharedLRU and SimpleLRU implementation in the current code have a
   max_size, which limits the max number of elements in the list. This
   mostly looks like a MRU, though its name implies they are LRUs. Since
   the object size may vary in a PG, it's not possible to calculate the
   total number of objects which the cache tier can hold ahead of time.
   We need a new LRU implementation with no limit on the size.
  
  This last sentence seems to me to be the crux of it.  Assuming we have an
  OSD backed by flash storing O(n) objects, we need a way to maintain an LRU
  of O(n) objects in memory.  The current hitset-based approach was taken
  based on the assumption that this wasn't feasible--or at least we didn't
  know how to implement such a thing.  If it is, or we simply want to
  stipulate that cache tier OSDs get gobs of RAM to make it possible, then
  lots of better options become possible...
  
  Let's say you have a 1TB SSD, with an average object size of 1MB -- that's
  1 million objects.  At maybe ~100bytes per object of RAM for an LRU entry
  that's 100MB... so not so unreasonable, perhaps!
 
 I was having the same question before proposing this. I did a similar
 calculation and thought it would be ok to use this much memory :-)
 
  
   - Two lists for each PG: active and inactive
   Objects are first put into the inactive list when they are accessed,
   and moved between these two lists based on some criteria.
   Object flag: active, referenced, unevictable, dirty.
   - When an object is accessed:
   1) If it's not in either of the lists, it's put on the top of the
   inactive list
   2) If it's in the inactive list, and the referenced flag is not set,
   the referenced flag is set, and it's moved to the top of the inactive
   list.
   3) If it's in the inactive list, and the referenced flag is set, the
   referenced flag is cleared, and it's removed from the inactive list,
   and put on top of the active list.
   4) If it's in the active list, and the referenced flag is not set, the
   referenced flag is set, and it's moved to the top of the active list.
   5) If it's in the active list, and the referenced flag is set, it's
   moved to the top of the active list.
   - When selecting objects to evict:
   1) Objects at the bottom of the inactive list are selected to evict.
   They are removed from the inactive list.
   2) If the number of the objects in the inactive list becomes low, some
   of the objects at the bottom of the active list are moved to the
   inactive list. For those objects which have the referenced flag set,
   they are given one

Re: The design of the eviction improvement

2015-07-20 Thread Sage Weil
On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
 Hi all,
 
 This is a follow-up of one of the CDS session at 
 http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction.
  We discussed the drawbacks of the current eviction algorithm and several 
 ways to improve it. Seems like the LRU variants are the right way to go. I 
 come up with some design points after the CDS, and want to discuss it with 
 you. It is an approximate 2Q algorithm, combining some benefits of the clock 
 algorithm, similar to what the linux kernel does for the page cache.

Unfortunately I missed this last CDS so I'm behind on the discussion.  I 
have a few questions though...
 
 # Design points:
 
 ## LRU lists
 - Maintain LRU lists at the PG level.
 The SharedLRU and SimpleLRU implementation in the current code have a 
 max_size, which limits the max number of elements in the list. This 
 mostly looks like a MRU, though its name implies they are LRUs. Since 
 the object size may vary in a PG, it's not possible to calculate the 
 total number of objects which the cache tier can hold ahead of time. We 
 need a new LRU implementation with no limit on the size.

This last sentence seems to me to be the crux of it.  Assuming we have an 
OSD backed by flash storing O(n) objects, we need a way to maintain an LRU 
of O(n) objects in memory.  The current hitset-based approach was taken 
based on the assumption that this wasn't feasible--or at least we 
didn't know how to implement such a thing.  If it is, or we simply want 
to stipulate that cache tier OSDs get gobs of RAM to make it possible, 
then lots of better options become possible...

Let's say you have a 1TB SSD, with an average object size of 1MB -- that's 
1 million objects.  At maybe ~100bytes per object of RAM for an LRU entry 
that's 100MB... so not so unreasonable, perhaps!

 - Two lists for each PG: active and inactive
 Objects are first put into the inactive list when they are accessed, and 
 moved between these two lists based on some criteria.
 Object flag: active, referenced, unevictable, dirty.
 - When an object is accessed:
 1) If it's not in either of the lists, it's put on the top of the inactive list
 2) If it's in the inactive list, and the referenced flag is not set, the 
 referenced flag is set, and it's moved to the top of the inactive list.
 3) If it's in the inactive list, and the referenced flag is set, the 
 referenced flag is cleared, and it's removed from the inactive list, and put 
 on top of the active list.
 4) If it's in the active list, and the referenced flag is not set, the 
 referenced flag is set, and it's moved to the top of the active list.
 5) If it's in the active list, and the referenced flag is set, it's moved to 
 the top of the active list.
 - When selecting objects to evict:
 1) Objects at the bottom of the inactive list are selected to evict. They are 
 removed from the inactive list.
 2) If the number of the objects in the inactive list becomes low, some of the 
 objects at the bottom of the active list are moved to the inactive list. For 
 those objects which have the referenced flag set, they are given one more 
 chance in the active list. They are moved to the top of the active list with 
 the referenced flag cleared. For those objects which don't have the 
 referenced flag set, they are moved to the inactive list, with the referenced 
 flag set, so that they can be quickly promoted to the active list when 
 necessary.
 
 ## Combine flush with eviction
 - When evicting an object, if it's dirty, it's flushed first. After flushing, 
 it's evicted. If not dirty, it's evicted directly.
 - This means that we won't have separate activities and won't set different 
 ratios for flush and evict. Is there a need to do so?
 - Number of objects to evict at a time. 'evict_effort' acts as the priority, 
 which is used to calculate the number of objects to evict.

As someone else mentioned in a follow-up, the reason we let the dirty 
level be set lower than the full level is that it provides headroom so 
that objects can be quickly evicted (delete, no flush) to make room for 
new writes or new promotions.

That said, we probably can/should streamline the flush so that an evict 
can immediately follow without waiting for the agent to come around again.  
(I don't think we do that now?)

sage

 
 ## LRU lists Snapshotting
 - The two lists are snapshotted and persisted periodically.
 - Only one copy needs to be saved. The old copy is removed when persisting 
 the lists. The saved lists are used to restore the LRU lists when OSD reboots.
 
 Any comments/feedback are welcome.

RE: The design of the eviction improvement

2015-07-20 Thread Allen Samuels
This seems much better than the current mechanism. Do you have an estimate of 
the memory consumption of the two lists? (In terms of bytes/object?)


Allen Samuels
Software Architect, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang
Sent: Monday, July 20, 2015 1:47 AM
To: Sage Weil; sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: The design of the eviction improvement

Hi all,

This is a follow-up of one of the CDS session at 
http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction.
 We discussed the drawbacks of the current eviction algorithm and several ways 
to improve it. Seems like the LRU variants are the right way to go. I come up 
with some design points after the CDS, and want to discuss it with you. It is 
an approximate 2Q algorithm, combining some benefits of the clock algorithm, 
similar to what the linux kernel does for the page cache.

# Design points:

## LRU lists
- Maintain LRU lists at the PG level.
The SharedLRU and SimpleLRU implementation in the current code have a max_size, 
which limits the max number of elements in the list. This mostly looks like a 
MRU, though its name implies they are LRUs. Since the object size may vary in a 
PG, it's not possible to calculate the total number of objects which the cache 
tier can hold ahead of time. We need a new LRU implementation with no limit on 
the size.
- Two lists for each PG: active and inactive
Objects are first put into the 
inactive list when they are accessed, and moved between these two lists based 
on some criteria.
Object flag: active, referenced, unevictable, dirty.
- When an object is accessed:
1) If it's not in either of the lists, it's put on the top of the inactive list
2) If it's in the inactive list, and the referenced flag is not set, the 
referenced flag is set, and it's moved to the top of the inactive list.
3) If it's in the inactive list, and the referenced flag is set, the referenced 
flag is cleared, and it's removed from the inactive list, and put on top of the 
active list.
4) If it's in the active list, and the referenced flag is not set, the 
referenced flag is set, and it's moved to the top of the active list.
5) If it's in the active list, and the referenced flag is set, it's moved to 
the top of the active list.
- When selecting objects to evict:
1) Objects at the bottom of the inactive list are selected to evict. They are 
removed from the inactive list.
2) If the number of the objects in the inactive list becomes low, some of the 
objects at the bottom of the active list are moved to the inactive list. For 
those objects which have the referenced flag set, they are given one more 
chance in the active list. They are moved to the top of the active list with 
the referenced flag cleared. For those objects which don't have the referenced 
flag set, they are moved to the inactive list, with the referenced flag set, so 
that they can be quickly promoted to the active list when necessary.
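
The access rules above amount to a small per-PG state machine. A minimal C++
sketch follows; it is illustrative only, the names are hypothetical, and a
real implementation would want the compact slab/hash representation discussed
elsewhere in this thread rather than std::list nodes:

#include <cstdint>
#include <list>
#include <unordered_map>

// ObjId stands in for whatever compact key is used per object.
using ObjId = uint64_t;

struct TwoQ {
  struct Obj { ObjId id; bool referenced = false; };
  std::list<Obj> active, inactive;
  enum class Where { ACTIVE, INACTIVE };
  struct Loc { Where where; std::list<Obj>::iterator it; };
  std::unordered_map<ObjId, Loc> index;

  void on_access(ObjId id) {
    auto f = index.find(id);
    if (f == index.end()) {                     // rule 1: unknown object
      inactive.push_front(Obj{id});
      index[id] = Loc{Where::INACTIVE, inactive.begin()};
    } else if (f->second.where == Where::INACTIVE) {
      Obj o = *f->second.it;
      inactive.erase(f->second.it);
      if (!o.referenced) {                      // rule 2: set flag, move up
        o.referenced = true;
        inactive.push_front(o);
        index[id] = Loc{Where::INACTIVE, inactive.begin()};
      } else {                                  // rule 3: promote to active
        o.referenced = false;
        active.push_front(o);
        index[id] = Loc{Where::ACTIVE, active.begin()};
      }
    } else {                                    // rules 4 and 5
      Obj o = *f->second.it;
      active.erase(f->second.it);
      o.referenced = true;                      // rule 4 sets the flag;
      active.push_front(o);                     // rule 5 keeps it set
      index[id] = Loc{Where::ACTIVE, active.begin()};
    }
  }
};

The eviction side would pull from the tail of inactive and refill it from the
tail of active, per the rules above.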

## Combine flush with eviction
- When evicting an object, if it's dirty, it's flushed first. After flushing, 
it's evicted. If not dirty, it's evicted directly.
- This means that we won't have separate activities and won't set different 
ratios for flush and evict. Is there a need to do so?
- Number of objects to evict at a time. 'evict_effort' acts as the priority, 
which is used to calculate the number of objects to evict.
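
A sketch of what such a combined flush-and-evict pass could look like,
reusing the TwoQ sketch above. The three helpers are stubs standing in for
the real dirty check and flush/evict machinery, and the linear mapping from
'evict_effort' to a batch size is made up purely for illustration:

#include <algorithm>

bool is_dirty(ObjId) { return false; }   // stub for the real dirty-flag check
void flush_to_base_tier(ObjId) {}        // stub: would write the object back
void evict_from_cache_tier(ObjId) {}     // stub: would drop the object data

// Map 'evict_effort' (acting as a priority) to a number of objects per pass.
int objects_per_pass(int evict_effort, int max_batch) {
  return std::max(1, evict_effort * max_batch / 100);
}

void evict_pass(TwoQ& q, int evict_effort) {
  int n = objects_per_pass(evict_effort, 64);
  while (n-- > 0 && !q.inactive.empty()) {
    TwoQ::Obj victim = q.inactive.back();
    if (is_dirty(victim.id))
      flush_to_base_tier(victim.id);  // flush first, then evict immediately
    evict_from_cache_tier(victim.id);
    q.index.erase(victim.id);
    q.inactive.pop_back();
  }
}

Flushing and evicting in the same pass also matches the suggestion elsewhere
in the thread that an evict should immediately follow the flush rather than
waiting for the agent to come around again.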

## LRU lists Snapshotting
- The two lists are snapshotted and persisted periodically.
- Only one copy needs to be saved. The old copy is removed when persisting the 
lists. The saved lists are used to restore the LRU lists when OSD reboots.
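
Since only the list order (plus per-object flags) needs to survive a reboot,
the saved form can be very simple. A sketch, again reusing the TwoQ types
above; this is not an actual Ceph encoding:

#include <vector>

// Illustrative snapshot: record each list head-to-tail so the lists can be
// rebuilt in order on reboot. (The referenced/dirty flags would be recorded
// alongside each id; omitted here for brevity.) Writing the new copy before
// removing the old one keeps exactly one valid snapshot on disk.
struct ListSnapshot {
  std::vector<ObjId> active_order;    // head first
  std::vector<ObjId> inactive_order;  // head first
};

ListSnapshot make_snapshot(const TwoQ& q) {
  ListSnapshot s;
  for (const auto& o : q.active)
    s.active_order.push_back(o.id);
  for (const auto& o : q.inactive)
    s.inactive_order.push_back(o.id);
  return s;
}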

Any comments/feedback are welcome.





RE: The design of the eviction improvement

2015-07-20 Thread Wang, Zhiqiang
Hi Nick,

 -Original Message-
 From: Nick Fisk [mailto:n...@fisk.me.uk]
 Sent: Monday, July 20, 2015 5:28 PM
 To: Wang, Zhiqiang; 'Sage Weil'; sj...@redhat.com;
 ceph-devel@vger.kernel.org
 Subject: RE: The design of the eviction improvement
 
 Hi,
 
  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
  ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang
  Sent: 20 July 2015 09:47
  To: Sage Weil sw...@redhat.com; sj...@redhat.com; ceph-
  de...@vger.kernel.org
  Subject: The design of the eviction improvement
 
  Hi all,
 
   This is a follow-up of one of the CDS session at
   http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction.
   We discussed the drawbacks of the current eviction algorithm and several
   ways to improve it. Seems like the LRU variants are the right way to go.
   I come up with some design points after the CDS, and want to discuss it
   with you. It is an approximate 2Q algorithm, combining some benefits of
   the clock algorithm, similar to what the linux kernel does for the page
   cache.
  
   # Design points:
  
   ## LRU lists
   - Maintain LRU lists at the PG level.
   The SharedLRU and SimpleLRU implementation in the current code have a
   max_size, which limits the max number of elements in the list. This
   mostly looks like a MRU, though its name implies they are LRUs. Since
   the object size may vary in a PG, it's not possible to calculate the
   total number of objects which the cache tier can hold ahead of time.
   We need a new LRU implementation with no limit on the size.
   - Two lists for each PG: active and inactive
   Objects are first put into the inactive list when they are accessed,
   and moved between these two lists based on some criteria.
   Object flag: active, referenced, unevictable, dirty.
   - When an object is accessed:
   1) If it's not in either of the lists, it's put on the top of the
   inactive list
   2) If it's in the inactive list, and the referenced flag is not set,
   the referenced flag is set, and it's moved to the top of the inactive
   list.
   3) If it's in the inactive list, and the referenced flag is set, the
   referenced flag is cleared, and it's removed from the inactive list,
   and put on top of the active list.
   4) If it's in the active list, and the referenced flag is not set, the
   referenced flag is set, and it's moved to the top of the active list.
   5) If it's in the active list, and the referenced flag is set, it's
   moved to the top of the active list.
   - When selecting objects to evict:
   1) Objects at the bottom of the inactive list are selected to evict.
   They are removed from the inactive list.
   2) If the number of the objects in the inactive list becomes low, some
   of the objects at the bottom of the active list are moved to the
   inactive list. For those objects which have the referenced flag set,
   they are given one more chance in the active list. They are moved to
   the top of the active list with the referenced flag cleared. For those
   objects which don't have the referenced flag set, they are moved to
   the inactive list, with the referenced flag set, so that they can be
   quickly promoted to the active list when necessary.
  
  
  I really like this idea but just out of interest, there must be a point
  where the overhead of managing much larger lists of very cold objects
  starts to impact on the gains of having exactly the right objects in each
  tier. If 90% of your hot IO is in 10% of the total data, how much extra
  benefit would you get by tracking all objects vs just tracking the top
  10,20,30%...etc and evicting randomly after that?  If these objects are
  being accessed infrequently, the impact of re-promoting is probably
  minimal and if the promotion code can get to a point where it is being a
  bit more intelligent about what objects are promoted then this is
  probably even more so?

The idea is that the lists only hold the objects in the cache tier. Those 
objects which are cold enough are evicted from the cache tier and removed 
from the lists. Also, the lists are maintained at the PG level. I guess the 
lists won't be extremely large? In your example of the 90%/10% data access, 
it may be right that randomly evicting the 90% cold data is good enough. But we 
need a way to know what the 10% of the hot data are. Also, we can't assume the 
90%/10% pattern for every workload.

 
  ## Combine flush with eviction
  - When evicting an object, if it's dirty, it's flushed first. After
  flushing, it's evicted. If not dirty, it's evicted directly.
  - This means that we won't have separate activities and won't set
  different ratios for flush and evict. Is there a need to do so?
  - Number of objects to evict at a time. 'evict_effort' acts as the
  priority, which is used to calculate the number of objects to evict.
 
 
 I think you want to flush at a lower limit because flushes can be more
 expensive than evictions

RE: The design of the eviction improvement

2015-07-20 Thread Wang, Zhiqiang


 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Tuesday, July 21, 2015 6:38 AM
 To: Wang, Zhiqiang
 Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
 Subject: Re: The design of the eviction improvement
 
 On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
  Hi all,
 
  This is a follow-up of one of the CDS session at
 http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tieri
 ng_eviction. We discussed the drawbacks of the current eviction algorithm and
 several ways to improve it. Seems like the LRU variants are the right way to 
 go. I
 come up with some design points after the CDS, and want to discuss it with 
 you.
 It is an approximate 2Q algorithm, combining some benefits of the clock
 algorithm, similar to what the linux kernel does for the page cache.
 
 Unfortunately I missed this last CDS so I'm behind on the discussion.  I have
 a few questions though...
 
  # Design points:
 
  ## LRU lists
  - Maintain LRU lists at the PG level.
  The SharedLRU and SimpleLRU implementation in the current code have a
  max_size, which limits the max number of elements in the list. This
  mostly looks like a MRU, though its name implies they are LRUs. Since
  the object size may vary in a PG, it's not possible to calculate the
  total number of objects which the cache tier can hold ahead of time.
  We need a new LRU implementation with no limit on the size.
 
 This last sentence seems to me to be the crux of it.  Assuming we have an
 OSD backed by flash storing O(n) objects, we need a way to maintain an LRU of
 O(n) objects in memory.  The current hitset-based approach was taken based
 on the assumption that this wasn't feasible--or at least we didn't know how to
 implement such a thing. If it is, or we simply want to stipulate that cache
 tier OSDs get gobs of RAM to make it possible, then lots of better options
 become possible...
 
 Let's say you have a 1TB SSD, with an average object size of 1MB -- that's
 1 million objects.  At maybe ~100bytes per object of RAM for an LRU entry
 that's 100MB... so not so unreasonable, perhaps!

I was having the same question before proposing this. I did a similar 
calculation and thought it would be ok to use this much memory :-)

 
  - Two lists for each PG: active and inactive. Objects are first put
  into the inactive list when they are accessed, and moved between these two
 lists based on some criteria.
  Object flag: active, referenced, unevictable, dirty.
  - When an object is accessed:
  1) If it's not in either of the lists, it's put on the top of the
  inactive list
  2) If it's in the inactive list, and the referenced flag is not set, the 
  referenced
 flag is set, and it's moved to the top of the inactive list.
  3) If it's in the inactive list, and the referenced flag is set, the 
  referenced flag
 is cleared, and it's removed from the inactive list, and put on top of the 
 active
 list.
  4) If it's in the active list, and the referenced flag is not set, the 
  referenced
 flag is set, and it's moved to the top of the active list.
  5) If it's in the active list, and the referenced flag is set, it's moved 
  to the top
 of the active list.
  - When selecting objects to evict:
  1) Objects at the bottom of the inactive list are selected to evict. They 
  are
 removed from the inactive list.
  2) If the number of the objects in the inactive list becomes low, some of 
  the
 objects at the bottom of the active list are moved to the inactive list. For 
 those
 objects which have the referenced flag set, they are given one more chance in
 the active list. They are moved to the top of the active list with the 
 referenced
 flag cleared. For those objects which don't have the referenced flag set, they
 are moved to the inactive list, with the referenced flag set, so that they
 can be quickly promoted to the active list when necessary.
 
  ## Combine flush with eviction
  - When evicting an object, if it's dirty, it's flushed first. After 
  flushing, it's
 evicted. If not dirty, it's evicted directly.
  - This means that we won't have separate activities and won't set different
 ratios for flush and evict. Is there a need to do so?
  - Number of objects to evict at a time. 'evict_effort' acts as the 
  priority, which
 is used to calculate the number of objects to evict.
 
 As someone else mentioned in a follow-up, the reason we let the dirty level be
 set lower than the full level is that it provides headroom so that objects 
 can be
 quickly evicted (delete, no flush) to make room for new writes or new
 promotions.
 
 That said, we probably can/should streamline the flush so that an evict can
 immediately follow without waiting for the agent to come around again.
 (I don't think we do that now?)

I was afraid of having to maintain different lists for flush/evict. So I 
proposed to combine flush with evict. But it seems like we don't need to. When 
reaching