Re: Breaks Replaces in debian/control in backports

2015-07-22 Thread Ken Dreyer
On 07/19/2015 05:28 AM, Loic Dachary wrote:

 I think it achieves the same thing and is less error prone in the case of 
 backports. The risk is that upgrading from v0.94.2-34 to the version with 
 this change will fail because the conditions are satisfied (it thinks all 
 versions after v0.94.2 have the change). But the odds of having a test 
 machine with this specific version already installed are close to 
 non-existent. The odds of us picking the wrong number and ending up with 
 something that's either too high or too low are higher.
 
 What do you think ?
 

I think this is great, thanks for proposing it. We should also write our
convention down someplace (SubmittingPatches, or the wiki, or something).

- Ken


RE: The design of the eviction improvement

2015-07-22 Thread Sage Weil
On Wed, 22 Jul 2015, Allen Samuels wrote:
 I'm very concerned about designing around the assumption that objects 
 are ~1MB in size. That's probably a good assumption for block and HDFS 
 dominated systems, but likely a very poor assumption about many object 
 and file dominated systems.
 
 If I understand the proposals that have been discussed, each of them 
 assumes an in-memory data structure with an entry per object (the exact 
 size of the entry varies with the different proposals).
 
 Under that assumption, I have another concern which is the lack of 
 graceful degradation as the object counts grow and the in-memory data 
 structures get larger. Everything seems fine until just a few objects 
 get added then the system starts to page and performance drops 
 dramatically (likely) to the point where Linux will start killing OSDs.
 
 What's really needed is some kind of way to extend the lists into 
 storage in a way that doesn't cause a zillion I/O operations.
 
 I have some vague idea that some data structure like the LSM mechanism 
 ought to be able to accomplish what we want. Some amount of the data 
 structure (the most likely to be used) is held in DRAM [and backed to 
 storage for restart] and the least likely to be used is flushed to 
 storage with some mechanism that allows batched updates.

How about this:

The basic mapping we want is object -> atime.

We keep a simple LRU of the top N objects in memory with the object->atime 
values.  When an object is accessed, it is moved or added to the top 
of the list.

Periodically, or when the LRU size reaches N * (1.x), we flush:

 - write the top N items to a compact object that can be quickly loaded
 - write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb 
in a simple object -> atime fashion

When the agent runs, we just walk across that key range of the db the same 
way we currently enumerate objects.  For each record we use either the 
stored atime or the value in the in-memory LRU (it'll need to be 
dual-indexed by both a list and a hash map), whichever is newer.  We can 
use the same histogram estimation approach we do now to determine if the 
object in question is below the flush/evict threshold.

The LSM does the work of sorting/compacting the atime info, while we avoid 
touching it at all for the hottest objects to keep the amount of work it 
has to do in check.
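
To make the dual-index concrete, a rough sketch (illustrative only, not actual 
Ceph code; KVStore here is just a stand-in for the leveldb/rocksdb interface) 
of the in-memory structure and the flush step might look like:

  // Rough sketch only: dual-indexed LRU (list for recency, hash map for
  // lookup) that spills the oldest entries (N .. N*1.x) to a kv store.
  #include <cstdint>
  #include <list>
  #include <string>
  #include <unordered_map>
  #include <utility>

  struct KVStore {                        // stand-in for leveldb/rocksdb
    virtual void put(const std::string& oid, uint64_t atime) = 0;
    virtual ~KVStore() {}
  };

  class AtimeLRU {
    using Entry = std::pair<std::string, uint64_t>;     // object -> atime
    std::list<Entry> lru;                               // front = newest
    std::unordered_map<std::string, std::list<Entry>::iterator> index;
    size_t n; double slack; KVStore *db;
  public:
    AtimeLRU(size_t n, double slack, KVStore *db)
      : n(n), slack(slack), db(db) {}

    // On access, move or add the object to the top of the list.
    void touch(const std::string& oid, uint64_t atime) {
      auto it = index.find(oid);
      if (it != index.end())
        lru.erase(it->second);
      lru.push_front({oid, atime});
      index[oid] = lru.begin();
      if (lru.size() >= (size_t)(n * slack))
        flush();
    }

    // Spill the oldest items beyond N out to the kv store; a real version
    // would also persist the top-N list as a compact object here.
    void flush() {
      while (lru.size() > n) {
        db->put(lru.back().first, lru.back().second);
        index.erase(lru.back().first);
        lru.pop_back();
      }
    }
  };

The hash map alongside the list is what makes the "moved or added to the top" 
step O(1) while still letting the agent look up an object's in-memory atime by 
name.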

sage

 
 Allen Samuels
 Software Architect, Systems and Software Solutions
 
 2880 Junction Avenue, San Jose, CA 95134
 T: +1 408 801 7030| M: +1 408 780 6416
 allen.samu...@sandisk.com
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Wednesday, July 22, 2015 5:57 AM
 To: Wang, Zhiqiang
 Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
 Subject: RE: The design of the eviction improvement
 
 On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
   The part that worries me now is the speed with which we can load and
   manage such a list.  Assuming it is several hundred MB, it'll take a
   while to load that into memory and set up all the pointers (assuming
   a conventional linked list structure).  Maybe tens of seconds...
 
  I'm thinking of maintaining the lists at the PG level. That's to say,
  we have an active/inactive list for every PG. We can load the lists in
  parallel during rebooting. Also, the ~100 MB lists are split among
  different OSD nodes. Perhaps it does not need such long time to load
  them?
 
  
   I wonder if instead we should construct some sort of flat model
   where we load slabs of contiguous memory, 10's of MB each, and have
   the next/previous pointers be a (slab,position) pair.  That way we
   can load it into memory in big chunks, quickly, and be able to
   operate on it (adjust links) immediately.
  
   Another thought: currently we use the hobject_t hash only instead of
   the full object name.  We could continue to do the same, or we could
   do a hash pair (hobject_t hash + a different hash of the rest of the
   object) to keep the representation compact.  With a model like the 
   above, that could get the object representation down to 2 u32's.  A
   link could be a slab + position (2 more u32's), and if we have prev
   + next that'd be just 6x4=24 bytes per object.
 
  Looks like for an object, the head and the snapshot version have the
  same hobject hash. Thus we have to use the hash pair instead of just
  the hobject hash. But I still have two questions if we use the hash
  pair to represent an object.
 
  1) Does the hash pair uniquely identify an object? That's to say, is
  it possible for two objects to have the same hash pair?
 
 With two hashes, collisions would be rare but could happen.
 
  2) We need a way to get the full object name from the hash pair, so
  that we know what objects to evict. But seems like we don't have a
  good way to do this?
 
 Ah, yeah--I'm a little stuck in the current hitset view of things.  I think 
 we can either 

building just src/tools/rados

2015-07-22 Thread Deneau, Tom
Is there a make command that would build just the src/tools or even just 
src/tools/rados ?

-- Tom Deneau



RE: The design of the eviction improvement

2015-07-22 Thread Allen Samuels
I'm very concerned about designing around the assumption that objects are ~1MB 
in size. That's probably a good assumption for block and HDFS dominated 
systems, but likely a very poor assumption about many object and file dominated 
systems.

If I understand the proposals that have been discussed, each of them assumes an 
in-memory data structure with an entry per object (the exact size of the entry 
varies with the different proposals).

Under that assumption, I have another concern which is the lack of graceful 
degradation as the object counts grow and the in-memory data structures get 
larger. Everything seems fine until just a few objects get added, and then the 
system starts to page and performance drops dramatically (likely) to the point 
where Linux will start killing OSDs.

What's really needed is some kind of way to extend the lists into storage in a 
way that doesn't cause a zillion I/O operations.

I have some vague idea that some data structure like the LSM mechanism ought to 
be able to accomplish what we want. Some amount of the data structure (the most 
likely to be used) is held in DRAM [and backed to storage for restart] and the 
least likely to be used is flushed to storage with some mechanism that allows 
batched updates.

Allen Samuels
Software Architect, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Wednesday, July 22, 2015 5:57 AM
To: Wang, Zhiqiang
Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: RE: The design of the eviction improvement

On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
  The part that worries me now is the speed with which we can load and
  manage such a list.  Assuming it is several hundred MB, it'll take a
  while to load that into memory and set up all the pointers (assuming
  a conventional linked list structure).  Maybe tens of seconds...

 I'm thinking of maintaining the lists at the PG level. That's to say,
 we have an active/inactive list for every PG. We can load the lists in
 parallel during rebooting. Also, the ~100 MB lists are split among
 different OSD nodes. Perhaps it does not need such long time to load
 them?

 
  I wonder if instead we should construct some sort of flat model
  where we load slabs of contiguous memory, 10's of MB each, and have
  the next/previous pointers be a (slab,position) pair.  That way we
  can load it into memory in big chunks, quickly, and be able to
  operate on it (adjust links) immediately.
 
  Another thought: currently we use the hobject_t hash only instead of
  the full object name.  We could continue to do the same, or we could
  do a hash pair (hobject_t hash + a different hash of the rest of the
  object) to keep the representation compact.  With a model like the 
  above, that could get the object representation down to 2 u32's.  A
  link could be a slab + position (2 more u32's), and if we have prev
  + next that'd be just 6x4=24 bytes per object.

 Looks like for an object, the head and the snapshot version have the
 same hobject hash. Thus we have to use the hash pair instead of just
 the hobject hash. But I still have two questions if we use the hash
 pair to represent an object.

 1) Does the hash pair uniquely identify an object? That's to say, is
 it possible for two objects to have the same hash pair?

With two hashes, collisions would be rare but could happen.

 2) We need a way to get the full object name from the hash pair, so
 that we know what objects to evict. But seems like we don't have a
 good way to do this?

Ah, yeah--I'm a little stuck in the current hitset view of things.  I think we 
can either embed the full ghobject_t (which means we lose the fixed-size 
property, and the per-object overhead goes way up.. probably from ~24 bytes to 
more like 80 or 100).  Or, we can enumerate objects starting at the (hobject_t) 
hash position to find the object.  That's somewhat inefficient for FileStore 
(it'll list a directory of a hundred or so objects, probably, and iterate over 
them to find the right one), but for NewStore it will be quite fast (NewStore 
has all objects sorted into keys in rocksdb, so we just start listing at the 
right offset).  Usually we'll get the object right off, unless there are 
hobject_t hash collisions (already reasonably rare since it's a 2^32 space for 
the pool).

Given that, I would lean toward the 2-hash fixed-sized records (of these 2 
options)...

sage





Re: The design of the eviction improvement

2015-07-22 Thread Matt W. Benjamin
Hi,

- Allen Samuels allen.samu...@sandisk.com wrote:

 I'm very concerned about designing around the assumption that objects
 are ~1MB in size. That's probably a good assumption for block and HDFS
 dominated systems, but likely a very poor assumption about many object
 and file dominated systems.

++

 
 If I understand the proposals that have been discussed, each of them
 assumes an in-memory data structure with an entry per object (the
 exact size of the entry varies with the different proposals).
 
 Under that assumption, I have another concern which is the lack of
 graceful degradation as the object counts grow and the in-memory data
 structures get larger. Everything seems fine until just a few objects
 get added then the system starts to page and performance drops
 dramatically (likely) to the point where Linux will start killing
 OSDs.

I'm not clear why that needs to be the case (though I don't think it matters 
just now whether I do; I was just letting folks know that we have MQ 
implementation(s)), but what you're describing seems consistent with the model 
Sage and Greg, at least, are describing.

Matt

 
 What's really needed is some kind of way to extend the lists into
 storage in a way that doesn't cause a zillion I/O operations.
 
 I have some vague idea that some data structure like the LSM mechanism
 ought to be able to accomplish what we want. Some amount of the data
 structure (the most likely to be used) is held in DRAM [and backed to
 storage for restart] and the least likely to be used is flushed to
 storage with some mechanism that allows batched updates.
 
 Allen Samuels
 Software Architect, Systems and Software Solutions
 
 2880 Junction Avenue, San Jose, CA 95134
 T: +1 408 801 7030| M: +1 408 780 6416
 allen.samu...@sandisk.com
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Wednesday, July 22, 2015 5:57 AM
 To: Wang, Zhiqiang
 Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
 Subject: RE: The design of the eviction improvement
 
 On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
   The part that worries me now is the speed with which we can load
 and
   manage such a list.  Assuming it is several hundred MB, it'll take
 a
   while to load that into memory and set up all the pointers
 (assuming
   a conventional linked list structure).  Maybe tens of seconds...
 
  I'm thinking of maintaining the lists at the PG level. That's to
 say,
  we have an active/inactive list for every PG. We can load the lists
 in
  parallel during rebooting. Also, the ~100 MB lists are split among
  different OSD nodes. Perhaps it does not need such long time to
 load
  them?
 
  
   I wonder if instead we should construct some sort of flat model
   where we load slabs of contiguous memory, 10's of MB each, and
 have
   the next/previous pointers be a (slab,position) pair.  That way
 we
   can load it into memory in big chunks, quickly, and be able to
   operate on it (adjust links) immediately.
  
   Another thought: currently we use the hobject_t hash only instead
 of
   the full object name.  We could continue to do the same, or we
 could
   do a hash pair (hobject_t hash + a different hash of the rest of
 the
   object) to keep the representation compact.  With a model like the
   above, that could get the object representation down to 2 u32's. 
 A
   link could be a slab + position (2 more u32's), and if we have
 prev
   + next that'd be just 6x4=24 bytes per object.
 
  Looks like for an object, the head and the snapshot version have
 the
  same hobject hash. Thus we have to use the hash pair instead of
 just
  the hobject hash. But I still have two questions if we use the hash
  pair to represent an object.
 
  1) Does the hash pair uniquely identify an object? That's to say,
 is
  it possible for two objects to have the same hash pair?
 
 With two hashes, collisions would be rare but could happen.
 
  2) We need a way to get the full object name from the hash pair, so
  that we know what objects to evict. But seems like we don't have a
  good way to do this?
 
 Ah, yeah--I'm a little stuck in the current hitset view of things.  I
 think we can either embed the full ghobject_t (which means we lose the
 fixed-size property, and the per-object overhead goes way up..
 probably from ~24 bytes to more like 80 or 100).  Or, we can enumerate
 objects starting at the (hobject_t) hash position to find the object. 
 That's somewhat inefficient for FileStore (it'll list a directory of a
 hundred or so objects, probably, and iterate over them to find the
 right one), but for NewStore it will be quite fast (NewStore has all
 objects sorted into keys in rocksdb, so we just start listing at the
 right offset).  Usually we'll get the object right off, unless there
 are hobject_t hash collisions (already reasonably rare since it's a
 2^32 space for the pool).
 
 Given that, I would lean toward the 2-hash fixed-sized records (of
 

RE: The design of the eviction improvement

2015-07-22 Thread Allen Samuels
Don't we need to double-index the data structure?

We need it indexed by atime for the purposes of eviction, but we need it 
indexed by object name for the purposes of updating the list when an object is 
used.




Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Wednesday, July 22, 2015 11:51 AM
To: Allen Samuels
Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: RE: The design of the eviction improvement

On Wed, 22 Jul 2015, Allen Samuels wrote:
 I'm very concerned about designing around the assumption that objects 
 are ~1MB in size. That's probably a good assumption for block and HDFS 
 dominated systems, but likely a very poor assumption about many object 
 and file dominated systems.
 
 If I understand the proposals that have been discussed, each of them 
  assumes an in-memory data structure with an entry per object (the 
 exact size of the entry varies with the different proposals).
 
 Under that assumption, I have another concern which is the lack of 
 graceful degradation as the object counts grow and the in-memory data 
 structures get larger. Everything seems fine until just a few objects 
 get added then the system starts to page and performance drops 
 dramatically (likely) to the point where Linux will start killing OSDs.
 
 What's really needed is some kind of way to extend the lists into 
  storage in a way that doesn't cause a zillion I/O operations.
 
 I have some vague idea that some data structure like the LSM mechanism 
 ought to be able to accomplish what we want. Some amount of the data 
 structure (the most likely to be used) is held in DRAM [and backed to 
 storage for restart] and the least likely to be used is flushed to 
 storage with some mechanism that allows batched updates.

How about this:

The basic mapping we want is object -> atime.

We keep a simple LRU of the top N objects in memory with the object->atime 
values.  When an object is accessed, it is moved or added to the top of the 
list.

Periodically, or when the LRU size reaches N * (1.x), we flush:

 - write the top N items to a compact object that can be quickly loaded
 - write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb in a 
simple object -> atime fashion

When the agent runs, we just walk across that key range of the db the same way 
we currently enumerate objects.  For each record we use either the stored atime 
or the value in the in-memory LRU (it'll need to be dual-indexed by both a list 
and a hash map), whichever is newer.  We can use the same histogram estimation 
approach we do now to determine if the object in question is below the 
flush/evict threshold.

The LSM does the work of sorting/compacting the atime info, while we avoid 
touching it at all for the hottest objects to keep the amount of work it has to 
do in check.

sage

 
 Allen Samuels
 Software Architect, Systems and Software Solutions
 
 2880 Junction Avenue, San Jose, CA 95134
 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Wednesday, July 22, 2015 5:57 AM
 To: Wang, Zhiqiang
 Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
 Subject: RE: The design of the eviction improvement
 
 On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
   The part that worries me now is the speed with which we can load 
   and manage such a list.  Assuming it is several hundred MB, it'll 
   take a while to load that into memory and set up all the pointers 
   (assuming a conventional linked list structure).  Maybe tens of seconds...
 
  I'm thinking of maintaining the lists at the PG level. That's to 
  say, we have an active/inactive list for every PG. We can load the 
  lists in parallel during rebooting. Also, the ~100 MB lists are 
  split among different OSD nodes. Perhaps it does not need such long 
  time to load them?
 
  
   I wonder if instead we should construct some sort of flat model 
   where we load slabs of contiguous memory, 10's of MB each, and 
   have the next/previous pointers be a (slab,position) pair.  That 
   way we can load it into memory in big chunks, quickly, and be able 
   to operate on it (adjust links) immediately.
  
   Another thought: currently we use the hobject_t hash only instead 
   of the full object name.  We could continue to do the same, or we 
   could do a hash pair (hobject_t hash + a different hash of the 
   rest of the
   object) to keep the representation compact.  With a model like the 
   above, that could get the object representation down to 2 u32's.  
   A link could be a slab + position (2 more u32's), and if we have 
   prev
   + next that'd be just 6x4=24 bytes 

Re: upstream/firefly exporting the same snap 2 times results in different exports

2015-07-22 Thread Stefan Priebe - Profihost AG

Am 21.07.2015 um 22:50 schrieb Josh Durgin:
 Yes, I'm afraid it sounds like it is. You can double check whether the
 watch exists on an image by getting the id of the image from 'rbd info
 $pool/$image | grep block_name_prefix':
 
 block_name_prefix: rbd_data.105674b0dc51
 
 The id is the hex number there. Append that to 'rbd_header.' and you
 have the header object name. Check whether it has watchers with:
 
 rados listwatchers -p $pool rbd_header.105674b0dc51
 
 If that doesn't show any watchers while the image is in use by a vm,
 it's #9806.

Yes it does not show any watchers.

 I just merged the backport for firefly, so it'll be in 0.80.11.
 Sorry it took so long to get to firefly :(. We'll need to be
 more vigilant about checking non-trivial backports when we're
 going through all the bugs periodically.

That would be really important. I've seen that this one was already in
upstream/firefly-backports. What's the purpose of that branch?

Greets,
Stefan

 Josh
 
 On 07/21/2015 12:52 PM, Stefan Priebe wrote:
 So this is really this old bug?

 http://tracker.ceph.com/issues/9806

 Stefan
 Am 21.07.2015 um 21:46 schrieb Josh Durgin:
 On 07/21/2015 12:22 PM, Stefan Priebe wrote:

 Am 21.07.2015 um 19:19 schrieb Jason Dillaman:
 Does this still occur if you export the images to the console (i.e.
 rbd export cephstor/disk-116@snap - > dump_file)?

 Would it be possible for you to provide logs from the two rbd export
 runs on your smallest VM image?  If so, please add the following to
 the [client] section of your ceph.conf:

log file = /valid/path/to/logs/$name.$pid.log
debug rbd = 20

 I opened a ticket [1] where you can attach the logs (if they aren't
 too large).

 [1] http://tracker.ceph.com/issues/12422

 Will post some more details to the tracker in a few hours. It seems it
 is related to using discard inside the guest but not on the FS the OSD is
 on.

 That sounds very odd. Could you verify via 'rados listwatchers' on an
 in-use rbd image's header object that there's still a watch established?

 Have you increased pgs in all those clusters recently?

 Josh


Re: quick way to rebuild deb packages

2015-07-22 Thread Loic Dachary
Hi,

Did you try https://github.com/ceph/ceph/blob/master/make-debs.sh ? I would 
recommend running https://github.com/ceph/ceph/blob/master/run-make-check.sh 
first to make sure you can build and test: this will install the dependencies 
you're missing at the same time.

Cheers

On 21/07/2015 18:15, Bartłomiej Święcki wrote:
 Hi all,
 
 I'm currently working on a test environment for ceph where we're using deb 
 files to deploy new versions on a test cluster.
 To make this work efficiently I'd have to quickly build deb packages.
 
 I tried dpkg-buildpackage -nc which should keep the results of a previous 
 build but it ends up in a linking error:
 
 ...
   CXXLD    ceph_rgw_jsonparser
 ./.libs/libglobal.a(json_spirit_reader.o): In function `~thread_specific_ptr':
 /usr/include/boost/thread/tss.hpp:79: undefined reference to 
 `boost::detail::set_tss_data(void const*, 
 boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to 
 `boost::detail::set_tss_data(void const*, 
 boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to 
 `boost::detail::set_tss_data(void const*, 
 boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to 
 `boost::detail::set_tss_data(void const*, 
 boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to 
 `boost::detail::set_tss_data(void const*, 
 boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
 ./.libs/libglobal.a(json_spirit_reader.o):/usr/include/boost/thread/tss.hpp:79:
 more undefined references to `boost::detail::set_tss_data(void const*, 
 boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)' follow
 ./.libs/libglobal.a(json_spirit_reader.o): In function `call_once<void (*)()>':
 ...
 
 Any ideas on what could go wrong here ?
 
 The version I'm compiling is v0.94.1, but I've observed the same results with 9.0.1.
 

-- 
Loïc Dachary, Artisan Logiciel Libre





RE: quick way to rebuild deb packages

2015-07-22 Thread Zhou, Yuan
I'm also using make-debs.sh to generate the binaries for some local deployment. 
Note that if you need the *tests.deb you'll need to change this script a bit.

@@ -58,8 +58,8 @@ tar -C $releasedir -zxf $releasedir/ceph_$vers.orig.tar.gz
 #
 cp -a debian $releasedir/ceph-$vers/debian
 cd $releasedir
-perl -ni -e 'print if(!(/^Package: .*-dbg$/../^$/))' ceph-$vers/debian/control
-perl -pi -e 's/--dbg-package.*//' ceph-$vers/debian/rules
+#perl -ni -e 'print if(!(/^Package: .*-dbg$/../^$/))' ceph-$vers/debian/control
+#perl -pi -e 's/--dbg-package.*//' ceph-$vers/debian/rules
 #
 # always set the debian version to 1 which is ok because the debian
 # directory is included in the sources and the upstream version will 



-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Loic Dachary
Sent: Wednesday, July 22, 2015 2:32 PM
To: Bartłomiej Święcki; ceph-devel@vger.kernel.org
Subject: Re: quick way to rebuild deb packages

Hi,

Did you try https://github.com/ceph/ceph/blob/master/make-debs.sh ? I would 
recommend running https://github.com/ceph/ceph/blob/master/run-make-check.sh 
first to make sure you can build and test: this will install the dependencies 
you're missing at the same time.

Cheers

On 21/07/2015 18:15, Bartłomiej Święcki wrote:
 Hi all,
 
 I'm currently working on a test environment for ceph where we're using deb 
 files to deploy new versions on a test cluster.
 To make this work efficiently I'd have to quickly build deb packages.
 
 I tried dpkg-buildpackage -nc which should keep the results of a previous 
 build but it ends up in a linking error:
 
 ...
   CXXLD    ceph_rgw_jsonparser
 ./.libs/libglobal.a(json_spirit_reader.o): In function `~thread_specific_ptr':
 /usr/include/boost/thread/tss.hpp:79: undefined reference to 
 `boost::detail::set_tss_data(void const*, 
 boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to 
 `boost::detail::set_tss_data(void const*, 
 boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to 
 `boost::detail::set_tss_data(void const*, 
 boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to 
 `boost::detail::set_tss_data(void const*, 
 boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to 
 `boost::detail::set_tss_data(void const*, 
 boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
 ./.libs/libglobal.a(json_spirit_reader.o):/usr/include/boost/thread/tss.hpp:79:
 more undefined references to `boost::detail::set_tss_data(void const*, 
 boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)' follow
 ./.libs/libglobal.a(json_spirit_reader.o): In function `call_once<void (*)()>':
 ...
 
 Any ideas on what could go wrong here ?
 
 Version I'm compiling is v0.94.1 but I've observed same results with 9.0.1.
 

-- 
Loïc Dachary, Artisan Logiciel Libre


Inactive PGs should trigger a HEALTH_ERR state

2015-07-22 Thread Wido den Hollander
Hi,

I was just testing with a cluster on VMs and I noticed that
undersized+degraded+peering PGs do not trigger a HEALTH_ERR state. Why
is that?

In my opinion any PG which is not active+? should trigger a HEALTH_ERR
state since I/O is blocking at that point.

Is that a sane thing to do or am I missing something?

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: upstream/firefly exporting the same snap 2 times results in different exports

2015-07-22 Thread Nathan Cutler
On 2015-07-22 09:03, Stefan Priebe - Profihost AG wrote:
 That would be really important. I've seen that this one was already in
 upstream/firefly-backports. What's the purpose of that branch?

That is where the Stable Releases and Backports team stages backports
and does integration testing on them before they are merged into the
'firefly' named branch.

-- 
Nathan Cutler
Software Engineer Distributed Storage
SUSE LINUX, s.r.o.
Tel.: +420 284 084 037


Re: quick way to rebuild deb packages

2015-07-22 Thread Bartłomiej Święcki
I'll definitely take a look at make-debs.sh, looks promising. Thanks for the 
hint.

I can see it's using ccache, let's see how fast it is :) What build times are 
you experiencing ?

On Wed, 22 Jul 2015 08:04:44 +
Zhou, Yuan yuan.z...@intel.com wrote:

 I'm also using make-debs.sh to generate the binaries for some local 
 deployment. Note that if you need the *tests.deb you'll need to change this 
 scripts a bit.
 
 @@ -58,8 +58,8 @@ tar -C $releasedir -zxf $releasedir/ceph_$vers.orig.tar.gz
  #
  cp -a debian $releasedir/ceph-$vers/debian
  cd $releasedir
 -perl -ni -e 'print if(!(/^Package: .*-dbg$/../^$/))' 
 ceph-$vers/debian/control
 -perl -pi -e 's/--dbg-package.*//' ceph-$vers/debian/rules
 +#perl -ni -e 'print if(!(/^Package: .*-dbg$/../^$/))' 
 ceph-$vers/debian/control
 +#perl -pi -e 's/--dbg-package.*//' ceph-$vers/debian/rules
  #
  # always set the debian version to 1 which is ok because the debian
  # directory is included in the sources and the upstream version will 
 
 
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Loic Dachary
 Sent: Wednesday, July 22, 2015 2:32 PM
 To: Bartłomiej Święcki; ceph-devel@vger.kernel.org
 Subject: Re: quick way to rebuild deb packages
 
 Hi,
 
 Did you try https://github.com/ceph/ceph/blob/master/make-debs.sh ? I would 
 recommend running https://github.com/ceph/ceph/blob/master/run-make-check.sh 
 first to make sure you can build and test: this will install the dependencies 
 you're missing at the same time.
 
 Cheers
 
 On 21/07/2015 18:15, Bartłomiej Święcki wrote:
  Hi all,
  
  I'm currently working on a test environment for ceph where we're using deb 
  files to deploy new versions on a test cluster.
  To make this work efficiently I'd have to quickly build deb packages.
  
  I tried dpkg-buildpackage -nc which should keep the results of a previous 
  build but it ends up in a linking error:
  
  ...
    CXXLD    ceph_rgw_jsonparser
  ./.libs/libglobal.a(json_spirit_reader.o): In function `~thread_specific_ptr':
  /usr/include/boost/thread/tss.hpp:79: undefined reference to 
  `boost::detail::set_tss_data(void const*, 
  boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
  /usr/include/boost/thread/tss.hpp:79: undefined reference to 
  `boost::detail::set_tss_data(void const*, 
  boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
  /usr/include/boost/thread/tss.hpp:79: undefined reference to 
  `boost::detail::set_tss_data(void const*, 
  boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
  /usr/include/boost/thread/tss.hpp:79: undefined reference to 
  `boost::detail::set_tss_data(void const*, 
  boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
  /usr/include/boost/thread/tss.hpp:79: undefined reference to 
  `boost::detail::set_tss_data(void const*, 
  boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
  ./.libs/libglobal.a(json_spirit_reader.o):/usr/include/boost/thread/tss.hpp:79:
  more undefined references to `boost::detail::set_tss_data(void const*, 
  boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)' follow
  ./.libs/libglobal.a(json_spirit_reader.o): In function `call_once<void (*)()>':
  ...
  
  Any ideas on what could go wrong here ?
  
  Version I'm compiling is v0.94.1 but I've observed same results with 9.0.1.
  
 
 -- 
 Loïc Dachary, Artisan Logiciel Libre
 


-- 
Bartlomiej Swiecki bartlomiej.swie...@corp.ovh.com


Re: Inactive PGs should trigger a HEALTH_ERR state

2015-07-22 Thread Sage Weil
On Wed, 22 Jul 2015, Wido den Hollander wrote:
 Hi,
 
 I was just testing with a cluster on VMs and I noticed that
 undersized+degraded+peering PGs do not trigger a HEALTH_ERR state. Why
 is that?
 
 In my opinion any PG which is not active+? should trigger a HEALTH_ERR
 state since I/O is blocking at that point.
 
 Is that a sane thing to do or am I missing something?

IIRC they trigger a WARN state until they are 'stuck' inactive, at which 
point they trigger an ERR state.  The idea is that it is totally normal 
for PGs to be in an inactive state for short periods due to normal cluster 
churn--it's only problematic if they get stuck there.

sage


RE: The design of the eviction improvement

2015-07-22 Thread Sage Weil
On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
  The part that worries me now is the speed with which we can load and 
  manage such a list.  Assuming it is several hundred MB, it'll take a 
  while to load that into memory and set up all the pointers (assuming a 
  conventional linked list structure).  Maybe tens of seconds...
 
 I'm thinking of maintaining the lists at the PG level. That's to say, we 
 have an active/inactive list for every PG. We can load the lists in 
 parallel during rebooting. Also, the ~100 MB lists are split among 
 different OSD nodes. Perhaps it does not need such long time to load 
 them?
 
  
  I wonder if instead we should construct some sort of flat model where 
  we load slabs of contiguous memory, 10's of MB each, and have the 
  next/previous pointers be a (slab,position) pair.  That way we can 
  load it into memory in big chunks, quickly, and be able to operate on 
  it (adjust links) immediately.
  
  Another thought: currently we use the hobject_t hash only instead of 
  the full object name.  We could continue to do the same, or we could 
  do a hash pair (hobject_t hash + a different hash of the rest of the 
   object) to keep the representation compact.  With a model like the 
  above, that could get the object representation down to 2 u32's.  A 
  link could be a slab + position (2 more u32's), and if we have prev + 
  next that'd be just 6x4=24 bytes per object.
 
 Looks like for an object, the head and the snapshot version have the 
 same hobject hash. Thus we have to use the hash pair instead of just the 
 hobject hash. But I still have two questions if we use the hash pair to 
 represent an object.

 1) Does the hash pair uniquely identify an object? That's to say, is it 
 possible for two objects to have the same hash pair?

With two hashes, collisions would be rare but could happen.

 2) We need a way to get the full object name from the hash pair, so that 
 we know what objects to evict. But seems like we don't have a good way 
 to do this?

Ah, yeah--I'm a little stuck in the current hitset view of things.  I 
think we can either embed the full ghobject_t (which means we lose the 
fixed-size property, and the per-object overhead goes way up.. probably 
from ~24 bytes to more like 80 or 100).  Or, we can enumerate objects 
starting at the (hobject_t) hash position to find the object.  That's 
somewhat inefficient for FileStore (it'll list a directory of a hundred or 
so objects, probably, and iterate over them to find the right one), but 
for NewStore it will be quite fast (NewStore has all objects sorted into 
keys in rocksdb, so we just start listing at the right offset).  Usually 
we'll get the object right off, unless there are hobject_t hash collisions 
(already reasonably rare since it's a 2^32 space for the pool).

Given that, I would lean toward the 2-hash fixed-sized records (of these 2 
options)...
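
For reference, a fixed-size record along the lines discussed above might look 
like this (illustrative only; the struct and field names are made up, not 
actual Ceph types):

  // Illustration: 2 x u32 hashes plus (slab, position) prev/next links,
  // i.e. 6 x u32 = 24 bytes per object, allocated inside slabs of
  // contiguous memory that can be loaded in big chunks.
  #include <cstdint>
  #include <vector>

  struct LinkRef {
    uint32_t slab;           // which slab of contiguous memory
    uint32_t pos;            // index within that slab
  };

  struct ObjectRecord {
    uint32_t hobject_hash;   // existing hobject_t hash (2^32 space per pool)
    uint32_t name_hash;      // second hash over the rest of the object name
    LinkRef prev;            // previous entry in the LRU chain
    LinkRef next;            // next entry in the LRU chain
  };
  static_assert(sizeof(ObjectRecord) == 24, "6 x u32 per object");

  struct Slab {
    std::vector<ObjectRecord> records;   // e.g. tens of MB per slab
  };

  inline ObjectRecord& resolve(std::vector<Slab>& slabs, LinkRef r) {
    return slabs[r.slab].records[r.pos];
  }

The slab indirection is what allows the structure to be loaded into memory in 
big contiguous chunks and operated on immediately, with eviction recovering the 
full name by enumerating from the hobject_t hash position as described above.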

sage



RE: The design of the eviction improvement

2015-07-22 Thread Wang, Zhiqiang
Hi Allen,

 -Original Message-
 From: Allen Samuels [mailto:allen.samu...@sandisk.com]
 Sent: Thursday, July 23, 2015 2:41 AM
 To: Sage Weil; Wang, Zhiqiang
 Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
 Subject: RE: The design of the eviction improvement
 
 I'm very concerned about designing around the assumption that objects are
 ~1MB in size. That's probably a good assumption for block and HDFS dominated
 systems, but likely a very poor assumption about many object and file
 dominated systems.

This is true. If we have lots of small objects/files, the memory used for LRU 
lists could be extremely large.

 
 If I understand the proposals that have been discussed, each of them assumes
 an in-memory data structure with an entry per object (the exact size of the
 entry varies with the different proposals).
 
 Under that assumption, I have another concern which is the lack of graceful
 degradation as the object counts grow and the in-memory data structures get
 larger. Everything seems fine until just a few objects get added then the 
 system
 starts to page and performance drops dramatically (likely) to the point where
 Linux will start killing OSDs.
 
 What's really needed is some kind of way to extend the lists into storage in a
 way that doesn't cause a zillion I/O operations.
 
 I have some vague idea that some data structure like the LSM mechanism
 ought to be able to accomplish what we want. Some amount of the data
 structure (the most likely to be used) is held in DRAM [and backed to storage
 for restart] and the least likely to be used is flushed to storage with some
 mechanism that allows batched updates.

The LSM mechanism could solve the memory consumption problem. But I guess the 
process to choose which objects to evict is complex and inefficient. Also, 
after evicting some objects, we need to update the on-disk file to remove the 
entries of these objects. This is inefficient, too.

 
 Allen Samuels
 Software Architect, Systems and Software Solutions
 
 2880 Junction Avenue, San Jose, CA 95134
 T: +1 408 801 7030| M: +1 408 780 6416
 allen.samu...@sandisk.com
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Wednesday, July 22, 2015 5:57 AM
 To: Wang, Zhiqiang
 Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
 Subject: RE: The design of the eviction improvement
 
 On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
   The part that worries me now is the speed with which we can load and
   manage such a list.  Assuming it is several hundred MB, it'll take a
   while to load that into memory and set up all the pointers (assuming
   a conventional linked list structure).  Maybe tens of seconds...
 
  I'm thinking of maintaining the lists at the PG level. That's to say,
  we have an active/inactive list for every PG. We can load the lists in
  parallel during rebooting. Also, the ~100 MB lists are split among
  different OSD nodes. Perhaps it does not need such long time to load
  them?
 
  
   I wonder if instead we should construct some sort of flat model
   where we load slabs of contiguous memory, 10's of MB each, and have
   the next/previous pointers be a (slab,position) pair.  That way we
   can load it into memory in big chunks, quickly, and be able to
   operate on it (adjust links) immediately.
  
   Another thought: currently we use the hobject_t hash only instead of
   the full object name.  We could continue to do the same, or we could
   do a hash pair (hobject_t hash + a different hash of the rest of the
   object) to keep the representation compact.  With a model like the
   above, that could get the object representation down to 2 u32's.  A
   link could be a slab + position (2 more u32's), and if we have prev
   + next that'd be just 6x4=24 bytes per object.
 
  Looks like for an object, the head and the snapshot version have the
  same hobject hash. Thus we have to use the hash pair instead of just
  the hobject hash. But I still have two questions if we use the hash
  pair to represent an object.
 
  1) Does the hash pair uniquely identify an object? That's to say, is
  it possible for two objects to have the same hash pair?
 
 With two hashes, collisions would be rare but could happen.
 
  2) We need a way to get the full object name from the hash pair, so
  that we know what objects to evict. But seems like we don't have a
  good way to do this?
 
 Ah, yeah--I'm a little stuck in the current hitset view of things.  I think 
 we can
 either embed the full ghobject_t (which means we lose the fixed-size property,
 and the per-object overhead goes way up.. probably from ~24 bytes to more
 like 80 or 100).  Or, we can enumerate objects starting at the (hobject_t) 
 hash
 position to find the object.  That's somewhat inefficient for FileStore 
 (it'll list a
 directory of a hundred or so objects, probably, and iterate over them to find 
 the
 right one), but for NewStore it will be quite 

RE: The design of the eviction improvement

2015-07-22 Thread Wang, Zhiqiang
 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Thursday, July 23, 2015 2:51 AM
 To: Allen Samuels
 Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org
 Subject: RE: The design of the eviction improvement
 
 On Wed, 22 Jul 2015, Allen Samuels wrote:
  I'm very concerned about designing around the assumption that objects
  are ~1MB in size. That's probably a good assumption for block and HDFS
  dominated systems, but likely a very poor assumption about many object
  and file dominated systems.
 
  If I understand the proposals that have been discussed, each of them
  assumes an in-memory data structure with an entry per object (the
  exact size of the entry varies with the different proposals).
 
  Under that assumption, I have another concern which is the lack of
  graceful degradation as the object counts grow and the in-memory data
  structures get larger. Everything seems fine until just a few objects
  get added then the system starts to page and performance drops
  dramatically (likely) to the point where Linux will start killing OSDs.
 
  What's really needed is some kind of way to extend the lists into
  storage in a way that doesn't cause a zillion I/O operations.
 
  I have some vague idea that some data structure like the LSM mechanism
  ought to be able to accomplish what we want. Some amount of the data
  structure (the most likely to be used) is held in DRAM [and backed to
  storage for restart] and the least likely to be used is flushed to
  storage with some mechanism that allows batched updates.
 
 How about this:
 
 The basic mapping we want is object -> atime.
 
 We keep a simple LRU of the top N objects in memory with the object->atime
 values.  When an object is accessed, it is moved or added to the top of the 
 list.
 
 Periodically, or when the LRU size reaches N * (1.x), we flush:
 
  - write the top N items to a compact object that can be quickly loaded
  - write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb in a
 simple object -> atime fashion
 
 When the agent runs, we just walk across that key range of the db the same
 way we currently enumerate objects.  For each record we use either the
 stored atime or the value in the in-memory LRU (it'll need to be dual-indexed 
 by
 both a list and a hash map), whichever is newer.  We can use the same
 histogram estimation approach we do now to determine if the object in
 question is below the flush/evict threshold.

This looks similar to what we do now, except it keeps an LRU of the 
object->atime mapping in RAM instead of using a hitset. The object age 
calculated using the atime would be more accurate than the current hitset and 
mtime approach.

One comment is that I think if we can find a record of an object in the 
in-memory LRU list, we don't need to query the DB, since the atime in the LRU 
list is for sure newer than the one in the db (if it has one).

My concern with this approach is whether the eviction decision made by the 
histogram estimation approach is good enough. It only measures 'recency', and it 
makes the decision based on some threshold, not starting from the oldest. In 
contrast, most of the practical algorithms make the decision based on both 
'recency' and 'frequency' (accessed once recently vs. accessed twice or more 
recently).

If we believe the histogram estimation approach is good enough, I think we can 
easily integrate the idea above with 2Q (see the sketch after this list).
1) The in-memory LRU lists are the same as in 2Q, i.e., there are active/inactive 
lists, and the movements between the two lists are the same as what I stated in 
the original mail. But we have a limit on the size of the lists: only the top N 
hottest objects are kept in the lists.
2) When the size of the lists exceeds N*(1.x), evict the oldest items (N .. 
N*1.x) to the db store.
3) N could be dynamically adjusted based on the average size of objects in the 
PG.
4) Eviction decisions are made by the histogram estimation approach. Plus, I 
think we want to evict those objects which are not in the in-memory lists first.
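
A rough sketch of 1) and 2), purely for illustration (DbStore is a stand-in for 
the leveldb/rocksdb interface, and the promotion rule shown is just the generic 
2Q one rather than the exact movements from the original mail):

  // Illustration only: capped 2Q with active/inactive lists; overflow
  // beyond N*(1.x) is spilled to a kv store.
  #include <cstdint>
  #include <list>
  #include <string>
  #include <unordered_map>

  struct DbStore {                      // stand-in for leveldb/rocksdb
    virtual void put(const std::string& oid, uint64_t atime) = 0;
    virtual ~DbStore() {}
  };

  class TwoQ {
    enum Which { ACTIVE, INACTIVE };
    struct Node { std::string oid; uint64_t atime; Which which; };
    std::list<Node> active, inactive;   // front = most recently used
    std::unordered_map<std::string, std::list<Node>::iterator> index;
    size_t n; double slack; DbStore *db;
  public:
    TwoQ(size_t n, double slack, DbStore *db) : n(n), slack(slack), db(db) {}

    void access(const std::string& oid, uint64_t atime) {
      auto it = index.find(oid);
      if (it == index.end()) {
        // first recent access: goes onto the inactive list
        inactive.push_front({oid, atime, INACTIVE});
        index[oid] = inactive.begin();
      } else {
        // hit again: promote to (or refresh at) the top of the active list
        Node node = *it->second;
        (node.which == INACTIVE ? inactive : active).erase(it->second);
        node.which = ACTIVE;
        node.atime = atime;
        active.push_front(node);
        index[oid] = active.begin();
      }
      // keep only the ~N hottest objects in memory; spill the coldest
      while (active.size() + inactive.size() > (size_t)(n * slack)) {
        std::list<Node> &victim = inactive.empty() ? active : inactive;
        db->put(victim.back().oid, victim.back().atime);
        index.erase(victim.back().oid);
        victim.pop_back();
      }
    }
  };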

 
 The LSM does the work of sorting/compacting the atime info, while we avoid
 touching it at all for the hottest objects to keep the amount of work it has 
 to do
 in check.
 
 sage
 
 
  Allen Samuels
  Software Architect, Systems and Software Solutions
 
  2880 Junction Avenue, San Jose, CA 95134
  T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
 
  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
  Sent: Wednesday, July 22, 2015 5:57 AM
  To: Wang, Zhiqiang
  Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
  Subject: RE: The design of the eviction improvement
 
  On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
The part that worries me now is the speed with which we can load
and manage such a list.  Assuming it is several hundred MB, it'll
take a while to load that into memory and set up all the pointers

RE: building just src/tools/rados

2015-07-22 Thread Podoski, Igor
Hi Tom,

Have you tried cd src; make rados?

Regards,
Igor.


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Deneau, Tom
Sent: Wednesday, July 22, 2015 10:13 PM
To: ceph-devel
Subject: building just src/tools/rados

Is there a make command that would build just the src/tools or even just 
src/tools/rados ?

-- Tom Deneau



Re: About Fio backend with ObjectStore API

2015-07-22 Thread Casey Bodley
Hi Haomai,

Sorry for the late response, I was out of the office. I'm afraid I haven't run 
into that segfault. The io_ops should be set at the very beginning when it 
calls get_ioengine(). All I can suggest is that you verify that your job file 
is pointing to the correct fio_ceph_objectstore.so. If you've made any other 
interesting changes to the job file, could you share it here?

Casey

- Original Message -
From: Haomai Wang haomaiw...@gmail.com
To: Casey Bodley cbod...@gmail.com
Cc: Matt W. Benjamin m...@cohortfs.com, James (Fei) Liu-SSI 
james@ssi.samsung.com, ceph-devel@vger.kernel.org
Sent: Tuesday, July 21, 2015 7:50:32 AM
Subject: Re: About Fio backend with ObjectStore API

Hi Casey,

I checked your commits and know what you fixed. I cherry-picked your new
commits but I still hit the same problem.


It's strange that it always hits a segmentation fault when entering
_fio_setup_ceph_filestore_data; gdb tells me td->io_ops is NULL, but
when I go up the stack, td->io_ops is not null. Maybe it's related
to dlopen?


Do you have any hint about this?

On Thu, Jul 16, 2015 at 5:23 AM, Casey Bodley cbod...@gmail.com wrote:
 Hi Haomai,

 I was able to run this after a couple changes to the filestore.fio job
 file. Two of the config options were using the wrong names. I pushed a
 fix for the job file, as well as a patch that renames everything from
 filestore to objectstore (thanks James), to
 https://github.com/linuxbox2/linuxbox-ceph/commits/fio-objectstore.

 I found that the read support doesn't appear to work anymore, so give
 rw=write a try. And because it does a mkfs(), make sure you're
 pointing it to an empty xfs directory with the directory= option.

 Casey

 On Tue, Jul 14, 2015 at 2:45 AM, Haomai Wang haomaiw...@gmail.com wrote:
 Anyone who have successfully ran the fio with this external io engine
 ceph_objectstore?

  It's strange that it always hits a segmentation fault when entering
  _fio_setup_ceph_filestore_data; gdb tells me td->io_ops is NULL, but
  when I go up the stack, td->io_ops is not null. Maybe it's related
  to dlopen?

 On Fri, Jul 10, 2015 at 3:51 PM, Haomai Wang haomaiw...@gmail.com wrote:
 I have rebased the branch with master, and push it to ceph upstream
 repo. https://github.com/ceph/ceph/compare/fio-objectstore?expand=1

 Plz let me know if who is working on this. Otherwise, I would like to
 improve this to be merge ready.

 On Fri, Jul 10, 2015 at 4:26 AM, Matt W. Benjamin m...@cohortfs.com wrote:
 That makes sense.

 Matt

 - James (Fei) Liu-SSI james@ssi.samsung.com wrote:

 Hi Casey,
   Got it. I was directed to the old code base. By the way, since the
 test case is used to exercise all of the object stores, I strongly
 recommend changing the name from fio_ceph_filestore.cc to
 fio_ceph_objectstore.cc. And the code in fio_ceph_filestore.cc should
 be refactored to reflect that all of the object stores will be supported
 by fio_ceph_objectstore.cc. What do you think?

 Let me know if you need any help from my side.


 Regards,
 James



 -Original Message-
 From: Casey Bodley [mailto:cbod...@gmail.com]
 Sent: Thursday, July 09, 2015 12:32 PM
 To: James (Fei) Liu-SSI
 Cc: Haomai Wang; ceph-devel@vger.kernel.org
 Subject: Re: About Fio backend with ObjectStore API

 Hi James,

 Are you looking at the code from
 https://github.com/linuxbox2/linuxbox-ceph/tree/fio-objectstore? It
 uses ObjectStore::create() instead of new FileStore(). This allows us
 to exercise all of the object stores with the same code.

 Casey

 On Thu, Jul 9, 2015 at 2:01 PM, James (Fei) Liu-SSI
 james@ssi.samsung.com wrote:
  Hi Casey,
 Here is the code in fio_ceph_filestore.cc. Basically, it
  creates a filestore as the backend engine for IO exercises. If we want to
  send IO commands to KeyValueStore or NewStore, we have to change the
  code accordingly, right?  I did not see any other files like
  fio_ceph_keyvaluestore.cc or fio_ceph_newstore.cc. In my humble
  opinion, we might need to create two other fio engines for
  keyvaluestore and newstore if we want to exercise these two, right?
 
  Regards,
  James
 
   static int fio_ceph_filestore_init(struct thread_data *td)
   {
       vector<const char*> args;
       struct ceph_filestore_data *ceph_filestore_data =
           (struct ceph_filestore_data *) td->io_ops->data;
       ObjectStore::Transaction ft;

       global_init(NULL, args, CEPH_ENTITY_TYPE_OSD,
                   CODE_ENVIRONMENT_UTILITY, 0);
       //g_conf->journal_dio = false;
       common_init_finish(g_ceph_context);
       //g_ceph_context->_conf->set_val("debug_filestore", "20");
       //g_ceph_context->_conf->set_val("debug_throttle", "20");
       g_ceph_context->_conf->apply_changes(NULL);

       ceph_filestore_data->osd_path =
           strdup("/mnt/fio_ceph_filestore.XXX");
       ceph_filestore_data->journal_path =
           strdup("/var/lib/ceph/osd/journal-ram/fio_ceph_filestore.XXX");

       if (!mkdtemp(ceph_filestore_data->osd_path)) {
           cout <<

RE: The design of the eviction improvement

2015-07-22 Thread Sage Weil
On Wed, 22 Jul 2015, Allen Samuels wrote:
 Don't we need to double-index the data structure?
 
 We need it indexed by atime for the purposes of eviction, but we need it 
 indexed by object name for the purposes of updating the list upon a 
 usage.

If you use the same approach the agent uses now (iterate over items, 
evict/trim anything in the bottom end of the observed age distribution) you can 
get away without the double-index.  Iterating over the LSM should be quite 
cheap.  I'd be more worried about the cost of the insertions.
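
For illustration, that single-index pass could be as simple as the following 
sketch (not the actual agent code):

  // Sketch only: derive an atime cutoff from a sample of observed atimes,
  // then evict anything older during a plain iteration over the records;
  // no atime-sorted index is required.
  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // atime below which roughly evict_fraction of the sampled objects fall
  inline uint64_t atime_cutoff(std::vector<uint64_t> atimes,
                               double evict_fraction) {
    if (atimes.empty()) return 0;
    size_t k = std::min(atimes.size() - 1,
                        (size_t)(atimes.size() * evict_fraction));
    std::nth_element(atimes.begin(), atimes.begin() + k, atimes.end());
    return atimes[k];
  }

  // during the walk: older (smaller) effective atime => flush/evict candidate
  inline bool should_evict(uint64_t effective_atime, uint64_t cutoff) {
    return effective_atime < cutoff;
  }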

I'm also not sure the simplistic approach below can be generalized to 
something like 2Q (and certainly not something like MQ).  Maybe...

On the other hand, I'm not sure it is the end of the world if at the end 
of the day the memory requirements for a cache-tier OSD are higher and 
inversely proportional to the object size.  We can make the OSD 
flush/evict more aggressively if the memory utilization (due to a high 
object count) gets out of hand as a safety mechanism.  Paying a few extra 
$$ for RAM isn't the end of the world I'm guessing when the performance 
payoff is significant...

sage


  
 
 
 
 Allen Samuels
 Software Architect, Systems and Software Solutions 
 
 2880 Junction Avenue, San Jose, CA 95134
 T: +1 408 801 7030| M: +1 408 780 6416
 allen.samu...@sandisk.com
 
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Wednesday, July 22, 2015 11:51 AM
 To: Allen Samuels
 Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org
 Subject: RE: The design of the eviction improvement
 
 On Wed, 22 Jul 2015, Allen Samuels wrote:
  I'm very concerned about designing around the assumption that objects 
  are ~1MB in size. That's probably a good assumption for block and HDFS 
  dominated systems, but likely a very poor assumption about many object 
  and file dominated systems.
  
  If I understand the proposals that have been discussed, each of them 
  assumes in in-memory data structure with an entry per object (the 
  exact size of the entry varies with the different proposals).
  
  Under that assumption, I have another concern which is the lack of 
  graceful degradation as the object counts grow and the in-memory data 
  structures get larger. Everything seems fine until just a few objects 
  get added then the system starts to page and performance drops 
  dramatically (likely) to the point where Linux will start killing OSDs.
  
  What's really needed is some kind of way to extend the lists into 
  storage in way that's doesn't cause a zillion I/O operations.
  
  I have some vague idea that some data structure like the LSM mechanism 
  ought to be able to accomplish what we want. Some amount of the data 
  structure (the most likely to be used) is held in DRAM [and backed to 
  storage for restart] and the least likely to be used is flushed to 
  storage with some mechanism that allows batched updates.
 
 How about this:
 
 The basic mapping we want is object - atime.
 
 We keep a simple LRU of the top N objects in memory with the object-atime 
 values.  When an object is accessed, it is moved or added to the top of the 
 list.
 
 Periodically, or when the LRU size reaches N * (1.x), we flush:
 
  - write the top N items to a compact object that can be quickly loaded
  - write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb in 
 a simple object - atime fashion
 
 When the agent runs, we just walk across that key range of the db the same 
 way we currently enumerate objects.  For each record we use either the stored 
 atime or the value in the in-memory LRU (it'll need to be dual-indexed by 
 both a list and a hash map), whichever is newer.  We can use the same 
 histogram estimation approach we do now to determine if the object in 
 question is below the flush/evict threshold.
 
 The LSM does the work of sorting/compacting the atime info, while we avoid 
 touching it at all for the hottest objects to keep the amount of work it has 
 to do in check.
 
 sage
 
  
  Allen Samuels
  Software Architect, Systems and Software Solutions
  
  2880 Junction Avenue, San Jose, CA 95134
  T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
  
  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org 
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
  Sent: Wednesday, July 22, 2015 5:57 AM
  To: Wang, Zhiqiang
  Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
  Subject: RE: The design of the eviction improvement
  
  On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
The part that worries me now is the speed with which we can load 
and manage such a list.  Assuming it is several hundred MB, it'll 
take a while to load that into memory and set up all the pointers 
(assuming a conventional linked list structure).  Maybe tens of 
seconds...
  
   I'm thinking of maintaining the lists at the PG level. That's to 
   

RE: The design of the eviction improvement

2015-07-22 Thread Allen Samuels
Yes, the cost of the insertions with the current scheme is probably prohibitive. 
Wouldn't it approach the same amount of time as just having atime turned on in 
the file system? 

My concern about the memory is mostly that we ensure whatever algorithm is 
selected degrades gracefully when you get high counts of small objects. I agree 
that paying $ for RAM that translates into actual performance isn't really a 
problem. It really boils down to your workload and access pattern.
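
For what it's worth, the dual-indexed structure Sage describes (a list ordered
by recency plus a hash map into it) keeps each touch at O(1), so the in-memory
half of the insertion cost should stay small; a minimal sketch, with key types
and sizes chosen only for illustration:

  #include <cstdint>
  #include <list>
  #include <string>
  #include <unordered_map>
  #include <utility>

  // Hottest-N objects, most recently used at the front.  The hash map points
  // into the list so a touch is O(1): unlink, update atime, relink at front.
  class HotObjectLRU {
    typedef std::pair<std::string, uint64_t> entry_t;   // (object, atime)
    std::list<entry_t> order;
    std::unordered_map<std::string, std::list<entry_t>::iterator> index;
    size_t max_entries;

  public:
    explicit HotObjectLRU(size_t n) : max_entries(n) {}

    void touch(const std::string &oid, uint64_t atime) {
      auto p = index.find(oid);
      if (p != index.end())
        order.erase(p->second);                  // drop the old position
      order.push_front(std::make_pair(oid, atime));
      index[oid] = order.begin();
    }

    // Oldest entries past the high-water mark; these are the ones a periodic
    // flush would push down into leveldb/rocksdb as object->atime records.
    bool pop_cold(entry_t *out) {
      if (order.size() <= max_entries)
        return false;
      *out = order.back();
      index.erase(out->first);
      order.pop_back();
      return true;
    }

    size_t size() const { return order.size(); }
  };

The flush step would then drain pop_cold() into the key/value store in
batches, which is where the LSM absorbs the sorting and compaction work.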


Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: Sage Weil [mailto:sw...@redhat.com] 
Sent: Wednesday, July 22, 2015 2:53 PM
To: Allen Samuels
Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: RE: The design of the eviction improvement

On Wed, 22 Jul 2015, Allen Samuels wrote:
 Don't we need to double-index the data structure?
 
 We need it indexed by atime for the purposes of eviction, but we need 
 it indexed by object name for the purposes of updating the list upon a 
 usage.

If you use the same approach the agent uses now (iterate over items, evict/trim 
anything in bottom end of observed age distribution) you can get away without 
the double-index.  Iterating over the LSM should be quite cheap.  I'd be more 
worried about the cost of the insertions.

I'm also not sure the simplistic approach below can be generalized to something 
like 2Q (and certainly not something like MQ).  Maybe...

On the other hand, I'm not sure it is the end of the world if at the end of the 
day the memory requirements for a cache-tier OSD are higher and inversely 
proportional to the object size.  We can make the OSD flush/evict more 
aggressively if the memory utilization (due to a high object count) gets out of 
hand as a safety mechanism.  Paying a few extra $$ for RAM isn't the end of the 
world I'm guessing when the performance payoff is significant...

sage


  
 
 
 
 Allen Samuels
 Software Architect, Systems and Software Solutions
 
 2880 Junction Avenue, San Jose, CA 95134
 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
 
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Wednesday, July 22, 2015 11:51 AM
 To: Allen Samuels
 Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org
 Subject: RE: The design of the eviction improvement
 
 On Wed, 22 Jul 2015, Allen Samuels wrote:
  I'm very concerned about designing around the assumption that 
  objects are ~1MB in size. That's probably a good assumption for 
  block and HDFS dominated systems, but likely a very poor assumption 
  about many object and file dominated systems.
  
  If I understand the proposals that have been discussed, each of them 
  assumes in in-memory data structure with an entry per object (the 
  exact size of the entry varies with the different proposals).
  
  Under that assumption, I have another concern which is the lack of 
  graceful degradation as the object counts grow and the in-memory 
  data structures get larger. Everything seems fine until just a few 
  objects get added then the system starts to page and performance 
  drops dramatically (likely) to the point where Linux will start killing 
  OSDs.
  
  What's really needed is some kind of way to extend the lists into 
  storage in way that's doesn't cause a zillion I/O operations.
  
  I have some vague idea that some data structure like the LSM 
  mechanism ought to be able to accomplish what we want. Some amount 
  of the data structure (the most likely to be used) is held in DRAM 
  [and backed to storage for restart] and the least likely to be used 
  is flushed to storage with some mechanism that allows batched updates.
 
 How about this:
 
 The basic mapping we want is object - atime.
 
 We keep a simple LRU of the top N objects in memory with the object-atime 
 values.  When an object is accessed, it is moved or added to the top of the 
 list.
 
 Periodically, or when the LRU size reaches N * (1.x), we flush:
 
  - write the top N items to a compact object that can be quickly 
 loaded
  - write our records for the oldest items (N .. N*1.x) to 
 leveldb/rocksdb in a simple object - atime fashion
 
 When the agent runs, we just walk across that key range of the db the same 
 way we currently enumerate objects.  For each record we use either the stored 
 atime or the value in the in-memory LRU (it'll need to be dual-indexed by 
 both a list and a hash map), whichever is newer.  We can use the same 
 histogram estimation approach we do now to determine if the object in 
 question is below the flush/evict threshold.
 
 The LSM does the work of sorting/compacting the atime info, while we avoid 
 touching it at all for the hottest objects to keep the amount of work it has 
 to do in check.
 
 sage
 
  
  

Re: About Fio backend with ObjectStore API

2015-07-22 Thread Haomai Wang
Nothing special:

[global]
#logging
#write_iops_log=write_iops_log
#write_bw_log=write_bw_log
#write_lat_log=write_lat_log
ioengine=./ceph-int/src/.libs/libfio_ceph_objectstore.so
invalidate=0 # mandatory
rw=write
#bs=4k

[filestore]
iodepth=1
# create a journaled filestore
objectstore=filestore
directory=./osd/
filestore_journal=./osd/journal
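
For anyone chasing the td->io_ops question below: an external fio engine is a
shared object that exports a get_ioengine() hook, which fio dlopen()s and
calls to fill in the ioengine_ops table before any setup/init callbacks run.
A bare-bones sketch of that hook (the callback names and flags here are
illustrative, not necessarily what fio_ceph_objectstore.cc uses):

  extern "C" {
  #include "fio.h"          // struct thread_data, struct ioengine_ops
  }

  // Engine callbacks, bodies omitted here.
  static int  os_setup(struct thread_data *td);    // allocates td->io_ops->data
  static int  os_init(struct thread_data *td);
  static int  os_queue(struct thread_data *td, struct io_u *io_u);
  static void os_cleanup(struct thread_data *td);

  static struct ioengine_ops ioengine;

  // fio dlopen()s the .so named by ioengine= and looks up this symbol.
  extern "C" void get_ioengine(struct ioengine_ops **ioengine_ptr)
  {
    *ioengine_ptr = &ioengine;
    ioengine.name    = "ceph_objectstore";
    ioengine.version = FIO_IOOPS_VERSION;
    ioengine.flags   = FIO_SYNCIO | FIO_DISKLESSIO;  // no real files underneath
    ioengine.setup   = os_setup;
    ioengine.init    = os_init;
    ioengine.queue   = os_queue;
    ioengine.cleanup = os_cleanup;
  }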

On Thu, Jul 23, 2015 at 4:56 AM, Casey Bodley cbod...@redhat.com wrote:
 Hi Haomai,

 Sorry for the late response, I was out of the office. I'm afraid I haven't 
 run into that segfault. The io_ops should be set at the very beginning when 
 it calls get_ioengine(). All I can suggest is that you verify that your job 
 file is pointing to the correct fio_ceph_objectstore.so. If you've made any 
 other interesting changes to the job file, could you share it here?

 Casey

 - Original Message -
 From: Haomai Wang haomaiw...@gmail.com
 To: Casey Bodley cbod...@gmail.com
 Cc: Matt W. Benjamin m...@cohortfs.com, James (Fei) Liu-SSI 
 james@ssi.samsung.com, ceph-devel@vger.kernel.org
 Sent: Tuesday, July 21, 2015 7:50:32 AM
 Subject: Re: About Fio backend with ObjectStore API

 Hi Casey,

 I checked your commits and see what you fixed. I cherry-picked your new
 commits, but I still hit the same problem.

 
 It's strange that it always hits a segmentation fault when entering
 _fio_setup_ceph_filestore_data; gdb says td->io_ops is NULL, but
 when I go up the stack, td->io_ops is not null. Maybe it's related
 to dlopen?
 

 Do you have any hint about this?

 On Thu, Jul 16, 2015 at 5:23 AM, Casey Bodley cbod...@gmail.com wrote:
 Hi Haomai,

 I was able to run this after a couple changes to the filestore.fio job
 file. Two of the config options were using the wrong names. I pushed a
 fix for the job file, as well as a patch that renames everything from
 filestore to objectstore (thanks James), to
 https://github.com/linuxbox2/linuxbox-ceph/commits/fio-objectstore.

 I found that the read support doesn't appear to work anymore, so give
 rw=write a try. And because it does a mkfs(), make sure you're
 pointing it to an empty xfs directory with the directory= option.

 Casey

 On Tue, Jul 14, 2015 at 2:45 AM, Haomai Wang haomaiw...@gmail.com wrote:
 Has anyone successfully run fio with this external io engine,
 ceph_objectstore?

 It's strange that it always hits a segmentation fault when entering
 _fio_setup_ceph_filestore_data; gdb says td->io_ops is NULL, but
 when I go up the stack, td->io_ops is not null. Maybe it's related
 to dlopen?

 On Fri, Jul 10, 2015 at 3:51 PM, Haomai Wang haomaiw...@gmail.com wrote:
 I have rebased the branch onto master and pushed it to the ceph upstream
 repo: https://github.com/ceph/ceph/compare/fio-objectstore?expand=1

 Please let me know if anyone is working on this. Otherwise, I would like to
 improve it to be merge-ready.

 On Fri, Jul 10, 2015 at 4:26 AM, Matt W. Benjamin m...@cohortfs.com 
 wrote:
 That makes sense.

 Matt

 - James (Fei) Liu-SSI james@ssi.samsung.com wrote:

 Hi Casey,
   Got it. I was directed to the old code base. By the way, since the
 test case is used to exercise all of the object stores, I strongly
 recommend changing the name from fio_ceph_filestore.cc to
 fio_ceph_objectstore.cc, and refactoring the code to reflect that the
 whole ObjectStore family will be supported by
 fio_ceph_objectstore.cc. What do you think?

 Let me know if you need any help from my side.


 Regards,
 James



 -Original Message-
 From: Casey Bodley [mailto:cbod...@gmail.com]
 Sent: Thursday, July 09, 2015 12:32 PM
 To: James (Fei) Liu-SSI
 Cc: Haomai Wang; ceph-devel@vger.kernel.org
 Subject: Re: About Fio backend with ObjectStore API

 Hi James,

 Are you looking at the code from
 https://github.com/linuxbox2/linuxbox-ceph/tree/fio-objectstore? It
 uses ObjectStore::create() instead of new FileStore(). This allows us
 to exercise all of the object stores with the same code.

 Casey

 On Thu, Jul 9, 2015 at 2:01 PM, James (Fei) Liu-SSI
 james@ssi.samsung.com wrote:
  Hi Casey,
Here is the code in fio_ceph_filestore.cc. Basically, it
 creates a FileStore as the backend engine for IO exercises. If we want to
 send IO commands to KeyValueStore or NewStore, we have to change the
 code accordingly, right?  I did not see any other files like
 fio_ceph_keyvaluestore.cc or fio_ceph_newstore.cc. In my humble
 opinion, we might need to create two other fio engines for
 keyvaluestore and newstore if we want to exercise those two, right?
 
  Regards,
  James
 
  static int fio_ceph_filestore_init(struct thread_data *td)
  {
      vector<const char*> args;
      struct ceph_filestore_data *ceph_filestore_data =
          (struct ceph_filestore_data *) td->io_ops->data;
      ObjectStore::Transaction ft;

      global_init(NULL, args, CEPH_ENTITY_TYPE_OSD, CODE_ENVIRONMENT_UTILITY, 0);
      //g_conf->journal_dio = false;
      common_init_finish(g_ceph_context);

07/22/2015 Weekly Ceph Performance Meeting IS ON!

2015-07-22 Thread Mark Nelson
8AM PST as usual!  Topics today include a new ceph_test_rados benchmark 
being added to CBT.  Please feel free to add your own!


Here are the links:

Etherpad URL:
http://pad.ceph.com/p/performance_weekly

To join the Meeting:
https://bluejeans.com/268261044

To join via Browser:
https://bluejeans.com/268261044/browser

To join with Lync:
https://bluejeans.com/268261044/lync


To join via Room System:
Video Conferencing System: bjn.vc -or- 199.48.152.152
Meeting ID: 268261044

To join via Phone:
1) Dial:
  +1 408 740 7256
  +1 888 240 2560(US Toll Free)
  +1 408 317 9253(Alternate Number)
  (see all numbers - http://bluejeans.com/numbers)
2) Enter Conference ID: 268261044

Mark
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html