Re: Breaks Replaces in debian/control in backports
On 07/19/2015 05:28 AM, Loic Dachary wrote:

> I think it achieves the same thing and is less error prone in the case of backports. The risk is that upgrading from v0.94.2-34 to the version with this change will fail because the conditions are satisfied (it thinks all versions after v0.94.2 have the change). But the odds of having a test machine with this specific version already installed are close to non-existent. The odds of us picking the wrong number and ending up with something that's either too high or too low are higher. What do you think?

I think this is great, thanks for proposing it. We should also write our convention down someplace (SubmittingPatches, or the wiki, or something)

- Ken

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
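For readers following along, the Breaks/Replaces stanza being discussed lives in debian/control and looks roughly like this (package name and version below are purely illustrative, not the values the thread is choosing):

```
Package: ceph-common
Breaks: ceph (<< 0.94.2-35)
Replaces: ceph (<< 0.94.2-35)
```

The `<<` comparison is what makes the picked version number matter: any installed version below it is treated as lacking the change.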
RE: The design of the eviction improvement
On Wed, 22 Jul 2015, Allen Samuels wrote:

> I'm very concerned about designing around the assumption that objects are ~1MB in size. That's probably a good assumption for block- and HDFS-dominated systems, but likely a very poor assumption about many object- and file-dominated systems.
>
> If I understand the proposals that have been discussed, each of them assumes an in-memory data structure with an entry per object (the exact size of the entry varies with the different proposals). Under that assumption, I have another concern, which is the lack of graceful degradation as the object counts grow and the in-memory data structures get larger. Everything seems fine until just a few objects get added, then the system starts to page and performance drops dramatically, likely to the point where Linux will start killing OSDs.
>
> What's really needed is some kind of way to extend the lists into storage in a way that doesn't cause a zillion I/O operations. I have some vague idea that some data structure like the LSM mechanism ought to be able to accomplish what we want. Some amount of the data structure (the most likely to be used) is held in DRAM [and backed to storage for restart] and the least likely to be used is flushed to storage with some mechanism that allows batched updates.

How about this: The basic mapping we want is object -> atime. We keep a simple LRU of the top N objects in memory with the object->atime values. When an object is accessed, it is moved or added to the top of the list. Periodically, or when the LRU size reaches N * (1.x), we flush:

- write the top N items to a compact object that can be quickly loaded
- write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb in a simple object -> atime fashion

When the agent runs, we just walk across that key range of the db the same way we currently enumerate objects. For each record we use either the stored atime or the value in the in-memory LRU (it'll need to be dual-indexed by both a list and a hash map), whichever is newer. We can use the same histogram estimation approach we do now to determine if the object in question is below the flush/evict threshold. The LSM does the work of sorting/compacting the atime info, while we avoid touching it at all for the hottest objects to keep the amount of work it has to do in check.

sage

[quoted text from earlier in the thread trimmed]
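To make the shape of this concrete, here is a hypothetical Python sketch of the dual-indexed LRU with the N / N*(1.x) flush described in Sage's proposal. The class and field names are mine, and a plain dict stands in for the leveldb/rocksdb instance that would hold the cold records:

```python
from collections import OrderedDict

class AtimeLRU:
    """Sketch of the proposed object -> atime LRU with periodic flush.

    Illustrative only: 'backing' is a plain dict standing in for the
    leveldb/rocksdb store that would hold the flushed cold records.
    """
    def __init__(self, n, slack=0.5):
        self.n = n                       # keep the hottest N in memory
        self.max = int(n * (1 + slack))  # flush trigger: N * (1.x)
        self.lru = OrderedDict()         # dual index: hash map + recency order
        self.backing = {}                # cold (N .. N*1.x) records

    def access(self, obj, atime):
        # move-to-front on access (newest entries live at the end here)
        self.lru.pop(obj, None)
        self.lru[obj] = atime
        if len(self.lru) > self.max:
            self.flush()

    def flush(self):
        # spill the oldest records to the backing store, keep the hottest N
        while len(self.lru) > self.n:
            obj, atime = self.lru.popitem(last=False)
            self.backing[obj] = atime
```

An OrderedDict conveniently provides both indexes mentioned in the proposal (hash-map lookup plus recency order) in one structure; a C++ version would pair an intrusive list with an unordered_map.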
building just src/tools/rados
Is there a make command that would build just the src/tools or even just src/tools/rados ?

--
Tom Deneau
RE: The design of the eviction improvement
I'm very concerned about designing around the assumption that objects are ~1MB in size. That's probably a good assumption for block- and HDFS-dominated systems, but likely a very poor assumption about many object- and file-dominated systems.

If I understand the proposals that have been discussed, each of them assumes an in-memory data structure with an entry per object (the exact size of the entry varies with the different proposals). Under that assumption, I have another concern, which is the lack of graceful degradation as the object counts grow and the in-memory data structures get larger. Everything seems fine until just a few objects get added, then the system starts to page and performance drops dramatically, likely to the point where Linux will start killing OSDs.

What's really needed is some kind of way to extend the lists into storage in a way that doesn't cause a zillion I/O operations. I have some vague idea that some data structure like the LSM mechanism ought to be able to accomplish what we want. Some amount of the data structure (the most likely to be used) is held in DRAM [and backed to storage for restart] and the least likely to be used is flushed to storage with some mechanism that allows batched updates.

Allen Samuels
Software Architect, Systems and Software Solutions
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samu...@sandisk.com

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Wednesday, July 22, 2015 5:57 AM
To: Wang, Zhiqiang
Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: RE: The design of the eviction improvement

On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:

> > > > The part that worries me now is the speed with which we can load and manage such a list. Assuming it is several hundred MB, it'll take a while to load that into memory and set up all the pointers (assuming a conventional linked list structure). Maybe tens of seconds...
> > >
> > > I'm thinking of maintaining the lists at the PG level. That's to say, we have an active/inactive list for every PG. We can load the lists in parallel during rebooting. Also, the ~100 MB lists are split among different OSD nodes. Perhaps it does not need such a long time to load them?
> >
> > I wonder if instead we should construct some sort of flat model where we load slabs of contiguous memory, 10's of MB each, and have the next/previous pointers be a (slab, position) pair. That way we can load it into memory in big chunks, quickly, and be able to operate on it (adjust links) immediately.
> >
> > Another thought: currently we use the hobject_t hash only instead of the full object name. We could continue to do the same, or we could do a hash pair (hobject_t hash + a different hash of the rest of the object) to keep the representation compact. With a model like the above, that could get the object representation down to 2 u32's. A link could be a slab + position (2 more u32's), and if we have prev + next that'd be just 6x4=24 bytes per object.
>
> Looks like for an object, the head and the snapshot version have the same hobject hash. Thus we have to use the hash pair instead of just the hobject hash. But I still have two questions if we use the hash pair to represent an object.
>
> 1) Does the hash pair uniquely identify an object? That's to say, is it possible for two objects to have the same hash pair? With two hashes, collisions would be rare but could happen.
>
> 2) We need a way to get the full object name from the hash pair, so that we know what objects to evict. But it seems like we don't have a good way to do this?

Ah, yeah--I'm a little stuck in the current hitset view of things. I think we can either embed the full ghobject_t (which means we lose the fixed-size property, and the per-object overhead goes way up.. probably from ~24 bytes to more like 80 or 100). Or, we can enumerate objects starting at the (hobject_t) hash position to find the object. That's somewhat inefficient for FileStore (it'll list a directory of a hundred or so objects, probably, and iterate over them to find the right one), but for NewStore it will be quite fast (NewStore has all objects sorted into keys in rocksdb, so we just start listing at the right offset). Usually we'll get the object right off, unless there are hobject_t hash collisions (already reasonably rare since it's a 2^32 space for the pool). Given that, I would lean toward the 2-hash fixed-sized records (of these 2 options)...

sage
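The hash-pair uniqueness question raised above can be quantified with a birthday-bound estimate. A rough sketch, assuming two independent, uniformly distributed 32-bit hashes (so a 64-bit combined space); the function name is illustrative:

```python
import math

def hash_pair_collision_probability(n_objects, bits=64):
    # Birthday-problem approximation for n items in a 2^bits space:
    #   p ~= 1 - exp(-n^2 / 2^(bits+1))
    return 1 - math.exp(-float(n_objects) ** 2 / 2.0 ** (bits + 1))
```

Even a pool with a million objects would see a collision probability on the order of 1e-8 under these assumptions, which supports the "rare but could happen" characterization, while a single 32-bit hash at that scale would collide almost certainly.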
Re: The design of the eviction improvement
Hi,

- Allen Samuels allen.samu...@sandisk.com wrote:

> I'm very concerned about designing around the assumption that objects are ~1MB in size. That's probably a good assumption for block- and HDFS-dominated systems, but likely a very poor assumption about many object- and file-dominated systems.

++

> If I understand the proposals that have been discussed, each of them assumes an in-memory data structure with an entry per object (the exact size of the entry varies with the different proposals). Under that assumption, I have another concern, which is the lack of graceful degradation as the object counts grow and the in-memory data structures get larger. Everything seems fine until just a few objects get added, then the system starts to page and performance drops dramatically, likely to the point where Linux will start killing OSDs.

I'm not clear why that needs to be the case (but I don't think it matters just now whether I do; I was just letting folks know that we have MQ implementation(s)), but what you're describing seems consistent with the model Sage and Greg, at least, are describing.

Matt

> What's really needed is some kind of way to extend the lists into storage in a way that doesn't cause a zillion I/O operations. I have some vague idea that some data structure like the LSM mechanism ought to be able to accomplish what we want. Some amount of the data structure (the most likely to be used) is held in DRAM [and backed to storage for restart] and the least likely to be used is flushed to storage with some mechanism that allows batched updates.
[quoted text from earlier in the thread trimmed]
RE: The design of the eviction improvement
Don't we need to double-index the data structure? We need it indexed by atime for the purposes of eviction, but we need it indexed by object name for the purposes of updating the list upon a usage.

Allen Samuels
Software Architect, Systems and Software Solutions
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samu...@sandisk.com

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Wednesday, July 22, 2015 11:51 AM
To: Allen Samuels
Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: RE: The design of the eviction improvement

[quoted text from earlier in the thread trimmed]
Re: upstream/firefly exporting the same snap 2 times results in different exports
On 21.07.2015 at 22:50, Josh Durgin wrote:

> Yes, I'm afraid it sounds like it is. You can double check whether the watch exists on an image by getting the id of the image from 'rbd info $pool/$image | grep block_name_prefix':
>
> block_name_prefix: rbd_data.105674b0dc51
>
> The id is the hex number there. Append that to 'rbd_header.' and you have the header object name. Check whether it has watchers with:
>
> rados listwatchers -p $pool rbd_header.105674b0dc51
>
> If that doesn't show any watchers while the image is in use by a vm, it's #9806.

Yes, it does not show any watchers.

> I just merged the backport for firefly, so it'll be in 0.80.11. Sorry it took so long to get to firefly :(. We'll need to be more vigilant about checking non-trivial backports when we're going through all the bugs periodically.

That would be really important. I've seen that this one was already in upstream/firefly-backports. What's the purpose of that branch?

Greets,
Stefan

> Josh
>
> On 07/21/2015 12:52 PM, Stefan Priebe wrote:
> > So this is really this old bug? http://tracker.ceph.com/issues/9806
> >
> > Stefan
> >
> > On 21.07.2015 at 21:46, Josh Durgin wrote:
> > > On 07/21/2015 12:22 PM, Stefan Priebe wrote:
> > > > On 21.07.2015 at 19:19, Jason Dillaman wrote:
> > > > > Does this still occur if you export the images to the console (i.e. rbd export cephstor/disk-116@snap - dump_file)? Would it be possible for you to provide logs from the two rbd export runs on your smallest VM image? If so, please add the following to the [client] section of your ceph.conf:
> > > > >
> > > > > log file = /valid/path/to/logs/$name.$pid.log
> > > > > debug rbd = 20
> > > > >
> > > > > I opened a ticket [1] where you can attach the logs (if they aren't too large).
> > > > >
> > > > > [1] http://tracker.ceph.com/issues/12422
> > > >
> > > > Will post some more details to the tracker in a few hours. It seems it is related to using discard inside the guest but not on the FS the osd is on.
> > >
> > > That sounds very odd. Could you verify via 'rados listwatchers' on an in-use rbd image's header object that there's still a watch established?
Have you increased pgs in all those clusters recently?

Josh
Re: quick way to rebuild deb packages
Hi,

Did you try https://github.com/ceph/ceph/blob/master/make-debs.sh ? I would recommend running https://github.com/ceph/ceph/blob/master/run-make-check.sh first to make sure you can build and test: this will install the dependencies you're missing at the same time.

Cheers

On 21/07/2015 18:15, Bartłomiej Święcki wrote:

> Hi all,
>
> I'm currently working on a test environment for ceph where we're using deb files to deploy new versions on a test cluster. To make this work efficiently I'd have to quickly build deb packages. I tried dpkg-buildpackage -nc, which should keep the results of the previous build, but it ends up in a linking error:
>
>   CXXLD  ceph_rgw_jsonparser
>   ./.libs/libglobal.a(json_spirit_reader.o): In function `~thread_specific_ptr':
>   /usr/include/boost/thread/tss.hpp:79: undefined reference to `boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
>   [the same undefined reference repeated four more times]
>   ./.libs/libglobal.a(json_spirit_reader.o):/usr/include/boost/thread/tss.hpp:79: more undefined references to `boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)' follow
>   ./.libs/libglobal.a(json_spirit_reader.o): In function `call_once<void (*)()>':
>   ...
>
> Any ideas on what could go wrong here?
> Version I'm compiling is v0.94.1 but I've observed same results with 9.0.1.

--
Loïc Dachary, Artisan Logiciel Libre
RE: quick way to rebuild deb packages
I'm also using make-debs.sh to generate the binaries for some local deployment. Note that if you need the *tests.deb you'll need to change this script a bit:

@@ -58,8 +58,8 @@ tar -C $releasedir -zxf $releasedir/ceph_$vers.orig.tar.gz
 #
 cp -a debian $releasedir/ceph-$vers/debian
 cd $releasedir
-perl -ni -e 'print if(!(/^Package: .*-dbg$/../^$/))' ceph-$vers/debian/control
-perl -pi -e 's/--dbg-package.*//' ceph-$vers/debian/rules
+#perl -ni -e 'print if(!(/^Package: .*-dbg$/../^$/))' ceph-$vers/debian/control
+#perl -pi -e 's/--dbg-package.*//' ceph-$vers/debian/rules
 #
 # always set the debian version to 1 which is ok because the debian
 # directory is included in the sources and the upstream version will

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Loic Dachary
Sent: Wednesday, July 22, 2015 2:32 PM
To: Bartłomiej Święcki; ceph-devel@vger.kernel.org
Subject: Re: quick way to rebuild deb packages

[quoted text from earlier in the thread trimmed]
Inactive PGs should trigger a HEALTH_ERR state
Hi,

I was just testing with a cluster on VMs and I noticed that undersized+degraded+peering PGs do not trigger a HEALTH_ERR state. Why is that?

In my opinion any PG which is not active+? should trigger a HEALTH_ERR state, since I/O is blocking at that point.

Is that a sane thing to do or am I missing something?

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant
Phone: +31 (0)20 700 9902
Skype: contact42on
Re: upstream/firefly exporting the same snap 2 times results in different exports
On 2015-07-22 09:03, Stefan Priebe - Profihost AG wrote:

> That would be really important. I've seen that this one was already in upstream/firefly-backports. What's the purpose of that branch?

That is where the Stable Releases and Backports team stages backports and does integration testing on them before they are merged into the 'firefly' named branch.

--
Nathan Cutler
Software Engineer Distributed Storage
SUSE LINUX, s.r.o.
Tel.: +420 284 084 037
Re: quick way to rebuild deb packages
I'll definitely take a look at make-debs.sh, looks promising. Thanks for the hint. I can see it's using ccache, let's see how fast it is :) What build times are you experiencing ?

On Wed, 22 Jul 2015 08:04:44 +0000, Zhou, Yuan yuan.z...@intel.com wrote:

[quoted text from earlier in the thread trimmed]

--
Bartlomiej Swiecki
bartlomiej.swie...@corp.ovh.com
Re: Inactive PGs should trigger a HEALTH_ERR state
On Wed, 22 Jul 2015, Wido den Hollander wrote: Hi, I was just testing with a cluster on VMs and I noticed that undersized+degraded+peering PGs do not trigger a HEALTH_ERR state. Why is that? In my opinion any PG which is not active+? should trigger a HEALTH_ERR state since I/O is blocking at that point. Is that a sane thing to do or am I missing something? IIRC they trigger a WARN state until they are 'stuck' inactive, at which point they trigger an ERR state. The idea is that it is totally normal for PGs to be in an inactive state for short periods due to normal cluster churn--it's only problematic if they get stuck there. sage
RE: The design of the eviction improvement
On Wed, 22 Jul 2015, Wang, Zhiqiang wrote: The part that worries me now is the speed with which we can load and manage such a list. Assuming it is several hundred MB, it'll take a while to load that into memory and set up all the pointers (assuming a conventional linked list structure). Maybe tens of seconds... I'm thinking of maintaining the lists at the PG level. That's to say, we have an active/inactive list for every PG. We can load the lists in parallel during rebooting. Also, the ~100 MB lists are split among different OSD nodes. Perhaps it does not need such a long time to load them? I wonder if instead we should construct some sort of flat model where we load slabs of contiguous memory, 10's of MB each, and have the next/previous pointers be a (slab,position) pair. That way we can load it into memory in big chunks, quickly, and be able to operate on it (adjust links) immediately. Another thought: currently we use the hobject_t hash only instead of the full object name. We could continue to do the same, or we could do a hash pair (hobject_t hash + a different hash of the rest of the object) to keep the representation compact. With a model like the above, that could get the object representation down to 2 u32's. A link could be a slab + position (2 more u32's), and if we have prev + next that'd be just 6x4=24 bytes per object. Looks like for an object, the head and the snapshot version have the same hobject hash. Thus we have to use the hash pair instead of just the hobject hash. But I still have two questions if we use the hash pair to represent an object. 1) Does the hash pair uniquely identify an object? That's to say, is it possible for two objects to have the same hash pair? With two hashes, collisions would be rare but could happen. 2) We need a way to get the full object name from the hash pair, so that we know what objects to evict. But it seems like we don't have a good way to do this? Ah, yeah--I'm a little stuck in the current hitset view of things.
I think we can either embed the full ghobject_t (which means we lose the fixed-size property, and the per-object overhead goes way up.. probably from ~24 bytes to more like 80 or 100). Or, we can enumerate objects starting at the (hobject_t) hash position to find the object. That's somewhat inefficient for FileStore (it'll list a directory of a hundred or so objects, probably, and iterate over them to find the right one), but for NewStore it will be quite fast (NewStore has all objects sorted into keys in rocksdb, so we just start listing at the right offset). Usually we'll get the object right off, unless there are hobject_t hash collisions (already reasonably rare since it's a 2^32 space for the pool). Given that, I would lean toward the 2-hash fixed-sized records (of these 2 options)... sage
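The flat slab model with (slab, position) links and a two-hash object key discussed above can be sketched as a plain struct. This is an illustrative sketch only, not Ceph code; the type and field names are invented:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical fixed-size LRU record, as discussed in the thread: two
// 32-bit hashes identify the object, and the prev/next links address
// records inside flat slabs of contiguous memory instead of using raw
// pointers, so slabs can be loaded from disk in big chunks and used
// immediately.
struct SlabLink {
  uint32_t slab;  // which slab the neighboring record lives in
  uint32_t pos;   // record index within that slab
};

struct LruRecord {
  uint32_t hobject_hash;  // existing hobject_t hash (2^32 space per pool)
  uint32_t name_hash;     // second hash over the rest of the object name
  SlabLink prev;          // previous record in the LRU list
  SlabLink next;          // next record in the LRU list
};

// Matches the 6x4 = 24 bytes per object estimated in the email.
static_assert(sizeof(LruRecord) == 24, "6 x u32 = 24 bytes per object");
```

Because every field is a `uint32_t`, the struct has no padding, which is what keeps the per-object overhead at the quoted 24 bytes.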
RE: The design of the eviction improvement
Hi Allen, -Original Message- From: Allen Samuels [mailto:allen.samu...@sandisk.com] Sent: Thursday, July 23, 2015 2:41 AM To: Sage Weil; Wang, Zhiqiang Cc: sj...@redhat.com; ceph-devel@vger.kernel.org Subject: RE: The design of the eviction improvement I'm very concerned about designing around the assumption that objects are ~1MB in size. That's probably a good assumption for block and HDFS dominated systems, but likely a very poor assumption about many object and file dominated systems. This is true. If we have lots of small objects/files, the memory used for the LRU lists could be extremely large. If I understand the proposals that have been discussed, each of them assumes an in-memory data structure with an entry per object (the exact size of the entry varies with the different proposals). Under that assumption, I have another concern which is the lack of graceful degradation as the object counts grow and the in-memory data structures get larger. Everything seems fine until just a few more objects get added, then the system starts to page and performance drops dramatically (likely) to the point where Linux will start killing OSDs. What's really needed is some kind of way to extend the lists into storage in a way that doesn't cause a zillion I/O operations. I have some vague idea that some data structure like the LSM mechanism ought to be able to accomplish what we want. Some amount of the data structure (the most likely to be used) is held in DRAM [and backed to storage for restart] and the least likely to be used is flushed to storage with some mechanism that allows batched updates. The LSM mechanism could solve the memory consumption problem. But I guess the process of choosing which objects to evict is complex and inefficient. Also, after evicting some objects, we need to update the on-disk file to remove the entries of those objects. This is inefficient, too.
Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Wednesday, July 22, 2015 5:57 AM To: Wang, Zhiqiang Cc: sj...@redhat.com; ceph-devel@vger.kernel.org Subject: RE: The design of the eviction improvement On Wed, 22 Jul 2015, Wang, Zhiqiang wrote: The part that worries me now is the speed with which we can load and manage such a list. Assuming it is several hundred MB, it'll take a while to load that into memory and set up all the pointers (assuming a conventional linked list structure). Maybe tens of seconds... I'm thinking of maintaining the lists at the PG level. That's to say, we have an active/inactive list for every PG. We can load the lists in parallel during rebooting. Also, the ~100 MB lists are split among different OSD nodes. Perhaps it does not need such a long time to load them? I wonder if instead we should construct some sort of flat model where we load slabs of contiguous memory, 10's of MB each, and have the next/previous pointers be a (slab,position) pair. That way we can load it into memory in big chunks, quickly, and be able to operate on it (adjust links) immediately. Another thought: currently we use the hobject_t hash only instead of the full object name. We could continue to do the same, or we could do a hash pair (hobject_t hash + a different hash of the rest of the object) to keep the representation compact. With a model like the above, that could get the object representation down to 2 u32's. A link could be a slab + position (2 more u32's), and if we have prev + next that'd be just 6x4=24 bytes per object. Looks like for an object, the head and the snapshot version have the same hobject hash. Thus we have to use the hash pair instead of just the hobject hash.
But I still have two questions if we use the hash pair to represent an object. 1) Does the hash pair uniquely identify an object? That's to say, is it possible for two objects to have the same hash pair? With two hashes collisions would be rare but could happen 2) We need a way to get the full object name from the hash pair, so that we know what objects to evict. But seems like we don't have a good way to do this? Ah, yeah--I'm a little stuck in the current hitset view of things. I think we can either embed the full ghobject_t (which means we lose the fixed-size property, and the per-object overhead goes way up.. probably from ~24 bytes to more like 80 or 100). Or, we can enumerate objects starting at the (hobject_t) hash position to find the object. That's somewhat inefficient for FileStore (it'll list a directory of a hundred or so objects, probably, and iterate over them to find the right one), but for NewStore it will be quite
RE: The design of the eviction improvement
-Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Thursday, July 23, 2015 2:51 AM To: Allen Samuels Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org Subject: RE: The design of the eviction improvement On Wed, 22 Jul 2015, Allen Samuels wrote: I'm very concerned about designing around the assumption that objects are ~1MB in size. That's probably a good assumption for block and HDFS dominated systems, but likely a very poor assumption about many object and file dominated systems. If I understand the proposals that have been discussed, each of them assumes an in-memory data structure with an entry per object (the exact size of the entry varies with the different proposals). Under that assumption, I have another concern which is the lack of graceful degradation as the object counts grow and the in-memory data structures get larger. Everything seems fine until just a few more objects get added, then the system starts to page and performance drops dramatically (likely) to the point where Linux will start killing OSDs. What's really needed is some kind of way to extend the lists into storage in a way that doesn't cause a zillion I/O operations. I have some vague idea that some data structure like the LSM mechanism ought to be able to accomplish what we want. Some amount of the data structure (the most likely to be used) is held in DRAM [and backed to storage for restart] and the least likely to be used is flushed to storage with some mechanism that allows batched updates. How about this: The basic mapping we want is object -> atime. We keep a simple LRU of the top N objects in memory with the object->atime values. When an object is accessed, it is moved or added to the top of the list. Periodically, or when the LRU size reaches N * (1.x), we flush:
- write the top N items to a compact object that can be quickly loaded
- write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb in a simple object -> atime fashion
When the agent runs, we just walk across that key range of the db the same way we currently enumerate objects. For each record we use either the stored atime or the value in the in-memory LRU (it'll need to be dual-indexed by both a list and a hash map), whichever is newer. We can use the same histogram estimation approach we do now to determine if the object in question is below the flush/evict threshold. This looks similar to what we do now, except it keeps an LRU of the object->atime mapping in RAM, instead of using a hitset. The object age calculated using the atime would be more accurate than the current hitset and mtime approach. One comment is that I think if we can find a record of an object in the in-memory LRU list, we don't need to query the DB, since the atime in the LRU list is for sure newer than that in the db (if it has one). My concern on this approach is whether the evict decision made by the histogram estimation approach is good enough. It only measures 'recency'. And it makes the decision based on some threshold, not starting from the oldest. In contrast, most of the practical algorithms make the decision based on both 'recency' and 'frequency' (accessed once recently vs. accessed twice or more recently). If we believe the histogram estimation approach is good enough, I think we can easily integrate the idea above with 2Q.
1) The in-memory LRU lists are the same as 2Q. i.e., there are active/inactive lists, and the movements between the two lists are the same as what I stated in the original mail. But we have a limit on the size of the lists. Only the top N hottest objects are kept in the lists.
2) When the size of the lists exceeds N*(1.x), evict the oldest items (N .. N*1.x) to the db store.
3) N could be dynamically adjusted based on the average size of objects in the PG.
4) Eviction decisions are made by the histogram estimation approach.
Plus I think we want to evict those objects which are not in the in-memory lists first. The LSM does the work of sorting/compacting the atime info, while we avoid touching it at all for the hottest objects to keep the amount of work it has to do in check. sage Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Wednesday, July 22, 2015 5:57 AM To: Wang, Zhiqiang Cc: sj...@redhat.com; ceph-devel@vger.kernel.org Subject: RE: The design of the eviction improvement On Wed, 22 Jul 2015, Wang, Zhiqiang wrote: The part that worries me now is the speed with which we can load and manage such a list. Assuming it is several hundred MB, it'll take a while to load that into memory and set up all the pointers
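The bounded in-memory LRU with a periodic flush of the cold tail, as described in the exchange above, could look roughly like the following. This is an illustrative sketch, not Ceph code: the class and member names are invented, and a `std::map` stands in for leveldb/rocksdb.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>

// Keep the hottest N object -> atime entries in memory; once the list
// grows past N * x (the "1.x" high-water mark from the email), spill
// the coldest tail into a sorted key-value store.  The structure is
// dual-indexed: a list for recency order plus a hash map for O(1)
// lookup on access, as the email suggests.
class BoundedLru {
public:
  BoundedLru(std::size_t n, double x)
      : n_(n), high_water_(std::size_t(n * x)) {}

  // On access: move (or insert) the entry to the front with the new atime.
  void touch(const std::string& obj, uint64_t atime) {
    auto it = index_.find(obj);
    if (it != index_.end())
      lru_.erase(it->second);
    lru_.emplace_front(obj, atime);
    index_[obj] = lru_.begin();
    if (lru_.size() > high_water_)
      flush();
  }

  // Spill everything older than the top N into the "db".
  void flush() {
    while (lru_.size() > n_) {
      auto& [obj, atime] = lru_.back();
      db_[obj] = atime;  // simple object -> atime record
      index_.erase(obj);
      lru_.pop_back();
    }
  }

  std::size_t in_memory() const { return lru_.size(); }
  const std::map<std::string, uint64_t>& db() const { return db_; }

private:
  using Entry = std::pair<std::string, uint64_t>;
  std::size_t n_, high_water_;
  std::list<Entry> lru_;
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
  std::map<std::string, uint64_t> db_;  // stand-in for leveldb/rocksdb
};
```

With N=4 and x=1.5, the seventh `touch` crosses the high-water mark of 6 and spills the three coldest entries, leaving the top 4 in memory. Batching the spill this way is what keeps the per-access cost at a list splice rather than a db write.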
RE: building just src/tools/rados
Hi Tom, Have you tried cd src; make rados? Regards, Igor. -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Deneau, Tom Sent: Wednesday, July 22, 2015 10:13 PM To: ceph-devel Subject: building just src/tools/rados Is there a make command that would build just the src/tools or even just src/tools/rados ? -- Tom Deneau
Re: About Fio backend with ObjectStore API
Hi Haomai, Sorry for the late response, I was out of the office. I'm afraid I haven't run into that segfault. The io_ops should be set at the very beginning when it calls get_ioengine(). All I can suggest is that you verify that your job file is pointing to the correct fio_ceph_objectstore.so. If you've made any other interesting changes to the job file, could you share it here? Casey - Original Message - From: Haomai Wang haomaiw...@gmail.com To: Casey Bodley cbod...@gmail.com Cc: Matt W. Benjamin m...@cohortfs.com, James (Fei) Liu-SSI james@ssi.samsung.com, ceph-devel@vger.kernel.org Sent: Tuesday, July 21, 2015 7:50:32 AM Subject: Re: About Fio backend with ObjectStore API Hi Casey, I checked your commits and know what you fixed. I cherry-picked your new commits but I still met the same problem. It's strange that it always hits a segfault when entering _fio_setup_ceph_filestore_data; gdb tells me td->io_ops is NULL, but when I go up the stack, td->io_ops is not null. Maybe it's related to dlopen? Do you have any hint about this? On Thu, Jul 16, 2015 at 5:23 AM, Casey Bodley cbod...@gmail.com wrote: Hi Haomai, I was able to run this after a couple changes to the filestore.fio job file. Two of the config options were using the wrong names. I pushed a fix for the job file, as well as a patch that renames everything from filestore to objectstore (thanks James), to https://github.com/linuxbox2/linuxbox-ceph/commits/fio-objectstore. I found that the read support doesn't appear to work anymore, so give rw=write a try. And because it does a mkfs(), make sure you're pointing it to an empty xfs directory with the directory= option. Casey On Tue, Jul 14, 2015 at 2:45 AM, Haomai Wang haomaiw...@gmail.com wrote: Has anyone successfully run fio with this external io engine ceph_objectstore? It's strange that it always hits a segfault when entering _fio_setup_ceph_filestore_data; gdb tells me td->io_ops is NULL, but when I go up the stack, td->io_ops is not null.
Maybe it's related to dlopen? On Fri, Jul 10, 2015 at 3:51 PM, Haomai Wang haomaiw...@gmail.com wrote: I have rebased the branch with master, and pushed it to the ceph upstream repo. https://github.com/ceph/ceph/compare/fio-objectstore?expand=1 Plz let me know who is working on this. Otherwise, I would like to improve this to be merge ready. On Fri, Jul 10, 2015 at 4:26 AM, Matt W. Benjamin m...@cohortfs.com wrote: That makes sense. Matt - James (Fei) Liu-SSI james@ssi.samsung.com wrote: Hi Casey, Got it. I was directed to the old code base. By the way, since the test case is used to exercise all of the object stores, I strongly recommend changing the name from fio_ceph_filestore.cc to fio_ceph_objectstore.cc. And the code in fio_ceph_filestore.cc should be refactored to reflect that the whole objectstore will be supported by fio_ceph_objectstore.cc. What do you think? Let me know if you need any help from my side. Regards, James -Original Message- From: Casey Bodley [mailto:cbod...@gmail.com] Sent: Thursday, July 09, 2015 12:32 PM To: James (Fei) Liu-SSI Cc: Haomai Wang; ceph-devel@vger.kernel.org Subject: Re: About Fio backend with ObjectStore API Hi James, Are you looking at the code from https://github.com/linuxbox2/linuxbox-ceph/tree/fio-objectstore? It uses ObjectStore::create() instead of new FileStore(). This allows us to exercise all of the object stores with the same code. Casey On Thu, Jul 9, 2015 at 2:01 PM, James (Fei) Liu-SSI james@ssi.samsung.com wrote: Hi Casey, Here is the code in fio_ceph_filestore.cc. Basically, it creates a filestore as backend engine for IO exercises. If we have to send IO commands to KeyValue Store or Newstore, we have to change the code accordingly, right? I did not see any other files like fio_ceph_keyvaluestore.cc or fio_ceph_newstore.cc. In my humble opinion, we might need to create two more fio engines for keyvaluestore and newstore if we want to exercise these two, right?
Regards, James

static int fio_ceph_filestore_init(struct thread_data *td)
{
  vector<const char*> args;
  struct ceph_filestore_data *ceph_filestore_data = (struct ceph_filestore_data *) td->io_ops->data;
  ObjectStore::Transaction ft;

  global_init(NULL, args, CEPH_ENTITY_TYPE_OSD, CODE_ENVIRONMENT_UTILITY, 0);
  //g_conf->journal_dio = false;
  common_init_finish(g_ceph_context);
  //g_ceph_context->_conf->set_val("debug_filestore", "20");
  //g_ceph_context->_conf->set_val("debug_throttle", "20");
  g_ceph_context->_conf->apply_changes(NULL);

  ceph_filestore_data->osd_path = strdup("/mnt/fio_ceph_filestore.XXX");
  ceph_filestore_data->journal_path = strdup("/var/lib/ceph/osd/journal-ram/fio_ceph_filestore.XXX");

  if (!mkdtemp(ceph_filestore_data->osd_path)) {
    cout
RE: The design of the eviction improvement
On Wed, 22 Jul 2015, Allen Samuels wrote: Don't we need to double-index the data structure? We need it indexed by atime for the purposes of eviction, but we need it indexed by object name for the purposes of updating the list upon a usage. If you use the same approach the agent uses now (iterate over items, evict/trim anything in the bottom end of the observed age distribution) you can get away without the double-index. Iterating over the LSM should be quite cheap. I'd be more worried about the cost of the insertions. I'm also not sure the simplistic approach below can be generalized to something like 2Q (and certainly not something like MQ). Maybe... On the other hand, I'm not sure it is the end of the world if at the end of the day the memory requirements for a cache-tier OSD are higher and inversely proportional to the object size. We can make the OSD flush/evict more aggressively if the memory utilization (due to a high object count) gets out of hand as a safety mechanism. Paying a few extra $$ for RAM isn't the end of the world I'm guessing when the performance payoff is significant... sage Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Wednesday, July 22, 2015 11:51 AM To: Allen Samuels Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org Subject: RE: The design of the eviction improvement On Wed, 22 Jul 2015, Allen Samuels wrote: I'm very concerned about designing around the assumption that objects are ~1MB in size. That's probably a good assumption for block and HDFS dominated systems, but likely a very poor assumption about many object and file dominated systems.
If I understand the proposals that have been discussed, each of them assumes an in-memory data structure with an entry per object (the exact size of the entry varies with the different proposals). Under that assumption, I have another concern which is the lack of graceful degradation as the object counts grow and the in-memory data structures get larger. Everything seems fine until just a few more objects get added, then the system starts to page and performance drops dramatically (likely) to the point where Linux will start killing OSDs. What's really needed is some kind of way to extend the lists into storage in a way that doesn't cause a zillion I/O operations. I have some vague idea that some data structure like the LSM mechanism ought to be able to accomplish what we want. Some amount of the data structure (the most likely to be used) is held in DRAM [and backed to storage for restart] and the least likely to be used is flushed to storage with some mechanism that allows batched updates. How about this: The basic mapping we want is object -> atime. We keep a simple LRU of the top N objects in memory with the object->atime values. When an object is accessed, it is moved or added to the top of the list. Periodically, or when the LRU size reaches N * (1.x), we flush:
- write the top N items to a compact object that can be quickly loaded
- write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb in a simple object -> atime fashion
When the agent runs, we just walk across that key range of the db the same way we currently enumerate objects. For each record we use either the stored atime or the value in the in-memory LRU (it'll need to be dual-indexed by both a list and a hash map), whichever is newer. We can use the same histogram estimation approach we do now to determine if the object in question is below the flush/evict threshold.
The LSM does the work of sorting/compacting the atime info, while we avoid touching it at all for the hottest objects to keep the amount of work it has to do in check. sage Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Wednesday, July 22, 2015 5:57 AM To: Wang, Zhiqiang Cc: sj...@redhat.com; ceph-devel@vger.kernel.org Subject: RE: The design of the eviction improvement On Wed, 22 Jul 2015, Wang, Zhiqiang wrote: The part that worries me now is the speed with which we can load and manage such a list. Assuming it is several hundred MB, it'll take a while to load that into memory and set up all the pointers (assuming a conventional linked list structure). Maybe tens of seconds... I'm thinking of maintaining the lists at the PG level. That's to
RE: The design of the eviction improvement
Yes the cost of the insertions with the current scheme is probably prohibitive. Wouldn't it approach the same amount of time as just having atime turned on in the file system? My concern about the memory is mostly that we ensure whatever algorithm is selected degrades gracefully when you get high counts of small objects. I agree that paying $ for RAM that translates into actual performance isn't really a problem. It really boils down to your workload and access pattern. Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Wednesday, July 22, 2015 2:53 PM To: Allen Samuels Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org Subject: RE: The design of the eviction improvement On Wed, 22 Jul 2015, Allen Samuels wrote: Don't we need to double-index the data structure? We need it indexed by atime for the purposes of eviction, but we need it indexed by object name for the purposes of updating the list upon a usage. If you use the same approach the agent uses now (iterate over items, evict/trim anything in bottom end of observed age distribution) you can get away without the double-index. Iterating over the LSM should be quite cheap. I'd be more worried about the cost of the insertions. I'm also not sure the simplistic approach below can be generalized to something like 2Q (and certainly not something like MQ). Maybe... On the other hand, I'm not sure it is the end of the world if at the end of the day the memory requirements for a cache-tier OSD are higher and inversely proportional to the object size. We can make the OSD flush/evict more aggressively if the memory utilization (due to a high object count) gets out of hand as a safety mechanism. Paying a few extra $$ for RAM isn't the end of the world I'm guessing when the performance payoff is significant... 
sage Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Wednesday, July 22, 2015 11:51 AM To: Allen Samuels Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org Subject: RE: The design of the eviction improvement On Wed, 22 Jul 2015, Allen Samuels wrote: I'm very concerned about designing around the assumption that objects are ~1MB in size. That's probably a good assumption for block and HDFS dominated systems, but likely a very poor assumption about many object and file dominated systems. If I understand the proposals that have been discussed, each of them assumes an in-memory data structure with an entry per object (the exact size of the entry varies with the different proposals). Under that assumption, I have another concern which is the lack of graceful degradation as the object counts grow and the in-memory data structures get larger. Everything seems fine until just a few more objects get added, then the system starts to page and performance drops dramatically (likely) to the point where Linux will start killing OSDs. What's really needed is some kind of way to extend the lists into storage in a way that doesn't cause a zillion I/O operations. I have some vague idea that some data structure like the LSM mechanism ought to be able to accomplish what we want. Some amount of the data structure (the most likely to be used) is held in DRAM [and backed to storage for restart] and the least likely to be used is flushed to storage with some mechanism that allows batched updates. How about this: The basic mapping we want is object -> atime. We keep a simple LRU of the top N objects in memory with the object->atime values. When an object is accessed, it is moved or added to the top of the list.
Periodically, or when the LRU size reaches N * (1.x), we flush:
- write the top N items to a compact object that can be quickly loaded
- write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb in a simple object -> atime fashion
When the agent runs, we just walk across that key range of the db the same way we currently enumerate objects. For each record we use either the stored atime or the value in the in-memory LRU (it'll need to be dual-indexed by both a list and a hash map), whichever is newer. We can use the same histogram estimation approach we do now to determine if the object in question is below the flush/evict threshold. The LSM does the work of sorting/compacting the atime info, while we avoid touching it at all for the hottest objects to keep the amount of work it has to do in check. sage
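The "histogram estimation" step referenced repeatedly in this thread amounts to: estimate the age distribution of the cached objects, then evict anything older than the age at a target percentile. A hypothetical sketch of that decision (not the actual Ceph agent code; the function name and rounding are invented for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Given sampled object ages (larger = older) and the fraction of
// objects we want to evict, return the age cutoff: objects older than
// the returned value fall in the bottom end of the observed age
// distribution and are eviction candidates.
uint64_t evict_cutoff(std::vector<uint64_t> ages, double evict_fraction) {
  if (ages.empty())
    return 0;
  // Index (in ascending age order) of the oldest object we still keep;
  // +0.5 rounds instead of truncating.
  std::size_t keep = std::size_t(ages.size() * (1.0 - evict_fraction) + 0.5);
  if (keep >= ages.size())
    keep = ages.size() - 1;
  // Partial sort: only the keep-th order statistic is needed, so this
  // is O(n) rather than a full sort.
  std::nth_element(ages.begin(), ages.begin() + keep, ages.end());
  return ages[keep];
}
```

Note this only measures recency, which is exactly the limitation raised above: an object accessed once very recently and one accessed many times look identical to the cutoff, which is what motivates combining it with 2Q-style active/inactive lists.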
Re: About Fio backend with ObjectStore API
no special

[global]
#logging
#write_iops_log=write_iops_log
#write_bw_log=write_bw_log
#write_lat_log=write_lat_log
ioengine=./ceph-int/src/.libs/libfio_ceph_objectstore.so
invalidate=0 # mandatory
rw=write
#bs=4k

[filestore]
iodepth=1
# create a journaled filestore
objectstore=filestore
directory=./osd/
filestore_journal=./osd/journal

On Thu, Jul 23, 2015 at 4:56 AM, Casey Bodley cbod...@redhat.com wrote: Hi Haomai, Sorry for the late response, I was out of the office. I'm afraid I haven't run into that segfault. The io_ops should be set at the very beginning when it calls get_ioengine(). All I can suggest is that you verify that your job file is pointing to the correct fio_ceph_objectstore.so. If you've made any other interesting changes to the job file, could you share it here? Casey - Original Message - From: Haomai Wang haomaiw...@gmail.com To: Casey Bodley cbod...@gmail.com Cc: Matt W. Benjamin m...@cohortfs.com, James (Fei) Liu-SSI james@ssi.samsung.com, ceph-devel@vger.kernel.org Sent: Tuesday, July 21, 2015 7:50:32 AM Subject: Re: About Fio backend with ObjectStore API Hi Casey, I checked your commits and know what you fixed. I cherry-picked your new commits but I still met the same problem. It's strange that it always hits a segfault when entering _fio_setup_ceph_filestore_data; gdb tells me td->io_ops is NULL, but when I go up the stack, td->io_ops is not null. Maybe it's related to dlopen? Do you have any hint about this? On Thu, Jul 16, 2015 at 5:23 AM, Casey Bodley cbod...@gmail.com wrote: Hi Haomai, I was able to run this after a couple changes to the filestore.fio job file. Two of the config options were using the wrong names. I pushed a fix for the job file, as well as a patch that renames everything from filestore to objectstore (thanks James), to https://github.com/linuxbox2/linuxbox-ceph/commits/fio-objectstore. I found that the read support doesn't appear to work anymore, so give rw=write a try.
And because it does a mkfs(), make sure you're pointing it to an empty xfs directory with the directory= option. Casey On Tue, Jul 14, 2015 at 2:45 AM, Haomai Wang haomaiw...@gmail.com wrote: Has anyone successfully run fio with this external io engine ceph_objectstore? It's strange that it always hits a segfault when entering _fio_setup_ceph_filestore_data; gdb tells me td->io_ops is NULL, but when I go up the stack, td->io_ops is not null. Maybe it's related to dlopen? On Fri, Jul 10, 2015 at 3:51 PM, Haomai Wang haomaiw...@gmail.com wrote: I have rebased the branch with master, and pushed it to the ceph upstream repo. https://github.com/ceph/ceph/compare/fio-objectstore?expand=1 Plz let me know who is working on this. Otherwise, I would like to improve this to be merge ready. On Fri, Jul 10, 2015 at 4:26 AM, Matt W. Benjamin m...@cohortfs.com wrote: That makes sense. Matt - James (Fei) Liu-SSI james@ssi.samsung.com wrote: Hi Casey, Got it. I was directed to the old code base. By the way, since the test case is used to exercise all of the object stores, I strongly recommend changing the name from fio_ceph_filestore.cc to fio_ceph_objectstore.cc. And the code in fio_ceph_filestore.cc should be refactored to reflect that the whole objectstore will be supported by fio_ceph_objectstore.cc. What do you think? Let me know if you need any help from my side. Regards, James -Original Message- From: Casey Bodley [mailto:cbod...@gmail.com] Sent: Thursday, July 09, 2015 12:32 PM To: James (Fei) Liu-SSI Cc: Haomai Wang; ceph-devel@vger.kernel.org Subject: Re: About Fio backend with ObjectStore API Hi James, Are you looking at the code from https://github.com/linuxbox2/linuxbox-ceph/tree/fio-objectstore? It uses ObjectStore::create() instead of new FileStore(). This allows us to exercise all of the object stores with the same code. Casey On Thu, Jul 9, 2015 at 2:01 PM, James (Fei) Liu-SSI james@ssi.samsung.com wrote: Hi Casey, Here is the code in fio_ceph_filestore.cc.
Basically, it creates a filestore as backend engine for IO exercises. If we have to send IO commands to KeyValue Store or Newstore, we have to change the code accordingly, right? I did not see any other files like fio_ceph_keyvaluestore.cc or fio_ceph_newstore.cc. In my humble opinion, we might need to create two more fio engines for keyvaluestore and newstore if we want to exercise these two, right? Regards, James

static int fio_ceph_filestore_init(struct thread_data *td)
{
  vector<const char*> args;
  struct ceph_filestore_data *ceph_filestore_data = (struct ceph_filestore_data *) td->io_ops->data;
  ObjectStore::Transaction ft;

  global_init(NULL, args, CEPH_ENTITY_TYPE_OSD, CODE_ENVIRONMENT_UTILITY, 0);
  //g_conf->journal_dio = false;
  common_init_finish(g_ceph_context);
07/22/2015 Weekly Ceph Performance Meeting IS ON!
8AM PST as usual! Topics today include a new ceph_test_rados benchmark being added to CBT. Please feel free to add your own! Here are the links:
Etherpad URL: http://pad.ceph.com/p/performance_weekly
To join the Meeting: https://bluejeans.com/268261044
To join via Browser: https://bluejeans.com/268261044/browser
To join with Lync: https://bluejeans.com/268261044/lync
To join via Room System: Video Conferencing System: bjn.vc -or- 199.48.152.152 Meeting ID: 268261044
To join via Phone: 1) Dial: +1 408 740 7256 +1 888 240 2560 (US Toll Free) +1 408 317 9253 (Alternate Number) (see all numbers - http://bluejeans.com/numbers) 2) Enter Conference ID: 268261044
Mark