Re: [PATCH 7/8] rbd: do not read in parent info before snap context
On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
> Currently rbd_dev_v2_header_info() reads in parent info before the snap
> context is read in. This is wrong, because we may need to look at the
> parent_overlap value of the snapshot instead of that of the base

I had some trouble understanding this. The parent image, snapshot id
and overlap are together a property of a particular image (and
associated with its header object). Nothing about that is dependent
on the child image's snapshots or snapshot id. However...

> image, for example when mapping a snapshot - see next commit. (When
> mapping a snapshot, all we have is its name and we need the snap
> context to translate that name into an id to know which parent info to
> look for.)

I finally figured out what path through the code you were talking
about. Here's what I see. On the initial probe (previously), we have:

rbd_add()
  do_rbd_add()
    rbd_dev_image_probe()
    |rbd_dev_header_info()
    | |rbd_dev_v2_header_info()
    | | |rbd_dev_v2_parent_info()
    | | |  --> expects rbd_dev->spec->snap_id to be valid
    | | |rbd_dev_v2_snap_context()
    | | |  --> this fills in the snapshot context
    |rbd_spec_fill_snap_id()
    |  --> fills rbd_dev->spec->snap_id
    |rbd_dev_probe_parent()

So clearly, at least when mapping a snapshot, we would not get the
desired result. We'd be unable to look up the id for the named
snapshot, so would get ENOENT.

Now you've pulled getting the parent info back out into
rbd_dev_image_probe():

rbd_add()
  do_rbd_add()
    rbd_dev_image_probe()
    |rbd_dev_header_info()
    | |rbd_dev_v2_header_info()
    | | |rbd_dev_v2_snap_context()
    | | |  --> this fills in the snapshot context
    |rbd_spec_fill_snap_id()
    |  --> fills rbd_dev->spec->snap_id
    |rbd_dev_v2_parent_info()
    |  --> rbd_dev->spec->snap_id will be valid
    |rbd_dev_probe_parent()

In the refresh path it's similar. You move the rbd_dev_v2_parent_info()
call into rbd_dev_refresh() instead of it happening in
rbd_dev_v2_header_info().
Missing the ordering problem here might have caused even more subtle
problems (due to using an apparently valid but possibly out-of-date
snapshot context). Given this understanding I'd say your change looks
good.

> The approach taken here is to make sure rbd_dev_v2_parent_info() is
> called after the snap context has been read in. The other approach
> would be to add a parent_overlap field to struct rbd_mapping and
> maintain it the same way rbd_mapping::size is maintained. The reason
> I chose the first approach is that the value of keeping around both
> base image values and the actual mapping values is unclear to me.

I'll think about this and respond to your followup e-mail.

Reviewed-by: Alex Elder el...@linaro.org

> Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
> ---
>  drivers/block/rbd.c | 64 ---
>  1 file changed, 30 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 16f388f2960b..c4606987e9d1 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -515,6 +515,7 @@ static void rbd_dev_remove_parent(struct rbd_device *rbd_dev);
>  static int rbd_dev_refresh(struct rbd_device *rbd_dev);
>  static int rbd_dev_v2_header_onetime(struct rbd_device *rbd_dev);
>  static int rbd_dev_header_info(struct rbd_device *rbd_dev);
> +static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev);
>  static const char *rbd_dev_v2_snap_name(struct rbd_device *rbd_dev,
>  						u64 snap_id);
>  static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id,
> @@ -3513,6 +3514,18 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  	mapping_size = rbd_dev->mapping.size;
>  	ret = rbd_dev_header_info(rbd_dev);

This looked odd. But I guess I didn't notice you didn't check the
return value when you first added this line a few patches ago...

> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * If there is a parent, see if it has disappeared due to the
> +	 * mapped image getting flattened.
> +	 */
> +	if (rbd_dev->parent) {
> +		ret = rbd_dev_v2_parent_info(rbd_dev);
> +		if (ret)
> +			return ret;
> +	}
>
>  	if (rbd_dev->spec->snap_id == CEPH_NOSNAP) {
>  		if (rbd_dev->mapping.size != rbd_dev->header.image_size)
> @@ -3524,11 +3537,10 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  	up_write(&rbd_dev->header_rwsem);
>
> -	if (mapping_size != rbd_dev->mapping.size) {
> +	if (mapping_size != rbd_dev->mapping.size)
>  		rbd_dev_update_size(rbd_dev);
> -	}
>
> -	return ret;
> +	return 0;
>  }
>
>  static int rbd_init_disk(struct rbd_device *rbd_dev)
> @@ -4477,33 +4489,6 @@ static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev)
>  	return ret;
>  }
>
> -	/*
> -	 * If the
Re: [PATCH 7/8] rbd: do not read in parent info before snap context
On Fri, Jul 25, 2014 at 12:14 PM, Alex Elder el...@ieee.org wrote:
> On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
>> Currently rbd_dev_v2_header_info() reads in parent info before the snap
>> context is read in. This is wrong, because we may need to look at the
>> parent_overlap value of the snapshot instead of that of the base
>
> I had some trouble understanding this. The parent image, snapshot id
> and overlap are together a property of a particular image (and
> associated with its header object). Nothing about that is dependent
> on the child image's snapshots or snapshot id. However...
>
>> image, for example when mapping a snapshot - see next commit. (When
>> mapping a snapshot, all we have is its name and we need the snap
>> context to translate that name into an id to know which parent info
>> to look for.)
>
> I finally figured out what path through the code you were talking
> about. Here's what I see. On the initial probe (previously), we have:
>
> rbd_add()
>   do_rbd_add()
>     rbd_dev_image_probe()
>     |rbd_dev_header_info()
>     | |rbd_dev_v2_header_info()
>     | | |rbd_dev_v2_parent_info()
>     | | |  --> expects rbd_dev->spec->snap_id to be valid
>     | | |rbd_dev_v2_snap_context()
>     | | |  --> this fills in the snapshot context
>     |rbd_spec_fill_snap_id()
>     |  --> fills rbd_dev->spec->snap_id
>     |rbd_dev_probe_parent()
>
> So clearly, at least when mapping a snapshot, we would not get the
> desired result. We'd be unable to look up the id for the named
> snapshot, so would get ENOENT.
>
> Now you've pulled getting the parent info back out into
> rbd_dev_image_probe():
>
> rbd_add()
>   do_rbd_add()
>     rbd_dev_image_probe()
>     |rbd_dev_header_info()
>     | |rbd_dev_v2_header_info()
>     | | |rbd_dev_v2_snap_context()
>     | | |  --> this fills in the snapshot context
>     |rbd_spec_fill_snap_id()
>     |  --> fills rbd_dev->spec->snap_id
>     |rbd_dev_v2_parent_info()
>     |  --> rbd_dev->spec->snap_id will be valid
>     |rbd_dev_probe_parent()
>
> In the refresh path it's similar. You move the
> rbd_dev_v2_parent_info() call into rbd_dev_refresh() instead of it
> happening in rbd_dev_v2_header_info().
> Missing the ordering problem here might have caused even more subtle
> problems (due to using an apparently valid but possibly out-of-date
> snapshot context). Given this understanding I'd say your change looks
> good.

Yeah, basically any time we read in snapshot metadata (both when
mapping a snapshot or following a chain of parent images) we need to
look at the parent_overlap of the snapshot (i.e. the parent_overlap of
the base image at the time the snapshot was taken). That requires
having snap_id at hand when the call to rbd_dev_v2_parent_info() is
made.

>> The approach taken here is to make sure rbd_dev_v2_parent_info() is
>> called after the snap context has been read in. The other approach
>> would be to add a parent_overlap field to struct rbd_mapping and
>> maintain it the same way rbd_mapping::size is maintained. The reason
>> I chose the first approach is that the value of keeping around both
>> base image values and the actual mapping values is unclear to me.
>
> I'll think about this and respond to your followup e-mail.
>
> Reviewed-by: Alex Elder el...@linaro.org
>
>> Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
>> ---
>>  drivers/block/rbd.c | 64 ---
>>  1 file changed, 30 insertions(+), 34 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index 16f388f2960b..c4606987e9d1 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -515,6 +515,7 @@ static void rbd_dev_remove_parent(struct rbd_device *rbd_dev);
>>  static int rbd_dev_refresh(struct rbd_device *rbd_dev);
>>  static int rbd_dev_v2_header_onetime(struct rbd_device *rbd_dev);
>>  static int rbd_dev_header_info(struct rbd_device *rbd_dev);
>> +static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev);
>>  static const char *rbd_dev_v2_snap_name(struct rbd_device *rbd_dev,
>>  						u64 snap_id);
>>  static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id,
>> @@ -3513,6 +3514,18 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>  	mapping_size = rbd_dev->mapping.size;
>>  	ret = rbd_dev_header_info(rbd_dev);
>
> This looked odd.
> But I guess I didn't notice you didn't check the return value when you
> first added this line a few patches ago...

This is what rbd_dev_refresh() did before any of my changes. It would
go ahead with trying to update the mapping size and validating snapshot
existence even if rbd_dev_v{1,2}_header_info() had failed.

>> +	if (ret)
>> +		return ret;

I'll move this bit to the harden rbd_dev_refresh() commit.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ceph: remove redundant memset(0)
xattrs array of pointers is allocated with kcalloc() - no need to
memset() it to 0 right after that.

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
 fs/ceph/xattr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index c9c2b887381e..f89698cdbc41 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -597,7 +597,7 @@ start:
 	err = -ENOMEM;
 	if (!xattrs)
 		goto bad_lock;
-	memset(xattrs, 0, numattr*sizeof(struct ceph_xattr *));
+
 	for (i = 0; i < numattr; i++) {
 		xattrs[i] = kmalloc(sizeof(struct ceph_inode_xattr),
 				    GFP_NOFS);
--
1.7.10.4
[PATCH] ceph: fix sizeof(struct tYpO *) typo
struct ceph_xattr -> struct ceph_inode_xattr

Reported-by: Toralf Förster toralf.foers...@gmx.de
Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
 fs/ceph/xattr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index f89698cdbc41..12f58d22e017 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -592,7 +592,7 @@ start:
 	xattr_version = ci->i_xattrs.version;
 	spin_unlock(&ci->i_ceph_lock);

-	xattrs = kcalloc(numattr, sizeof(struct ceph_xattr *),
+	xattrs = kcalloc(numattr, sizeof(struct ceph_inode_xattr *),
 			 GFP_NOFS);
 	err = -ENOMEM;
 	if (!xattrs)
--
1.7.10.4
Re: sizeof (struct tYpO *) : it is just a typo or rather a bug ?
On Thu, 24 Jul 2014, Toralf Förster wrote:
> Inspired by this typo fix
> http://article.gmane.org/gmane.linux.kernel/1754640
> I grep'ed the current git tree of linus for similar issues.

I wonder if we couldn't use Coccinelle to do that? I would say it would
be not as cool as deep grep magick, but Coccinelle is cool by definition
and therefore immune from any such comparisons :-)

> Nevertheless I was told that gcc doesn't complain about such things due
> to eventually evaluating it to sizeof(null).
>
> I'm however curious if at least a warning should be emitted in such a
> case, or?

Well, it cannot become a real bug because the moment the code changes to
actually access/dereference such a typo, it will cause the compiler to
abort with an error. If gcc will have to waste a measurable amount of
time to issue such a warning, it is not worth it.

OTOH, such typos could confuse someone reading the code into thinking
they're dealing with a different structure or something, and it _is_
incorrect code no matter how harmless, so it makes sense to fix all such
typos eventually.

--
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
Re: Forcing Ceph into mapping all objects to a single PG
The main issue however is not the hash's strength, but the fact that
once pre-computed, I'm able to use preimages on **every Ceph cluster out
there**. (As the hash function's output is a deterministic function of
the object's name only.)

I agree in that the general issue is inherent in hash-placement
systems. But what I don't agree with is the following: Why do I have to
be able to calculate my object's placement for **every Ceph cluster**
out there? Why does it not suffice for me to be able to calculate the
placement only for the cluster I'm currently accessing?

From a logical standpoint it seems reasonable. Why then, are we not
able to constrain the placement calculation in that regard? If the
placement is bound to a specific cluster, it should suffice to derive
e.g. a key for SipHash based on cluster specifics. Is this doable from
an implementation point of view?

Note: I only did this as a proof-of-concept for the object store. Think
about the implications if you're able to do this e.g. for every RadosGW
out there and services using RadosGW.
Re: [PATCH 7/8] rbd: do not read in parent info before snap context
On 07/25/2014 03:36 AM, Ilya Dryomov wrote:
>>> +	if (ret)
>>> +		return ret;
>
> I'll move this bit to harden rbd_dev_refresh() commit.

Sounds good.

-Alex
Re: [PATCH] ceph: fix sizeof(struct tYpO *) typo
On 07/25/2014 04:26 AM, Ilya Dryomov wrote:
> struct ceph_xattr -> struct ceph_inode_xattr

Looks good. I can't find the definition of ceph_xattr.

Reviewed-by: Alex Elder el...@linaro.org

> Reported-by: Toralf Förster toralf.foers...@gmx.de
> Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
> ---
>  fs/ceph/xattr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
> index f89698cdbc41..12f58d22e017 100644
> --- a/fs/ceph/xattr.c
> +++ b/fs/ceph/xattr.c
> @@ -592,7 +592,7 @@ start:
>  	xattr_version = ci->i_xattrs.version;
>  	spin_unlock(&ci->i_ceph_lock);
>
> -	xattrs = kcalloc(numattr, sizeof(struct ceph_xattr *),
> +	xattrs = kcalloc(numattr, sizeof(struct ceph_inode_xattr *),
>  			 GFP_NOFS);
>  	err = -ENOMEM;
>  	if (!xattrs)
Re: [PATCH] ceph: fix sizeof(struct tYpO *) typo
On Fri, Jul 25, 2014 at 4:52 PM, Alex Elder el...@ieee.org wrote:
> On 07/25/2014 04:26 AM, Ilya Dryomov wrote:
>> struct ceph_xattr -> struct ceph_inode_xattr
>
> Looks good. I can't find the definition of ceph_xattr.

That's the point of the patch ;)

Thanks,

                Ilya
Re: [PATCH 6/8] rbd: update mapping size only on refresh
On 07/24/2014 10:10 AM, Ilya Dryomov wrote:
> On Thu, Jul 24, 2014 at 5:46 PM, Ilya Dryomov ilya.dryo...@inktank.com wrote:
>> On Thu, Jul 24, 2014 at 5:25 PM, Alex Elder el...@ieee.org wrote:
>>> On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
>>>> There is no sense in trying to update the mapping size before it's
>>>> even been set.
>>>
>>> It took me a bit to follow this. But basically there is no mapping
>>> unless it's mapped. So previously this was updating the mapping
>>> information even for unmapped parent (or other) images. There's no
>>> need to update the mapping size for a snapshot--it'll never change.
>>> Is that right? If not, please advise; otherwise:
>>
>> No. Previously it was updating the mapping size both on the initial
>> map and on refresh (of the base image). Doing that on the initial
>> map doesn't make sense: not only aren't struct rbd_mapping fields
>> properly initialized at that point - rbd_dev_mapping_set() is called
>> much later in the map sequence - snap_id isn't initialized either
>> (at least in the format 2 case, I haven't looked too closely at the
>> format 1 case). And just in general, trying to update something
>> before it had a chance to change is bogus..
>
> BTW, while we are on the subject of struct rbd_mapping, is there any
> particular reason to keep around both the base image values and the
> actual mapping values? I am tempted to change the mapping sequence so
> that 1) the snap context is read in immediately after watch setup,
> before anything else is done, 2) the supplied snap name is resolved
> into an id and 3) the actual (i.e. based on snap_id) mapping size,
> features, parent_overlap, etc are read in directly into struct
> rbd_image_header. That would allow to rip out struct rbd_mapping
> entirely and make the code more clear.

The rbd_mapping structure started out well-intentioned but over time it
seems clear it hasn't added the value it was intended to add. Here's
where it got created:

f84344f rbd: separate mapping info in rbd_dev

The only original field that survives is read_only. There's no harm at
all in just moving the fields in that structure out into the rbd_device
structure.

Preserving the base image size in the header structure is an artifact
of the version 1 code, where it held the last version of the on-disk
header data. The version 2 code maintains it for consistency. You
could eliminate that I suppose. I think the result would require
rbd_header_from_disk() to know about the mapping rather than doing a
simple conversion of data from one form to another.

I say try it, and if you like the result I probably will too...

-Alex

> Thanks,
>
>                 Ilya
Fwd: S3 API Compatibility support
Hi Team:

As per the ceph document, a few S3 APIs' compatibility is not supported.
Link: http://ceph.com/docs/master/radosgw/s3/

Is there a plan to support the unsupported items in the above table? Or
is anyone working on this?

Thanks
Swami
Re: Forcing Ceph into mapping all objects to a single PG
On Fri, 25 Jul 2014, Daniel Hofmann wrote:
> The main issue however is not the hash's strength, but the fact that
> once pre-computed, I'm able to use preimages on **every Ceph cluster
> out there**. (As the hash function's output is a deterministic
> function of the object's name only.)
>
> I agree in that the general issue is inherent in hash-placement
> systems. But what I don't agree with is the following: Why do I have
> to be able to calculate my object's placement for **every Ceph
> cluster** out there? Why does it not suffice for me to be able to
> calculate the placement only for the cluster I'm currently accessing?
>
> From a logical standpoint it seems reasonable. Why then, are we not
> able to constrain the placement calculation in that regard? If the
> placement is bound to a specific cluster, it should suffice to derive
> e.g. a key for SipHash based on cluster specifics. Is this doable from
> an implementation point of view?
>
> Note: I only did this as a proof-of-concept for the object store.
> Think about the implications if you're able to do this e.g. for every
> RadosGW out there and services using RadosGW.

It would be really easy to add a random salt to the pg_pool_t and feed
that into the object -> pg hash mapping. Note, by the way, that for new
clusters the pool id is already fed in here, so there is a *tiny* bit
of variation depending on what order the pools were created in
(although probably not enough to meaningfully improve security).

We could also add a new hash type that is more secure. Rjenkins is used
by default but the choice of hash is already parameterized.

sage
Re: First attempt at rocksdb monitor store stress testing
On 07/24/2014 08:28 PM, Shu, Xinxin wrote:
> Hi mark, I am looking forward to your results on SSDs.

Me too!

> rocksdb generates a crc of data to be written, this cannot be switched
> off (but can be substituted with xxhash). there are two options,
> Options::verify_checksums_in_compaction and
> ReadOptions::verify_checksums. If we disable these two options, i
> think cpu usage will go down. If we use universal compaction, this is
> not friendly with read operation.

I'm wondering if it might not be so bad for us given the kind of work
the mon does. We write out a lot of maps and incrementals, but I don't
think the mon goes back and updates objects very often. Assuming I
understand how universal vs level compaction works (I might not!) this
should help contain the number of SST files that objects get spread
across, which causes all of the extra read seeks with universal
compaction.

> Btw, can you list your rocksdb configuration?

Sure, right now it's all the stock defaults in config_opts except the
new option I added for universal compaction. I am hoping I can run some
more tests today and this weekend with tuned ones.

> Cheers,
> xinxin
>
> -----Original Message-----
> From: Mark Nelson [mailto:mark.nel...@inktank.com]
> Sent: Friday, July 25, 2014 7:45 AM
> To: Shu, Xinxin; ceph-devel@vger.kernel.org
> Subject: Re: First attempt at rocksdb monitor store stress testing
>
> Earlier today I modified the rocksdb options so I could enable
> universal compaction. Overall performance is lower but I don't see the
> hang/stall in the middle of the test either. Instead the disk is
> basically pegged with 100% writes. I suspect average latency is higher
> than leveldb, but the highest latency is about 5-6s while we were
> seeing 30s spikes for leveldb with levelled (heh) compaction. I
> haven't done much tuning either way yet. It may be that if we keep
> level 0 and level 1 roughly the same size we can reduce stalls in the
> levelled setups. It will also be interesting to see what happens in
> these tests on SSDs.
>
> Mark
>
> On 07/24/2014 06:13 AM, Mark Nelson wrote:
>> Hi Xinxin,
>>
>> Thanks! I wonder as well if it might be interesting to expose the
>> options related to universal compaction? It looks like rocksdb
>> provides a lot of interesting knobs you can adjust!
>>
>> Mark
>>
>> On 07/24/2014 12:08 AM, Shu, Xinxin wrote:
>>> Hi mark, I think this may be related to the 'verify_checksums'
>>> config option: when ReadOptions is initialized, default this option
>>> is true, all data read from underlying storage will be verified
>>> against corresponding checksums. however, this option cannot be
>>> configured in the wip-rocksdb branch. I will modify the code to make
>>> this option configurable.
>>>
>>> Cheers,
>>> xinxin
>>>
>>> -----Original Message-----
>>> From: ceph-devel-ow...@vger.kernel.org
>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
>>> Sent: Thursday, July 24, 2014 7:14 AM
>>> To: ceph-devel@vger.kernel.org
>>> Subject: First attempt at rocksdb monitor store stress testing
>>>
>>> Hi Guys,
>>>
>>> So I've been interested lately in leveldb 99th percentile latency
>>> (and the amount of write amplification we are seeing) with leveldb.
>>> Joao mentioned he has written a tool called mon-store-stress in
>>> wip-leveldb-misc to try to provide a means to roughly guess at
>>> what's happening on the mons under heavy load. I cherry-picked it
>>> over to wip-rocksdb and after a couple of hacks was able to get
>>> everything built and running with some basic tests. There was
>>> little tuning done and I don't know how realistic this workload is,
>>> so don't assume this means anything yet, but some initial results
>>> are here:
>>>
>>> http://nhm.ceph.com/mon-store-stress/First%20Attempt.pdf
>>>
>>> Command that was used to run the tests:
>>>
>>> ./ceph-test-mon-store-stress --mon-keyvaluedb leveldb|rocksdb
>>> --write-min-size 50K --write-max-size 2M --percent-write 70
>>> --percent-read 30 --keep-state --test-seed 1406137270 --stop-at 5000 foo
>>>
>>> The most interesting bit right now is that rocksdb seems to be
>>> hanging in the middle of the test (left it running for several
>>> hours). CPU usage on one core was quite high during the hang.
>>> Profiling using perf with dwarf symbols I see:
>>>
>>> -  49.14%  ceph-test-mon-s  ceph-test-mon-store-stress  [.]
>>>      unsigned int rocksdb::crc32c::ExtendImpl<rocksdb::crc32c::Fast_CRC32>(unsigned int, char const*, unsigned long)
>>>    - unsigned int rocksdb::crc32c::ExtendImpl<rocksdb::crc32c::Fast_CRC32>(unsigned int, char const*, unsigned long)
>>>       51.70% rocksdb::ReadBlockContents(rocksdb::RandomAccessFile*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, bool)
>>>       48.30% rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*)
>>>
>>> Not sure what's going on yet, may need to try to enable
>>> logging/debugging in rocksdb.
>>>
>>> Thoughts/Suggestions welcome. :)
>>>
>>> Mark
Re: Fwd: S3 API Compatibility support
On Fri, 25 Jul 2014, M Ranga Swami Reddy wrote:
> Hi Team:
> As per the ceph document a few S3 APIs compatibility not supported.
> Link: http://ceph.com/docs/master/radosgw/s3/
> Is there plan to support the unsupported items in the above table? or
> Any working on this?

Yes. Unfortunately this table isn't particularly detailed or accurate
or up to date. The main gap, I think, is versioned objects.

Are there specific parts of the S3 API that are missing that you need?
That sort of info is very helpful for prioritizing effort...

sage
Re: [PATCH] ceph: fix sizeof(struct tYpO *) typo
On 07/25/2014 08:27 AM, Ilya Dryomov wrote:
> On Fri, Jul 25, 2014 at 4:52 PM, Alex Elder el...@ieee.org wrote:
>> On 07/25/2014 04:26 AM, Ilya Dryomov wrote:
>>> struct ceph_xattr -> struct ceph_inode_xattr
>>
>> Looks good. I can't find the definition of ceph_xattr.
>
> That's the point of the patch ;)

Well, I thought the point was that the type was wrong; I was surprised
that the type was non-existent.

-Alex
Re: First attempt at rocksdb monitor store stress testing
Hi Xinxin,

I'm trying to enable the rocksdb log file as described in config_opts
using:

rocksdb_log = path to log file

The file gets created but is empty. Any ideas?

Mark

On 07/24/2014 08:28 PM, Shu, Xinxin wrote:
> Hi mark, I am looking forward to your results on SSDs.
>
> rocksdb generates a crc of data to be written, this cannot be switched
> off (but can be substituted with xxhash). there are two options,
> Options::verify_checksums_in_compaction and
> ReadOptions::verify_checksums. If we disable these two options, i
> think cpu usage will go down. If we use universal compaction, this is
> not friendly with read operation.
>
> Btw, can you list your rocksdb configuration?
>
> Cheers,
> xinxin
>
> -----Original Message-----
> From: Mark Nelson [mailto:mark.nel...@inktank.com]
> Sent: Friday, July 25, 2014 7:45 AM
> To: Shu, Xinxin; ceph-devel@vger.kernel.org
> Subject: Re: First attempt at rocksdb monitor store stress testing
>
> Earlier today I modified the rocksdb options so I could enable
> universal compaction. Overall performance is lower but I don't see the
> hang/stall in the middle of the test either. Instead the disk is
> basically pegged with 100% writes. I suspect average latency is higher
> than leveldb, but the highest latency is about 5-6s while we were
> seeing 30s spikes for leveldb with levelled (heh) compaction. I
> haven't done much tuning either way yet. It may be that if we keep
> level 0 and level 1 roughly the same size we can reduce stalls in the
> levelled setups. It will also be interesting to see what happens in
> these tests on SSDs.
>
> Mark
>
> On 07/24/2014 06:13 AM, Mark Nelson wrote:
>> Hi Xinxin,
>>
>> Thanks! I wonder as well if it might be interesting to expose the
>> options related to universal compaction? It looks like rocksdb
>> provides a lot of interesting knobs you can adjust!
>>
>> Mark
>>
>> On 07/24/2014 12:08 AM, Shu, Xinxin wrote:
>>> Hi mark, I think this may be related to the 'verify_checksums'
>>> config option: when ReadOptions is initialized, default this option
>>> is true, all data read from underlying storage will be verified
>>> against corresponding checksums. however, this option cannot be
>>> configured in the wip-rocksdb branch. I will modify the code to make
>>> this option configurable.
>>>
>>> Cheers,
>>> xinxin
>>>
>>> -----Original Message-----
>>> From: ceph-devel-ow...@vger.kernel.org
>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
>>> Sent: Thursday, July 24, 2014 7:14 AM
>>> To: ceph-devel@vger.kernel.org
>>> Subject: First attempt at rocksdb monitor store stress testing
>>>
>>> Hi Guys,
>>>
>>> So I've been interested lately in leveldb 99th percentile latency
>>> (and the amount of write amplification we are seeing) with leveldb.
>>> Joao mentioned he has written a tool called mon-store-stress in
>>> wip-leveldb-misc to try to provide a means to roughly guess at
>>> what's happening on the mons under heavy load. I cherry-picked it
>>> over to wip-rocksdb and after a couple of hacks was able to get
>>> everything built and running with some basic tests. There was
>>> little tuning done and I don't know how realistic this workload is,
>>> so don't assume this means anything yet, but some initial results
>>> are here:
>>>
>>> http://nhm.ceph.com/mon-store-stress/First%20Attempt.pdf
>>>
>>> Command that was used to run the tests:
>>>
>>> ./ceph-test-mon-store-stress --mon-keyvaluedb leveldb|rocksdb
>>> --write-min-size 50K --write-max-size 2M --percent-write 70
>>> --percent-read 30 --keep-state --test-seed 1406137270 --stop-at 5000 foo
>>>
>>> The most interesting bit right now is that rocksdb seems to be
>>> hanging in the middle of the test (left it running for several
>>> hours). CPU usage on one core was quite high during the hang.
>>> Profiling using perf with dwarf symbols I see:
>>>
>>> -  49.14%  ceph-test-mon-s  ceph-test-mon-store-stress  [.]
>>>      unsigned int rocksdb::crc32c::ExtendImpl<rocksdb::crc32c::Fast_CRC32>(unsigned int, char const*, unsigned long)
>>>    - unsigned int rocksdb::crc32c::ExtendImpl<rocksdb::crc32c::Fast_CRC32>(unsigned int, char const*, unsigned long)
>>>       51.70% rocksdb::ReadBlockContents(rocksdb::RandomAccessFile*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, bool)
>>>       48.30% rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*)
>>>
>>> Not sure what's going on yet, may need to try to enable
>>> logging/debugging in rocksdb.
>>>
>>> Thoughts/Suggestions welcome. :)
>>>
>>> Mark
Re: Fwd: S3 API Compatibility support
Thanks for the quick reply. Yes, versioned objects - missing in ceph
ATM.

I am looking for: bucket lifecycle (get/put/delete), bucket location,
put object notification and object restore (ie versioned object) S3 API
support.

Please let me know if any of the above work is in progress or someone
has planned to work on it.

Thanks
Swami

On Fri, Jul 25, 2014 at 9:19 PM, Sage Weil sw...@redhat.com wrote:
> On Fri, 25 Jul 2014, M Ranga Swami Reddy wrote:
>> Hi Team:
>> As per the ceph document a few S3 APIs compatibility not supported.
>> Link: http://ceph.com/docs/master/radosgw/s3/
>> Is there plan to support the unsupported items in the above table? or
>> Any working on this?
>
> Yes. Unfortunately this table isn't particularly detailed or accurate
> or up to date. The main gap, I think, is versioned objects.
>
> Are there specific parts of the S3 API that are missing that you need?
> That sort of info is very helpful for prioritizing effort...
>
> sage
Re: Fwd: S3 API Compatibility support
On Fri, Jul 25, 2014 at 10:14 AM, M Ranga Swami Reddy swamire...@gmail.com wrote: Thanks for the quick reply. Yes, versioned objects are missing in ceph ATM. I am looking for S3 API support for: bucket lifecycle (get/put/delete), bucket location, put object notification, and object restore (i.e. versioned objects). Please let me know if any of the above work is in progress or someone plans to work on it. I opened an issue for bucket lifecycle (we already had an issue open for object expiration, though). We do have bucket location already (part of the multi-region feature). Object versioning is definitely on our backlog and one that we'll hopefully implement sooner rather than later. With regard to object notification, it'll require a notification service, which is a bit out of scope. Integrating the gateway with such a service wouldn't be hard, but we'll need to have that first. Yehuda Thanks Swami On Fri, Jul 25, 2014 at 9:19 PM, Sage Weil sw...@redhat.com wrote: [earlier discussion snipped]
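To make the bucket lifecycle request above concrete, here is a small sketch of the lifecycle document an S3 client would PUT against a bucket. This is illustrative only: the bucket name, endpoint, and rule ID are made up, and the dict mirrors the shape a present-day boto3 `put_bucket_lifecycle_configuration()` call takes; no gateway is contacted.

```python
# Sketch of a bucket lifecycle rule of the kind being requested, expressed
# as the document a boto3-style put_bucket_lifecycle_configuration() call
# takes. Rule ID and prefix are hypothetical; nothing is sent anywhere.
lifecycle = {
    "Rules": [
        {
            "ID": "expire-old-logs",        # hypothetical rule name
            "Filter": {"Prefix": "logs/"},  # apply only to logs/* objects
            "Status": "Enabled",
            "Expiration": {"Days": 30},     # delete matching objects after 30 days
        }
    ]
}

# Against a gateway that supports lifecycle, this would be sent roughly as:
#   s3 = boto3.client("s3", endpoint_url="http://rgw.example.com")  # hypothetical
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="mybucket", LifecycleConfiguration=lifecycle)
print(lifecycle["Rules"][0]["Expiration"]["Days"])  # -> 30
```

The GET and DELETE halves of the API are the inverse operations on the same document, which is why the three tend to be implemented together.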
Re: Forcing Ceph into mapping all objects to a single PG
Sage Weil wrote: On Fri, 25 Jul 2014, Daniel Hofmann wrote: The main issue however is not the hash's strength, but the fact that once pre-computed, I'm able to use preimages on **every Ceph cluster out there**. (As the hash function's output is a deterministic function of the object's name only) I agree in that the general issue is inherent in hash-placement systems. But what I don't agree with is the following: Why do I have to be able to calculate my object's placement for **every Ceph cluster** out there? Why does it not suffice for me to be able to calculate the placement only for the cluster I'm currently accessing? From a logical standpoint it seems reasonable. Why then, are we not able to constrain the placement calculation in that regard? If the placement is bound to a specific cluster it should suffice to derive e.g. a key for SipHash based on cluster specifics. Is this doable from an implementation point of view? Note: I only did this as a proof-of-concept for the object store. Think about the implications, if you're able to do this e.g. for every RadosGW out there and services using RadosGW. It would be really easy to add a random salt to the pg_pool_t and feed that into the object -> pg hash mapping. Note, by the way, that for new clusters the pool id is already fed in here, so there is a *tiny* bit of variation depending on what order the pools were created in (although probably not enough to meaningfully improve security). I just had a thought - is the FSID a parameter? If it is, then there's actually no need for a further salt: FSID is per cluster, preventing cross-cluster precomputation, and if the pool ID is fed in then each pool needs collisions calculated separately as well. In other words, the (fsid, poolid) tuple is almost certainly sufficient salt in and of itself. We could also add a new hash type that is more secure. Rjenkins is used by default but the choice of hash is already parameterized. 
sage
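The salting idea can be sketched as follows. This is a toy Python model, not Ceph's actual placement code (Ceph hashes object names with rjenkins, not SHA-256): feeding the (fsid, pool_id) tuple into the hash alongside the object name makes the object -> PG mapping cluster-specific, so a preimage precomputed against one cluster is useless on any other.

```python
import hashlib

def place_object(name: str, fsid: str, pool_id: int, pg_num: int) -> int:
    """Toy placement hash where the (fsid, pool_id) tuple acts as a
    per-cluster, per-pool salt. Purely illustrative; Ceph really uses
    rjenkins and CRUSH, and SHA-256 here only demonstrates the salting."""
    h = hashlib.sha256()
    h.update(fsid.encode())                   # per-cluster salt
    h.update(pool_id.to_bytes(8, "little"))   # per-pool salt
    h.update(name.encode())                   # the object name itself
    return int.from_bytes(h.digest()[:8], "little") % pg_num

# The same object name maps differently on clusters with different fsids,
# so a collision computed offline for one cluster does not transfer.
print(place_object("myobject", fsid="aaaa-1111", pool_id=3, pg_num=4096))
print(place_object("myobject", fsid="bbbb-2222", pool_id=3, pg_num=4096))
```

Since both fsid and pool id are fixed for the lifetime of a pool, the mapping stays stable for any one cluster, which is the property hash placement needs.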
LTTng unfriendly with mixed static/dynamic linking
LTTng requires tracepoints to be linked into a program only once. If tracepoints are linked in multiple times, the program crashes at startup with: LTTng-UST: Error (-17) while registering tracepoint probe. Duplicate registration of tracepoint probes having the same name is not allowed. This is problematic when mixing static and dynamic linking. If the tracepoints are in a static library, that library can end up in an executable multiple times by being linked in directly, as well as being statically linked into a dynamic library. Even if the tracepoints are not linked directly into the executable, they can be statically linked into multiple dynamic libraries that the executable loads. For us, this problem shows up with libcommon, and could show up with others such as libosd. (In general, I think anything added to noinst_LTLIBRARIES is static, and anything added to lib_LTLIBRARIES is dynamic.) There are a few ways of solving the issue: 1. Change every library that has tracepoints, like libcommon, from static to dynamic. This could be a big change, as at the very least we'd have to rename the library to something like libceph_common to avoid conflicts, since now it would be an installed file. This has the advantage of keeping code and its tracepoints in the same library. 2. Keep tracepoints in a separate static library. For example, libcommon and libcommon_tp. Unfortunately, every executable (but not library!) that links in libcommon (directly or indirectly) would have to manually link in libcommon_tp, and I don't think Automake could automate that. 3. Keep tracepoints in a separate dynamic library. In this case, I think libcommon could depend on libcommon_tp, so executables would not have to manually link in libcommon_tp. (I'm not an Automake expert, so let me know if I'm wrong on that.) Again, libcommon_tp would be an installed file, so we'd want to rename it to something like libceph_common_tp. I attached a minimal test case of the problem. 
Thoughts or suggestions? Thanks, Adam Crume

[Attachment: lttng-duplicate-tracepoints.tar.gz (GNU Zip compressed data)]
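As a toy model of the failure mode described above (the real registration happens in C constructors emitted by the LTTng-UST tracepoint macros; the `ProbeRegistry` class and probe name below are purely illustrative), the crash amounts to a process-wide registry that refuses a second registration of the same tracepoint name with -EEXIST, which is the -17 in the error message:

```python
import errno

class ProbeRegistry:
    """Toy model of LTTng-UST's process-wide tracepoint probe registry.
    Hypothetical API: the real thing is C code run at library load time."""
    def __init__(self):
        self._probes = {}

    def register(self, name, probe):
        if name in self._probes:
            # Mirrors "Error (-17) while registering tracepoint probe":
            # -17 is -EEXIST, i.e. duplicate registration of the same name.
            raise OSError(errno.EEXIST, "duplicate tracepoint probe %r" % name)
        self._probes[name] = probe

registry = ProbeRegistry()
# The copy of the static library linked into the executable registers first.
registry.register("myprovider:my_event", lambda *args: None)
try:
    # A second copy, statically linked into a loaded .so, registers again.
    registry.register("myprovider:my_event", lambda *args: None)
except OSError as e:
    print("startup failure, errno", e.errno)
```

This is why the fix has to be structural (link the tracepoints into exactly one object in the process) rather than a code change inside the tracepoints themselves.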
Re: LTTng unfriendly with mixed static/dynamic linking
I tried all solutions, and it looks like only #1 works. #2 gives the error /usr/bin/ld: main: hidden symbol `tracepoint_dlopen' in common_tp.a(common_tp.o) is referenced by DSO when linking. #3 gives the error ./liblibrary.so: undefined reference to `tracepoint_dlopen' when linking. (Linking is complicated by the fact that LTTng uses special symbol attributes, and tracepoint_dlopen happens to be weak and hidden.) Unless I'm mistaken (and I very well may be), we will have to ensure that all traced code is either 1) placed in a shared library and never statically linked elsewhere, or 2) never linked into any shared library. Thoughts? Adam On Fri, Jul 25, 2014 at 11:48 AM, Adam Crume adamcr...@gmail.com wrote: [original problem description snipped]

[Attachments: lttng-duplicate-tracepoints-sol1.tar.gz, lttng-duplicate-tracepoints-sol2.tar.gz, lttng-duplicate-tracepoints-sol3.tar.gz (GNU Zip compressed data)]
Re: LTTng unfriendly with mixed static/dynamic linking
On Fri, 25 Jul 2014, Adam Crume wrote: I tried all solutions, and it looks like only #1 works. #2 gives the error /usr/bin/ld: main: hidden symbol `tracepoint_dlopen' in common_tp.a(common_tp.o) is referenced by DSO when linking. #3 gives the error ./liblibrary.so: undefined reference to `tracepoint_dlopen' when linking. (Linking is complicated by the fact that LTTng uses special symbol attributes, and tracepoint_dlopen happens to be weak and hidden.) I think #1 is good for other reasons, too. We already have issues (I think!) with binaries that use librados and also link libcommon statically. Specifically, I think we've seen that having mismatched versions of librados and the binary installed leads to confusion about the contents/structure of md_config_t (g_conf). This is one of the reasons why the libcommon and rgw packages require an identical version of librados or librbd or whatever--to avoid this inconsistency. Unless I'm mistaken (and I very well may be), we will have to ensure that all traced code is either 1) placed in a shared library and never statically linked elsewhere, or 2) never linked into any shared library. That sounds doable and sane to me: - librados, librbd, libceph_common, etc. would have the tracepoints in the same .so - ceph-osd could have its own tracepoints, as long as they are always static. (For example, libos.la, which is linked statically by ceph-mon and ceph-osd but never dynamically.) One pain point in all of this, though, is that the libceph_common.so (or whatever) will need to go into a separate package that is required by librados.so and librbd and ceph-common and everything else. 'ceph-common' is what this ought to be called, but we've co-opted it to mean 'ceph clients'. I'm not sure if it is worthwhile to go through the hijinks to rename ceph-common to ceph-clients and repurpose ceph-common for this? sage Thoughts? 
Adam On Fri, Jul 25, 2014 at 11:48 AM, Adam Crume adamcr...@gmail.com wrote: [original problem description snipped]
issue #8937
I pushed a fix (which is not heavily tested) to wip-8937. The problem seems to have been triggered by the fix to #8928. One thing that I noticed was that previously max_chunk_size at that point was actually used as the minimum write size (how appropriate). It's used elsewhere as the max size for data drained off the front end. Basically the fix for #8928 now makes max_chunk_size limit the rados writes to no more than that. Since up until now we weren't doing that, I'm not too sure that this change won't have significant performance implications. The options that we have at this point are: - keep it as it is now, max_chunk_size limits rados writes - revert it to the old behavior, but add some code to handle the pool alignment when needed In any case, I don't think this is ready to go into firefly. Yehuda
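To make the chunking behavior concrete, here is a small sketch (illustrative only, not the radosgw code) of what "max_chunk_size limits the rados writes" means: a pending buffer is split into writes of at most max_chunk_size bytes, with non-final writes optionally kept on an alignment boundary for pools that require it, which is the second option above.

```python
def chunk_writes(total_len, max_chunk_size, align=1):
    """Yield (offset, length) pairs covering total_len bytes, with each
    write at most max_chunk_size bytes and every non-final write ending
    on an align-byte boundary. A sketch of the behavior under discussion,
    not the actual radosgw implementation."""
    assert 1 <= align <= max_chunk_size
    off = 0
    while off < total_len:
        length = min(max_chunk_size, total_len - off)
        end = off + length
        if end < total_len and align > 1:
            end -= end % align          # round non-final writes down to alignment
            length = end - off
        yield off, length
        off += length

print(list(chunk_writes(10, 4)))        # -> [(0, 4), (4, 4), (8, 2)]
```

Capping every write this way trades fewer large ops for more small ones, which is why the performance impact of suddenly enforcing the cap is hard to predict without measuring.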
wip-objecter
Hey John, I fixed up the mds osdmap wait stuff and squashed it into wip-objecter. That rebased out from underneath your branch; sorry. Will try to look at the other patches shortly! sage