Re: Long peering - throttle at FileStore::queue_transactions
On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil wrote:
> On Mon, 4 Jan 2016, Guang Yang wrote:
>> Hi Cephers,
>> Happy New Year! I have a question regarding the long PG peering..
>>
>> Over the last several days I have been looking into the *long peering*
>> problem when we start an OSD / OSD host. What I observed was that the
>> two peering worker threads were throttled (stuck) when trying to
>> queue new transactions (writing the pg log), so the peering process was
>> dramatically slowed down.
>>
>> The first question that came to me was, what were the transactions in
>> the queue? The major ones, as I saw, included:
>>
>> - The osd_map and incremental osd_map. This happens if the OSD had
>> been down for a while (in a large cluster), or when the cluster got
>> upgraded, which left the osd_map epoch the down OSD had far behind
>> the latest osd_map epoch. During OSD boot, it needs to persist all
>> those osd_maps, which generates lots of filestore transactions
>> (linear in the epoch gap).
>> > As the PG was not involved in most of those epochs, could we only take and
>> > persist those osd_maps which matter to the PGs on the OSD?
>
> This part should happen before the OSD sends the MOSDBoot message, before
> anyone knows it exists. There is a tunable threshold that controls how
> recent the map has to be before the OSD tries to boot. If you're
> seeing this in the real world, we probably just need to adjust that value
> way down to something smaller.

It queues the transactions and then sends out the MOSDBoot, so there is still a chance that it could contend with the peering ops (especially on large clusters, where lots of activity generates many osdmap epochs). Any chance we could change *queue_transactions* to *apply_transactions*, so that we block there waiting for the osdmap to be persisted? At least we may be able to do that during OSD boot?
The concern is that, if the OSD is active, apply_transaction would take longer while holding the osd_lock.. I can't find such a tunable; could you elaborate? Thanks!

> sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
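The throttling behaviour described in this thread can be sketched as a bounded budget that blocks producers until the apply path returns budget. This is an illustrative Python sketch (the class and names are hypothetical, not Ceph's actual C++ Throttle), showing why a peering thread stalls in queue_transactions once the budget is exhausted:

```python
import threading

class Throttle:
    """Minimal sketch of a FileStore-style queue throttle (hypothetical
    class, not Ceph's actual code): get() blocks the caller until enough
    budget has been returned via put() by the apply/commit path."""
    def __init__(self, max_budget):
        self.max_budget = max_budget
        self.current = 0
        self.cond = threading.Condition()

    def get(self, cost):
        # This is where queue_transactions stalls the peering worker threads.
        with self.cond:
            while self.current + cost > self.max_budget:
                self.cond.wait()
            self.current += cost

    def put(self, cost):
        with self.cond:
            self.current -= cost
            self.cond.notify_all()

# A peering thread queueing pg-log/osdmap transactions stalls once the
# budget is exhausted, until the background apply thread drains the queue.
throttle = Throttle(max_budget=2)
events = []

def apply_thread():
    for _ in range(3):
        events.append('applied')
        throttle.put(1)

throttle.get(1)
throttle.get(1)                 # budget now exhausted
events.append('queued 2 txns')
t = threading.Thread(target=apply_thread)
t.start()
throttle.get(1)                 # blocks here until apply_thread frees budget
events.append('queued 3rd txn')
t.join()
```

The proposal in the thread amounts to making the booting OSD call the blocking path itself (apply rather than queue) so that osdmap persistence does not compete for this budget with peering ops.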
Re: OSD data files are OSD logs
Thanks Sam for the confirmation.

Thanks,
Guang

On Mon, Jan 4, 2016 at 3:59 PM, Samuel Just wrote:
> IIRC, you are running giant. I think that's the log rotate dangling
> fd bug (not fixed in giant since giant is EOL). Fixed upstream in
> 8778ab3a1ced7fab07662248af0c773df759653d; the firefly backport is
> b8e3f6e190809febf80af66415862e7c7e415214.
> -Sam
>
> On Mon, Jan 4, 2016 at 3:37 PM, Guang Yang wrote:
>> Hi Cephers,
>> Before I open a tracker, I would like to check whether it is a known issue..
>>
>> On one of our clusters, there was an OSD crash during repair. The
>> crash happened after we issued a PG repair for inconsistent PGs, which
>> failed because the recorded file size (within the xattr) mismatched
>> the actual file size.
>>
>> The mismatch was caused by the fact that the content of the data file
>> was OSD log output. The following is from osd.354 on c003:
>>
>> -rw-r--r-- 1 yahoo root 75168 Jan 3 07:30 default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7
>> -bash-4.1$ head "default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7"
>> 2016-01-03 07:30:01.600119 7f7fe2096700 15 filestore(/home/y/var/lib/ceph/osd/ceph-354) getattrs 3.171s7_head/a2478171/default.12061.9_8396947527_52ac8b3ec6_o.jpg/head//3/18446744073709551615/7
>> 2016-01-03 07:30:01.604967 7f7fe2096700 10 filestore(/home/y/var/lib/ceph/osd/ceph-354) -ERANGE, len is 494
>> 2016-01-03 07:30:01.604984 7f7fe2096700 10 filestore(/home/y/var/lib/ceph/osd/ceph-354) -ERANGE, got 247
>> 2016-01-03 07:30:01.604986 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_user.rgw.idtag'
>> 2016-01-03 07:30:01.604996 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_'
>> 2016-01-03 07:30:01.605007 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting 'snapset'
>> 2016-01-03 07:30:01.605013 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_user.rgw.manifest'
>> 2016-01-03 07:30:01.605026 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting 'hinfo_key'
>> 2016-01-03 07:30:01.605042 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_user.rgw.x-amz-meta-origin'
>> 2016-01-03 07:30:01.605049 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_user.rgw.acl'
>>
>> This only happens on the clusters where we turned on verbose logging
>> (debug_osd/filestore=20). We are running ceph v0.87.
>>
>> Thanks,
>> Guang
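The dangling-fd class of bug Sam refers to can be demonstrated in a few lines. This is an illustrative Python sketch (not Ceph's actual code path): a logger caches a raw fd number, log rotation closes that fd, the kernel hands the same number to a newly opened data file, and the stale logger then writes log text into the data file:

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()
log_path = os.path.join(tmpdir, 'osd.354.log')
data_path = os.path.join(tmpdir, 'object__head_A2478171')

log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT)
cached_fd = log_fd                          # logger holds on to the raw number

os.close(log_fd)                            # "rotation" closes the fd...
data_fd = os.open(data_path, os.O_WRONLY | os.O_CREAT)
assert data_fd == cached_fd                 # ...and POSIX hands back the lowest free fd

# The stale logger keeps writing to its cached number, which now points at
# an object's data file, matching the symptom in the thread.
os.write(cached_fd, b'2016-01-03 07:30:01 filestore getattrs ...\n')
os.close(data_fd)

with open(data_path, 'rb') as f:
    data = f.read()                         # the "object" now contains log text
```

This also explains why only the clusters with verbose logging hit it: the more log writes in flight, the more likely one lands on a reused descriptor.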
OSD data files are OSD logs
Hi Cephers,
Before I open a tracker, I would like to check whether it is a known issue..

On one of our clusters, there was an OSD crash during repair. The crash happened after we issued a PG repair for inconsistent PGs, which failed because the recorded file size (within the xattr) mismatched the actual file size.

The mismatch was caused by the fact that the content of the data file was OSD log output. The following is from osd.354 on c003:

-rw-r--r-- 1 yahoo root 75168 Jan 3 07:30 default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7
-bash-4.1$ head "default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7"
2016-01-03 07:30:01.600119 7f7fe2096700 15 filestore(/home/y/var/lib/ceph/osd/ceph-354) getattrs 3.171s7_head/a2478171/default.12061.9_8396947527_52ac8b3ec6_o.jpg/head//3/18446744073709551615/7
2016-01-03 07:30:01.604967 7f7fe2096700 10 filestore(/home/y/var/lib/ceph/osd/ceph-354) -ERANGE, len is 494
2016-01-03 07:30:01.604984 7f7fe2096700 10 filestore(/home/y/var/lib/ceph/osd/ceph-354) -ERANGE, got 247
2016-01-03 07:30:01.604986 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_user.rgw.idtag'
2016-01-03 07:30:01.604996 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_'
2016-01-03 07:30:01.605007 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting 'snapset'
2016-01-03 07:30:01.605013 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_user.rgw.manifest'
2016-01-03 07:30:01.605026 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting 'hinfo_key'
2016-01-03 07:30:01.605042 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_user.rgw.x-amz-meta-origin'
2016-01-03 07:30:01.605049 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_user.rgw.acl'

This only happens on the clusters where we turned on verbose logging (debug_osd/filestore=20). We are running ceph v0.87.

Thanks,
Guang
Long peering - throttle at FileStore::queue_transactions
Hi Cephers,
Happy New Year! I have a question regarding the long PG peering..

Over the last several days I have been looking into the *long peering* problem when we start an OSD / OSD host. What I observed was that the two peering worker threads were throttled (stuck) when trying to queue new transactions (writing the pg log), so the peering process was dramatically slowed down.

The first question that came to me was, what were the transactions in the queue? The major ones, as I saw, included:

- The osd_map and incremental osd_map. This happens if the OSD had been down for a while (in a large cluster), or when the cluster got upgraded, which left the osd_map epoch the down OSD had far behind the latest osd_map epoch. During OSD boot, it needs to persist all those osd_maps, which generates lots of filestore transactions (linear in the epoch gap).
> As the PG was not involved in most of those epochs, could we only take and
> persist those osd_maps which matter to the PGs on the OSD?

- There are lots of deletion transactions: as a PG boots, it needs to merge the PG log from its peers, and for each deletion PG log entry, it queues the deletion transaction immediately.
> Could we delay queueing the transactions until all PGs on the host have
> peered?

Thanks,
Guang
Re: Newly added monitor infinitely sync store
On Mon, Nov 16, 2015 at 5:42 PM, Sage Weil wrote:
> On Mon, 16 Nov 2015, Guang Yang wrote:
>> I spoke to a leveldb expert; it looks like this is a known pattern with
>> the LSM-tree data structure: the tail latency for a range scan can be far
>> longer than the avg/median, since it might need to mmap several sst files
>> to get the record.
>>
>> Hi Sage,
>> Do you see any harm in increasing the default value for this setting
>> (e.g. to 20 minutes)? Or should I add the advice to the monitor
>> troubleshooting docs?
>
> The timeout is just for a round trip for the sync process, right? I think
> increasing it a bit (2x or 3x?) is okay, but 20 minutes to do a single
> chunk is a lot.

Yeah, the timeout is for a single round trip (there is a timeout reset mechanism on both sides).

> The underlying problem in your case is that your store is huge (by ~2
> orders of magnitude), so I'm not sure we should tune against that :)

Ok, let me apply the patches and monitor the db growth.

> sage
>
>> Thanks,
>> Guang
>>
>> On Fri, Nov 13, 2015 at 9:07 PM, Guang Yang wrote:
>> > Thanks Sage! I will definitely try those patches.
>> >
>> > For this one, I finally managed to bring the new monitor in by
>> > increasing the mon_sync_timeout from its default 60 to 6 to make
>> > sure the syncing does not restart and result in an infinite loop..
>> >
>> > On Fri, Nov 13, 2015 at 5:04 PM, Sage Weil wrote:
>> >> On Fri, 13 Nov 2015, Guang Yang wrote:
>> >>> Thanks Sage!
>> >>>
>> >>> On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil wrote:
>> >>> > On Fri, 13 Nov 2015, Guang Yang wrote:
>> >>> >> I was wrong in my previous analysis; it was not that the iterator
>> >>> >> got reset. The problem I can see now is that during the syncing, a
>> >>> >> new round of election kicked off and thus it needs to probe the
>> >>> >> newly added monitor; however, since it hasn't been synced yet, it
>> >>> >> will restart the syncing from there.
>> >>> >
>> >>> > What version is this? I think this is something we fixed a while back?
>> >>> This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578), is there
>> >>> a commit I can take a look at?
>> >>
>> >> Hrm, I guess it was way before that.. I'm thinking of
>> >> b8af38b6fc161691d637631d9ce8ab84fb3d27c7 which was pre-firefly. So I'm
>> >> not sure exactly why an election would be restarting the sync in your
>> >> case..
>> >>
>> >> You mentioned elsewhere that your mon store was very large, though (more
>> >> than 10's of GB), which suggests you might be hitting the
>> >> min_last_epoch_clean problem (which prevents osdmap trimming).. see
>> >> b41408302b6529a7856a3b0a08c35e5fa284882e. This was backported to hammer
>> >> and firefly but not giant.
>> >>
>> >> sage
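The "timeout for a single round trip, with a reset mechanism on both sides" behaviour can be sketched as a deadline that is rearmed on each received chunk. This is an illustrative Python sketch (the class and names are hypothetical, not the actual mon code); only one slow chunk, such as a long leveldb range scan, can trip the timer, not the total sync duration:

```python
import time

class SyncTimeout:
    """Sketch of a per-round-trip sync timeout: the deadline is reset on
    every chunk, so the timer measures the slowest single chunk rather
    than the whole sync."""
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.reset()

    def reset(self):
        # Rearm the deadline; both sides do this on each round trip.
        self.deadline = time.monotonic() + self.timeout_s

    def expired(self):
        return time.monotonic() > self.deadline

t = SyncTimeout(timeout_s=60.0)      # 60s, the default mentioned in the thread
timed_out = False
for chunk in range(10):              # ten fast chunks never expire the timer
    if t.expired():
        timed_out = True
        break
    t.reset()                        # receiving a chunk resets the clock
```

This is why raising the timeout helps with leveldb tail latency: the value only needs to cover the worst single range scan, not the store size.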
Re: Newly added monitor infinitely sync store
I spoke to a leveldb expert; it looks like this is a known pattern with the LSM-tree data structure: the tail latency for a range scan can be far longer than the avg/median, since it might need to mmap several sst files to get the record.

Hi Sage,
Do you see any harm in increasing the default value for this setting (e.g. to 20 minutes)? Or should I add the advice to the monitor troubleshooting docs?

Thanks,
Guang

On Fri, Nov 13, 2015 at 9:07 PM, Guang Yang wrote:
> Thanks Sage! I will definitely try those patches.
>
> For this one, I finally managed to bring the new monitor in by
> increasing the mon_sync_timeout from its default 60 to 6 to make
> sure the syncing does not restart and result in an infinite loop..
>
> On Fri, Nov 13, 2015 at 5:04 PM, Sage Weil wrote:
>> On Fri, 13 Nov 2015, Guang Yang wrote:
>>> Thanks Sage!
>>>
>>> On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil wrote:
>>> > On Fri, 13 Nov 2015, Guang Yang wrote:
>>> >> I was wrong in my previous analysis; it was not that the iterator
>>> >> got reset. The problem I can see now is that during the syncing, a
>>> >> new round of election kicked off and thus it needs to probe the
>>> >> newly added monitor; however, since it hasn't been synced yet, it
>>> >> will restart the syncing from there.
>>> >
>>> > What version is this? I think this is something we fixed a while back?
>>> This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578), is there
>>> a commit I can take a look at?
>>
>> Hrm, I guess it was way before that.. I'm thinking of
>> b8af38b6fc161691d637631d9ce8ab84fb3d27c7 which was pre-firefly. So I'm
>> not sure exactly why an election would be restarting the sync in your
>> case..
>>
>> You mentioned elsewhere that your mon store was very large, though (more
>> than 10's of GB), which suggests you might be hitting the
>> min_last_epoch_clean problem (which prevents osdmap trimming).. see
>> b41408302b6529a7856a3b0a08c35e5fa284882e. This was backported to hammer
>> and firefly but not giant.
>>
>> sage
Re: Newly added monitor infinitely sync store
Thanks Sage! I will definitely try those patches.

For this one, I finally managed to bring the new monitor in by increasing the mon_sync_timeout from its default 60 to 6 to make sure the syncing does not restart and result in an infinite loop..

On Fri, Nov 13, 2015 at 5:04 PM, Sage Weil wrote:
> On Fri, 13 Nov 2015, Guang Yang wrote:
>> Thanks Sage!
>>
>> On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil wrote:
>> > On Fri, 13 Nov 2015, Guang Yang wrote:
>> >> I was wrong in my previous analysis; it was not that the iterator
>> >> got reset. The problem I can see now is that during the syncing, a
>> >> new round of election kicked off and thus it needs to probe the
>> >> newly added monitor; however, since it hasn't been synced yet, it
>> >> will restart the syncing from there.
>> >
>> > What version is this? I think this is something we fixed a while back?
>> This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578), is there
>> a commit I can take a look at?
>
> Hrm, I guess it was way before that.. I'm thinking of
> b8af38b6fc161691d637631d9ce8ab84fb3d27c7 which was pre-firefly. So I'm
> not sure exactly why an election would be restarting the sync in your
> case..
>
> You mentioned elsewhere that your mon store was very large, though (more
> than 10's of GB), which suggests you might be hitting the
> min_last_epoch_clean problem (which prevents osdmap trimming).. see
> b41408302b6529a7856a3b0a08c35e5fa284882e. This was backported to hammer
> and firefly but not giant.
>
> sage
Re: Newly added monitor infinitely sync store
Thanks Sage!

On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil wrote:
> On Fri, 13 Nov 2015, Guang Yang wrote:
>> I was wrong in my previous analysis; it was not that the iterator got
>> reset. The problem I can see now is that during the syncing, a new
>> round of election kicked off and thus it needs to probe the newly added
>> monitor; however, since it hasn't been synced yet, it will restart the
>> syncing from there.
>
> What version is this? I think this is something we fixed a while back?

This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578), is there a commit I can take a look at?

>> Hi Sage and Joao,
>> Is there a way to freeze the election by some tunable to let the sync finish?
>
> We can't not do elections when something is asking for one (e.g., mon
> is down).

I see. Is there an operational workaround we could try? From the log, I found the election was triggered by an accept timeout, so I increased the timeout value to hopefully suppress elections during syncing; does that sound like a workaround?

> sage
>
>> Thanks,
>> Guang
>>
>> On Fri, Nov 13, 2015 at 9:00 AM, Guang Yang wrote:
>> > Hi Joao,
>> > We have a problem when trying to add new monitors to an unhealthy
>> > cluster, on which I would like to ask for your suggestion.
>> >
>> > After adding the new monitor, it started syncing the store and went
>> > into an infinite loop:
>> >
>> > 2015-11-12 21:02:23.499510 7f1e8030e700 10 mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk cookie 4513071120 lc 14697737 bl 929616 bytes last_key osdmap,full_22530) v2
>> > 2015-11-12 21:02:23.712944 7f1e8030e700 10 mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk cookie 4513071120 lc 14697737 bl 799897 bytes last_key osdmap,full_3259) v2
>> >
>> > We talked early in the morning on IRC, and at the time I thought it
>> > was because the osdmap epoch was increasing, which led to this
>> > infinite loop.
>> >
>> > I then set the nobackfill/norecovery flags and the osdmap epoch
>> > froze; however, the problem is still there.
>> >
>> > While the osdmap epoch is 22531, the switch always happened at
>> > osdmap.full_22530 (as shown by the above log).
>> >
>> > Looking at the code on both sides, it looks like this check
>> > (https://github.com/ceph/ceph/blob/master/src/mon/Monitor.cc#L1389)
>> > is always true, and I can confirm from the log that (sp.last_committed <
>> > paxos->get_version()) was false, so chances are that sp.synchronizer
>> > always has a next chunk?
>> >
>> > Does this look familiar to you? Or any other troubleshooting I can
>> > try? Thanks very much.
>> >
>> > Thanks,
>> > Guang
Newly added monitor infinitely sync store
I was wrong in my previous analysis; it was not that the iterator got reset. The problem I can see now is that during the syncing, a new round of election kicked off and thus it needs to probe the newly added monitor; however, since it hasn't been synced yet, it will restart the syncing from there.

Hi Sage and Joao,
Is there a way to freeze the election by some tunable to let the sync finish?

Thanks,
Guang

On Fri, Nov 13, 2015 at 9:00 AM, Guang Yang wrote:
> Hi Joao,
> We have a problem when trying to add new monitors to an unhealthy
> cluster, on which I would like to ask for your suggestion.
>
> After adding the new monitor, it started syncing the store and went
> into an infinite loop:
>
> 2015-11-12 21:02:23.499510 7f1e8030e700 10 mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk cookie 4513071120 lc 14697737 bl 929616 bytes last_key osdmap,full_22530) v2
> 2015-11-12 21:02:23.712944 7f1e8030e700 10 mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk cookie 4513071120 lc 14697737 bl 799897 bytes last_key osdmap,full_3259) v2
>
> We talked early in the morning on IRC, and at the time I thought it
> was because the osdmap epoch was increasing, which led to this
> infinite loop.
>
> I then set the nobackfill/norecovery flags and the osdmap epoch
> froze; however, the problem is still there.
>
> While the osdmap epoch is 22531, the switch always happened at
> osdmap.full_22530 (as shown by the above log).
>
> Looking at the code on both sides, it looks like this check
> (https://github.com/ceph/ceph/blob/master/src/mon/Monitor.cc#L1389)
> is always true, and I can confirm from the log that (sp.last_committed <
> paxos->get_version()) was false, so chances are that sp.synchronizer
> always has a next chunk?
>
> Does this look familiar to you? Or any other troubleshooting I can
> try? Thanks very much.
>
> Thanks,
> Guang
[no subject]
Hi Joao,
We have a problem when trying to add new monitors to an unhealthy cluster, on which I would like to ask for your suggestion.

After adding the new monitor, it started syncing the store and went into an infinite loop:

2015-11-12 21:02:23.499510 7f1e8030e700 10 mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk cookie 4513071120 lc 14697737 bl 929616 bytes last_key osdmap,full_22530) v2
2015-11-12 21:02:23.712944 7f1e8030e700 10 mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk cookie 4513071120 lc 14697737 bl 799897 bytes last_key osdmap,full_3259) v2

We talked early in the morning on IRC, and at the time I thought it was because the osdmap epoch was increasing, which led to this infinite loop.

I then set the nobackfill/norecovery flags and the osdmap epoch froze; however, the problem is still there.

While the osdmap epoch is 22531, the switch always happened at osdmap.full_22530 (as shown by the above log).

Looking at the code on both sides, it looks like this check (https://github.com/ceph/ceph/blob/master/src/mon/Monitor.cc#L1389) is always true, and I can confirm from the log that (sp.last_committed < paxos->get_version()) was false, so chances are that sp.synchronizer always has a next chunk?

Does this look familiar to you? Or any other troubleshooting I can try? Thanks very much.

Thanks,
Guang
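The provider-side check referenced above can be paraphrased as a two-clause condition (the names below are paraphrased from the thread, not the real classes). This sketch shows why freezing the osdmap epoch only rules out one of the two clauses:

```python
# Sync continues while either the peer is behind paxos or the store
# iterator still has chunks to send.
def sync_has_more(sp_last_committed, paxos_version, synchronizer_has_next):
    return sp_last_committed < paxos_version or synchronizer_has_next

# With nobackfill/norecovery set, paxos stops advancing (lc 14697737 in
# the log above), so the first clause is false; the sync can still loop
# forever if the store iterator keeps producing chunks, e.g. wrapping
# from osdmap full_22530 back to full_3259.
still_syncing = sync_has_more(14697737, 14697737, synchronizer_has_next=True)
done = sync_has_more(14697737, 14697737, synchronizer_has_next=False)
```

So the observation that (sp.last_committed < paxos->get_version()) was false narrows the suspect to the synchronizer's has-next-chunk condition.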
Re: Symbolic links like feature on radosgw
Hi Yehuda,
We have a user requirement that needs a symbolic-link-like feature on radosgw: two object ids pointing to the same object (ideally across buckets, but same bucket is fine). The closest feature on Amazon S3 I could find is [1], but it is not exactly the same; the one from the Amazon S3 API was designed for static website hosting.

Is this a valid feature request we could put into radosgw? The way I am thinking of implementing it is like a symbolic link: the link object just contains a pointer to the original object.

[1] http://docs.aws.amazon.com/AmazonS3/latest/dev/how-to-page-redirect.html

--
Regards,
Guang
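The proposed semantics can be sketched in a few lines. This is a rough illustrative model only (all names here are hypothetical; radosgw has no such feature as described): the link object stores just the target key, and reads follow the pointer one level:

```python
class Bucket:
    """Toy model of a bucket supporting symlink-like objects."""
    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = {'data': data}

    def put_link(self, key, target):
        # The "symbolic link": no data, just a pointer to the original key.
        self.objects[key] = {'link': target}

    def get(self, key):
        obj = self.objects[key]
        if 'link' in obj:
            obj = self.objects[obj['link']]   # one level of indirection
        return obj['data']

bucket = Bucket()
bucket.put('photos/cat.jpg', b'...jpeg bytes...')
bucket.put_link('latest.jpg', 'photos/cat.jpg')
```

One open design question this raises is consistency: whether the link should dangle (symlink-style) or pin the target (hardlink-style) when the original object is deleted or overwritten.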
Re: Weekly performance meeting
On Sep 26, 2014, at 9:12 PM, Mark Nelson wrote:
> On 09/25/2014 09:47 PM, Guang Yang wrote:
>> Hi Sage,
>> We are very interested to join (and contribute effort) as well. Following
>> is a list of issues in which we have particular interest:
>> 1> Large numbers of small files bring performance degradation, mostly due
>> to file system lookups (even worse with EC).
>
> Have you tried decreasing vfs_cache_pressure to retain dentries and inodes
> in cache? I've had good luck improving performance for medium sized IO
> workloads doing this.

Yeah, we changed the setting from its default value of 100 to 20 and it turned out to improve dentry/inode caching (we also tried setting it to 1 but got OOM under some traffic patterns). Even with the setting change, given that the object size is several hundred KB, we still observed lookup misses which increase latency; this became worse when we turned to EC because: 1) there are more files on each system, and 2) the long tail determines the latency.

>> 2> Messenger uses too many threads, which burdens high density
>> hardware (where I believe Haomai has already made great progress).
>
> Yes, the biggest thing on my personal wish list has been to move to a
> hybrid threading/event processing model.
>
>> Thanks,
>> Guang
>>
>> On Sep 26, 2014, at 2:27 AM, Sage Weil wrote:
>>
>>> Hi everyone,
>>>
>>> A number of people have approached me about how to get more involved with
>>> the current work on improving performance and how to better coordinate
>>> with other interested parties. A few meetings have taken place offline
>>> with good results but only a few interested parties were involved.
>>>
>>> Ideally, we'd like to move as much of this discussion into the public
>>> forums: ceph-devel@vger.kernel.org and #ceph-devel. That isn't always
>>> sufficient, however. I'd like to also set up a regular weekly meeting
>>> using google hangouts or bluejeans so that all interested parties can
>>> share progress. There are a lot of things we can do during the Hammer
>>> cycle to improve things but it will require some coordination of effort.
>>>
>>> Among other things, we can discuss:
>>>
>>> - observed performance limitations
>>> - high level strategies for addressing them
>>> - proposed patch sets and their performance impact
>>> - anything else that will move us forward
>>>
>>> One challenge is timezones: there are developers in the US, China, Europe,
>>> and Israel who may want to join. As a starting point, how about next
>>> Wednesday, 15:00 UTC? If I didn't do my tz math wrong, that's
>>>
>>> 8:00 (PDT, California)
>>> 15:00 (UTC)
>>> 18:00 (IDT, Israel)
>>> 23:00 (CST, China)
>>>
>>> That is surely not the ideal time for everyone but it can hopefully be a
>>> starting point.
>>>
>>> I've also created an etherpad for collecting discussion/agenda items at
>>>
>>> http://pad.ceph.com/p/performance_weekly
>>>
>>> Is there interest here? Please let everyone know if you are actively
>>> working in this area and/or would like to join, and update the pad above
>>> with the topics you would like to discuss.
>>>
>>> Thanks!
>>> sage
Re: Weekly performance meeting
Hi Sage,
We are very interested to join (and contribute effort) as well. Following is a list of issues in which we have particular interest:
1> Large numbers of small files bring performance degradation, mostly due to file system lookups (even worse with EC).
2> Messenger uses too many threads, which burdens high density hardware (where I believe Haomai has already made great progress).

Thanks,
Guang

On Sep 26, 2014, at 2:27 AM, Sage Weil wrote:
> Hi everyone,
>
> A number of people have approached me about how to get more involved with
> the current work on improving performance and how to better coordinate
> with other interested parties. A few meetings have taken place offline
> with good results but only a few interested parties were involved.
>
> Ideally, we'd like to move as much of this discussion into the public
> forums: ceph-devel@vger.kernel.org and #ceph-devel. That isn't always
> sufficient, however. I'd like to also set up a regular weekly meeting
> using google hangouts or bluejeans so that all interested parties can
> share progress. There are a lot of things we can do during the Hammer
> cycle to improve things but it will require some coordination of effort.
>
> Among other things, we can discuss:
>
> - observed performance limitations
> - high level strategies for addressing them
> - proposed patch sets and their performance impact
> - anything else that will move us forward
>
> One challenge is timezones: there are developers in the US, China, Europe,
> and Israel who may want to join. As a starting point, how about next
> Wednesday, 15:00 UTC? If I didn't do my tz math wrong, that's
>
> 8:00 (PDT, California)
> 15:00 (UTC)
> 18:00 (IDT, Israel)
> 23:00 (CST, China)
>
> That is surely not the ideal time for everyone but it can hopefully be a
> starting point.
>
> I've also created an etherpad for collecting discussion/agenda items at
>
> http://pad.ceph.com/p/performance_weekly
>
> Is there interest here? Please let everyone know if you are actively
> working in this area and/or would like to join, and update the pad above
> with the topics you would like to discuss.
>
> Thanks!
> sage
RGW threads hung - more logs
Hi Sage, Sam and Greg,
Regarding the radosgw hang issue we discussed today, I finally got some more logs showing that the reply message was received by radosgw, but failed to be dispatched because the dispatcher thread was hung. I put all the logs into the tracker: http://tracker.ceph.com/issues/9008

While the logs explain what we observed, I failed to find any clue as to why the dispatcher would need to wait for objecter_bytes throttler budget; did I miss anything obvious here?

Thanks,
Guang
Issue - 8907
Hi Loic,
Can you help take a quick look at this issue: http://tracker.ceph.com/issues/8907? Was it a design choice due to consistency concerns?

Thanks,
Guang
Re: OSD suicide after being down/in for one day as it needs to search a large number of objects
Thanks Sage. We will provide a patch based on this. Thanks, Guang On Aug 20, 2014, at 11:19 PM, Sage Weil wrote: > On Wed, 20 Aug 2014, Guang Yang wrote: >> Thanks Greg. >> On Aug 20, 2014, at 6:09 AM, Gregory Farnum wrote: >> >>> On Mon, Aug 18, 2014 at 11:30 PM, Guang Yang wrote: >>>> Hi ceph-devel, >>>> David (cc?ed) reported a bug (http://tracker.ceph.com/issues/9128) which >>>> we came across in our test cluster during our failure testing, basically >>>> the way to reproduce it was to leave one OSD daemon down and in for a day, >>>> at the same time, keep giving write traffic. When the OSD daemon was >>>> started again, it hit suicide timeout and kill itself. >>>> >>>> After some analysis (details in the bug), David found that the op thread >>>> was busy searching for missing objects and once the volume to search >>>> increase, the thread is expected to work that long time, please refer to >>>> the bug for detailed logs. >>> >>> Can you talk a little more about what's going on here? At a quick >>> naive glance, I'm not seeing why leaving an OSD down and in should >>> require work based on the amount of write traffic. Perhaps if the rest >>> of the cluster was changing mappings?? >> We increased the down to out time interval from 5 minutes to 2 days to >> avoid migrating data back and forth which could increase latency, so >> that we target to mark OSD out manually. To achieve such, we are testing >> against some boundary cases to let the OSD down and in for like 1 day, >> however, when we try to bring it up again, it always failed due to hit >> the suicide timeout. > > Looking at the log snippet I see the PG had log range > > 5481'28667,5646'34066 > > Which is ~5500 log events. The default max is 10k. search_for_missing is > basically going to iterate over this list and check if the object is > present locally. 
> > If that's slow enough to trigger a suicide (which it seems to be), the > fix is simple: as Greg says we just need to make it poke the internal > heartbeat code to indicate progress. In most contexts this is done by > passing a ThreadPool::TPHandle &handle into each method and then > calling handle.reset_tp_timeout() on each iteration. The same needs to be > done for search_for_missing... > > sage
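The TPHandle pattern Sage describes can be sketched as follows. This is a minimal, self-contained sketch, not Ceph's actual code: `TPHandle` here is a hypothetical stand-in for `ThreadPool::TPHandle`, and the loop body is a placeholder for the real per-object missing check. The point is only that `reset_tp_timeout()` is called on every iteration, so a scan over thousands of log entries can never exceed the grace period between two resets:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Ceph's ThreadPool::TPHandle; the real handle
// pokes the internal heartbeat map so the suicide timer sees progress.
struct TPHandle {
  std::size_t resets = 0;
  void reset_tp_timeout() { ++resets; }
};

// Sketch of the fix: iterate the (possibly ~10k-entry) PG log and reset
// the timeout on every entry.
std::size_t search_for_missing(const std::vector<int>& log_entries,
                               TPHandle& handle) {
  std::size_t missing = 0;
  for (int entry : log_entries) {
    handle.reset_tp_timeout();  // tell the heartbeat code we are alive
    if (entry < 0)              // placeholder for "object not present locally"
      ++missing;
  }
  return missing;
}
```

Threading the handle through every method that can loop for a long time is the convention the rest of the OSD already follows.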
Re: OSD suicide after being down/in for one day as it needs to search large amount of objects
Thanks Greg. On Aug 20, 2014, at 6:09 AM, Gregory Farnum wrote: > On Mon, Aug 18, 2014 at 11:30 PM, Guang Yang wrote: >> Hi ceph-devel, >> David (cc’ed) reported a bug (http://tracker.ceph.com/issues/9128) which we >> came across in our test cluster during our failure testing, basically the >> way to reproduce it was to leave one OSD daemon down and in for a day, at >> the same time, keep giving write traffic. When the OSD daemon was started >> again, it hit suicide timeout and kill itself. >> >> After some analysis (details in the bug), David found that the op thread was >> busy searching for missing objects and once the volume to search increase, >> the thread is expected to work that long time, please refer to the bug for >> detailed logs. > > Can you talk a little more about what's going on here? At a quick > naive glance, I'm not seeing why leaving an OSD down and in should > require work based on the amount of write traffic. Perhaps if the rest > of the cluster was changing mappings…? We increased the down to out time interval from 5 minutes to 2 days to avoid migrating data back and forth which could increase latency, so that we target to mark OSD out manually. To achieve such, we are testing against some boundary cases to let the OSD down and in for like 1 day, however, when we try to bring it up again, it always failed due to hit the suicide timeout. > >> >> One simple fix is to let the op thread reset the suicide timeout >> periodically when it is doing long-time work, other fix might be to cut the >> work into smaller pieces? > > We do both of those things throughout the OSD (although I think the > first is simpler and more common); search for the accesses to > cct->get_heartbeat_map()->reset_timeout. 
> -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com
OSD suicide after being down/in for one day as it needs to search large amount of objects
Hi ceph-devel, David (cc’ed) reported a bug (http://tracker.ceph.com/issues/9128) which we came across in our test cluster during failure testing. The way to reproduce it was to leave one OSD daemon down and in for a day while keeping up write traffic; when the OSD daemon was started again, it hit the suicide timeout and killed itself. After some analysis (details in the bug), David found that the op thread was busy searching for missing objects, and once the volume to search increased, the thread was expected to work for that long; please refer to the bug for detailed logs. One simple fix is to let the op thread reset the suicide timeout periodically when it is doing long-running work; another fix might be to cut the work into smaller pieces. Any suggestion is welcome. Thanks, Guang
Re: assert failure
Hi Huamin, Then it might be a totally different issue than the one I mentioned below, please file a bug to http://tracker.ceph.com/ with more details (the log before the daemon crashed). Thanks, Guang On Aug 16, 2014, at 5:36 AM, Huamin Chen wrote: > Thanks. I was running a single node ceph fs cluster on a VM. Each time the VM > is created, it downloads the latest bits and runs unit tests. There are many > mount and unmount during the tests. > This issue can be reliably reproduced in one of these tests. > > The test info can be found > > > - Original Message - > From: "Guang Yang" > To: "Huamin Chen" > Cc: "Ceph-devel" > Sent: Friday, August 15, 2014 2:23:12 PM > Subject: Re: assert failure > > + ceph-devel. > > Hi Huamin, > Did you upgrade the entire cluster to v0.80.5? If I remember correctly, if > its peer has the old version, it could crash the new version as well. > > Thanks, > Guang > > On Aug 14, 2014, at 11:21 PM, Huamin Chen wrote: > >> Bad news, still there ... >> msg/Pipe.cc: In function 'int Pipe::connect()' thread 7f30c4511700 time >> 2014-08-14 15:16:44.659312 >> msg/Pipe.cc: 1080: FAILED assert(m) >> ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6) >> 1: (Pipe::connect()+0x3d0c) [0x7f327552a2ac] >> 2: (Pipe::writer()+0x9f3) [0x7f327552aff3] >> 3: (Pipe::Writer::entry()+0xd) [0x7f327553748d] >> 4: (()+0x79d1) [0x7f32953449d1] >> 5: (clone()+0x6d) [0x7f3294c89b5d] >> NOTE: a copy of the executable, or `objdump -rdS ` is needed to >> interpret this. >> terminate called after throwing an instance of 'ceph::FailedAssertion' >> >> Attached please find all related logs >> >> - Original Message - >> From: "Guang Yang" >> To: "Huamin Chen" >> Cc: ceph-devel@vger.kernel.org >> Sent: Wednesday, August 13, 2014 10:39:10 PM >> Subject: Re: assert failure >> >> Hi Huamin, >> At least one known issue in 0.80.1 with the same failing pattern has been >> fixed in the latest 0.80.4 release of firefly. 
Here is the tracking ticket - >> http://tracker.ceph.com/issues/8232. >> >> Can you compare the log snippets from within the bug and see if they are the >> same issue? >> >> Thanks, >> Guang >> >> On Aug 14, 2014, at 4:29 AM, Huamin Chen wrote: >> >>> Is the following assert failure an known issue? >>> >>> msg/Pipe.cc: In function 'int Pipe::connect()' thread 7fed3d2dd700 time >>> 2014-08-13 16:26:06.039799 >>> msg/Pipe.cc: 1070: FAILED assert(m) >>> ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74) >>> 1: (Pipe::connect()+0x390e) [0x7feee89cf99e] >>> 2: (Pipe::writer()+0x511) [0x7feee89d0fd1] >>> 3: (Pipe::Writer::entry()+0xd) [0x7feee89d5d0d] >>> 4: (()+0x7df3) [0x7fef336cadf3] >>> 5: (clone()+0x6d) [0x7fef32fe63dd] >>> NOTE: a copy of the executable, or `objdump -rdS ` is needed to >>> interpret this.
Re: assert failure
+ ceph-devel. Hi Huamin, Did you upgrade the entire cluster to v0.80.5? If I remember correctly, if its peer has the old version, it could crash the new version as well. Thanks, Guang On Aug 14, 2014, at 11:21 PM, Huamin Chen wrote: > Bad news, still there ... > msg/Pipe.cc: In function 'int Pipe::connect()' thread 7f30c4511700 time > 2014-08-14 15:16:44.659312 > msg/Pipe.cc: 1080: FAILED assert(m) > ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6) > 1: (Pipe::connect()+0x3d0c) [0x7f327552a2ac] > 2: (Pipe::writer()+0x9f3) [0x7f327552aff3] > 3: (Pipe::Writer::entry()+0xd) [0x7f327553748d] > 4: (()+0x79d1) [0x7f32953449d1] > 5: (clone()+0x6d) [0x7f3294c89b5d] > NOTE: a copy of the executable, or `objdump -rdS ` is needed to > interpret this. > terminate called after throwing an instance of 'ceph::FailedAssertion' > > Attached please find all related logs > > - Original Message - > From: "Guang Yang" > To: "Huamin Chen" > Cc: ceph-devel@vger.kernel.org > Sent: Wednesday, August 13, 2014 10:39:10 PM > Subject: Re: assert failure > > Hi Huamin, > At least one known issue in 0.80.1 with the same failing pattern has been > fixed in the latest 0.80.4 release of firefly. Here is the tracking ticket - > http://tracker.ceph.com/issues/8232. > > Can you compare the log snippets from within the bug and see if they are the > same issue? > > Thanks, > Guang > > On Aug 14, 2014, at 4:29 AM, Huamin Chen wrote: > >> Is the following assert failure an known issue? 
>> >> msg/Pipe.cc: In function 'int Pipe::connect()' thread 7fed3d2dd700 time >> 2014-08-13 16:26:06.039799 >> msg/Pipe.cc: 1070: FAILED assert(m) >> ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74) >> 1: (Pipe::connect()+0x390e) [0x7feee89cf99e] >> 2: (Pipe::writer()+0x511) [0x7feee89d0fd1] >> 3: (Pipe::Writer::entry()+0xd) [0x7feee89d5d0d] >> 4: (()+0x7df3) [0x7fef336cadf3] >> 5: (clone()+0x6d) [0x7fef32fe63dd] >> NOTE: a copy of the executable, or `objdump -rdS ` is needed to >> interpret this.
OSD disk replacement best practise
Hi cephers, I am drafting the run books for OSD disk replacement. I think the rule of thumb is to reduce data migration (recovery/backfill), and the following procedure should achieve that purpose: 1. ceph osd out osd.XXX (mark it out to trigger data migration) 2. ceph osd rm osd.XXX 3. ceph auth rm osd.XXX 4. provision a new OSD which will take XXX as the OSD id and migrate data back. With the above procedure, the crush weight of the host never changes, so we can limit the data migration to only what is necessary. Does it make sense? Thanks, Guang
Re: assert failure
Hi Huamin, At least one known issue in 0.80.1 with the same failing pattern has been fixed in the latest 0.80.4 release of firefly. Here is the tracking ticket - http://tracker.ceph.com/issues/8232. Can you compare the log snippets from within the bug and see if they are the same issue? Thanks, Guang On Aug 14, 2014, at 4:29 AM, Huamin Chen wrote: > Is the following assert failure an known issue? > > msg/Pipe.cc: In function 'int Pipe::connect()' thread 7fed3d2dd700 time > 2014-08-13 16:26:06.039799 > msg/Pipe.cc: 1070: FAILED assert(m) > ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74) > 1: (Pipe::connect()+0x390e) [0x7feee89cf99e] > 2: (Pipe::writer()+0x511) [0x7feee89d0fd1] > 3: (Pipe::Writer::entry()+0xd) [0x7feee89d5d0d] > 4: (()+0x7df3) [0x7fef336cadf3] > 5: (clone()+0x6d) [0x7fef32fe63dd] > NOTE: a copy of the executable, or `objdump -rdS ` is needed to > interpret this.
Re: bucket index sharding - IO throttle
Hi Yehuda, Can you help to review the latest patch with throttle mechanism you suggested. Thanks! Thanks, Guang On Aug 4, 2014, at 3:20 PM, Guang Yang wrote: > Hi Yehuda, > Here is the new pull request - https://github.com/ceph/ceph/pull/2187 > > Thanks, > Guang > On Jul 31, 2014, at 10:40 PM, Guang Yang wrote: > >> Thanks Yehuda. I will do that (sorry I was occupied by some other stuff >> recently but I will try my best to provide a patch as soon as possible). >> >> Thanks, >> Guang >> >> 在 2014年7月31日,上午1:00,Yehuda Sadeh 写道: >> >>> Can you send this code through a github pull request (or at least as a >>> patch)? It'lll be easier to review and comment. >>> >>> Thanks, >>> Yehuda >>> >>> On Wed, Jul 30, 2014 at 7:58 AM, Guang Yang wrote: >>>> +ceph-devel. >>>> >>>> Thanks, >>>> Guang >>>> >>>> On Jul 29, 2014, at 10:20 PM, Guang Yang wrote: >>>> >>>>> Hi Yehuda, >>>>> Per you review comment in terms of IO throttling for bucket index >>>>> operation, I prototyped the below code (details still need to polish), >>>>> can you take a look if that is right way to go? >>>>> >>>>> Another problem I came across is that >>>>> ClsBucketIndexOpCtx::handle_compeltion was not called for the bucket >>>>> index init op (below), is there anything I missed obviously here? 
>>>>> >>>>> Thanks, >>>>> Guang >>>>> >>>>> >>>>> class ClsBucketIndexAioThrottler { >>>>> protected: >>>>> int completed; >>>>> int ret_code; >>>>> IoCtx& io_ctx; >>>>> Mutex lock; >>>>> struct LockCond { >>>>> Mutex lock; >>>>> Cond cond; >>>>> LockCond() : lock("LockCond"), cond() {} >>>>> } lock_cond; >>>>> public: >>>>> ClsBucketIndexAioThrottler(IoCtx& _io_ctx) >>>>> : completed(0), ret_code(0), io_ctx(_io_ctx), >>>>> lock("ClsBucketIndexAioThrottler"), lock_cond() {} >>>>> >>>>> virtual ~ClsBucketIndexAioThrottler() {} >>>>> virtual void do_next() = 0; >>>>> virtual bool is_completed () = 0; >>>>> >>>>> void complete(int ret) { >>>>> { >>>>> Mutex::Locker l(lock); >>>>> if (ret < 0) >>>>> ret_code = ret; >>>>> ++completed; >>>>> } >>>>> >>>>> lock_cond.lock.Lock(); >>>>> lock_cond.cond.Signal(); >>>>> lock_cond.lock.Unlock(); >>>>> } >>>>> >>>>> int get_ret_code () { >>>>> Mutex::Locker l(lock); >>>>> return ret_code; >>>>> } >>>>> >>>>> virtual int wait_completion() { >>>>> lock_cond.lock.Lock(); >>>>> while (1) { >>>>> if (is_completed()) { >>>>> lock_cond.lock.Unlock(); >>>>> return ret_code; >>>>> } >>>>> lock_cond.cond.Wait(lock_cond.lock); >>>>> lock_cond.lock.Lock(); >>>>> } >>>>> } >>>>> }; >>>>> >>>>> class ClsBucketIndexListAioThrottler : public ClsBucketIndexAioThrottler { >>>>> protected: >>>>> vector bucket_objects; >>>>> vector::iterator iter_pos; >>>>> public: >>>>> ClsBucketIndexListAioThrottler(IoCtx& _io_ctx, const vector >>>>> _bucket_objs) >>>>> : ClsBucketIndexAioThrottler(_io_ctx), bucket_objects(_bucket_objs), >>>>> iter_pos(bucket_objects.begin()) {} >>>>> >>>>> virtual bool is_completed() { >>>>> Mutex::Locker l(lock); >>>>> int sent = 0; >>>>> vector::iterator iter = bucket_objects.begin(); >>>>> for (; iter != iter_pos; ++iter) ++sent; >>>>> >>>>> return (sent == completed && >>>>> (iter_pos == bucket_objects.end() /*Success*/ || ret_code < 0 >>>>> /*Failure*/)); >>>>> } >>>>> }; >>>>> >>>>> template >>>>> class ClsBucketIndexOpCtx 
: public ObjectOperationCompletion {
Re: bucket index sharding - IO throttle
Hi Osier, I doubt the issue is related (the error message is connection failure), the below patch is pretty simple (and incomplete), what it does is to add a configuration to bucket meta info so that we can configure the number of shards on bucket basis (again, this is not included in the patch). The patch should be completely backward compatible which means if you don’t change the number of shards configuration, nothing should be changed for bucket creation/listing. My plan is to use this patch as a starting point to review, as the key building blocks are included in the patch and once it is passed review, I will create a bunch of following patches to implement the feature completely (which are mostly done in the previous big patch - https://github.com/ceph/ceph/pull/2013). I tested the patch locally in my cluster and it looks good for bucket creation. Thanks, Guang On Aug 6, 2014, at 12:38 PM, Osier Yang wrote: > > On 2014年08月04日 15:20, Guang Yang wrote: >> Hi Yehuda, >> Here is the new pull request - https://github.com/ceph/ceph/pull/2187 > > I simply applied the patch on git top, and the testing shows > "rest-bench" is completely > broken with the 2 patches: > > > root@testing-s3gw0:~/s3-tests# /usr/bin/rest-bench > --api-host=testing-s3gw0 --access-key=93EEF3F5O7VY89Q2GSWC > --secret="lf2bwxiRf1e9/nrOTCZyN/HgTqCz7XwrB2LDocY1" --protocol=http > --uri_style=path --bucket=cool0 --seconds=20 --concurrent-ios=50 > --block-size=204800 --show-time write > host=testing-s3gw0 > 2014-08-06 12:28:56.500235 7f1336645780 -1 did not load config file, > using default settings. 
> ERROR: failed to create bucket: ConnectionFailed > failed initializing benchmark > > The related debug log entry: > > 2014-08-06 12:29:48.137559 7fea62fcd700 20 state for > obj=.rgw:.bucket.meta.rest-bench-bucket:default.9738.2 is not atomic, > not appending atomic test > > After a short time, all the memory was eaten up: > > root@testing-s3gw0:~/s3-tests# /usr/bin/rest-bench > --api-host=testing-s3gw0 --access-key=93EEF3F5O7VY89Q2GSWC > --secret="lf2bwxiRf1e9/nrOTCZyN/HgTqCz7XwrB2LDocY1" --protocol=http > --uri_style=path --seconds=20 --concurrent-ios=50 --block-size=204800 > --show-time write > -bash: fork: Cannot allocate memory > root@testing-s3gw0:~/s3-tests# /usr/bin/rest-bench > --api-host=testing-s3gw0 --access-key=93EEF3F5O7VY89Q2GSWC > --secret="lf2bwxiRf1e9/nrOTCZyN/HgTqCz7XwrB2LDocY1" --protocol=http > --uri_style=path --seconds=20 --concurrent-ios=50 --block-size=204800 > --show-time write > -bash: fork: Cannot allocate memory > root@testing-s3gw0:~/s3-tests# free > -bash: fork: Cannot allocate memory > > A few mins later, the VM is completely unresponsible. And I had to > destroy it and restart again. > > Guang, how was your testing when creating the patches? > >> >> >> Thanks, >> Guang >> On Jul 31, 2014, at 10:40 PM, Guang Yang wrote: >> >>> Thanks Yehuda. I will do that (sorry I was occupied by some other stuff >>> recently but I will try my best to provide a patch as soon as possible). >>> >>> Thanks, >>> Guang >>> >>> 在 2014年7月31日,上午1:00,Yehuda Sadeh 写道: >>> >>>> Can you send this code through a github pull request (or at least as a >>>> patch)? It'lll be easier to review and comment. >>>> >>>> Thanks, >>>> Yehuda >>>> >>>> On Wed, Jul 30, 2014 at 7:58 AM, Guang Yang wrote: >>>>> +ceph-devel. 
>>>>> >>>>> Thanks, >>>>> Guang >>>>> >>>>> On Jul 29, 2014, at 10:20 PM, Guang Yang wrote: >>>>> >>>>>> Hi Yehuda, >>>>>> Per you review comment in terms of IO throttling for bucket index >>>>>> operation, I prototyped the below code (details still need to polish), >>>>>> can you take a look if that is right way to go? >>>>>> >>>>>> Another problem I came across is that >>>>>> ClsBucketIndexOpCtx::handle_compeltion was not called for the bucket >>>>>> index init op (below), is there anything I missed obviously here? >>>>>> >>>>>> Thanks, >>>>>> Guang >>>>>> >>>>>> >>>>>> class ClsBucketIndexAioThrottler { >>>>>> protected: >>>>>> int completed; >>>>>> int ret_code; >>>>>> IoCtx& io_ctx; >>>>>> Mutex lock; >>>>>> struct LockCond {
Re: KeyFileStore ?
在 2014年8月2日,上午5:34,Samuel Just 写道: > Sage's basic approach sounds about right to me. I'm fairly skeptical > about the benefits of packing small objects together within larger > files, though. It seems like for very small objects, we would be > better off stashing the contents opportunistically within the onode. I really like this idea, for radosgw + EC use case, there are lots of small physical files generated (multiple Kbs), and when the OSD disk is filled to a certain ratio, each read to one chunk could incur several disk I/Os (path lookup and data), and putting the data as part of onode could boost the read performance and as the same time, decrease the number of physical files. > For somewhat larger objects, it seems like the complexity of > maintaining information about the larger pack objects would be > equivalent to the what the filesystem would do anyway. > -Sam > > On Fri, Aug 1, 2014 at 8:08 AM, Guang Yang wrote: >> I really like the idea, one scenario keeps bothering us is that there are >> too many small files which make the file system indexing slow (so that a >> single read request could take more than 10 disk IOs for path lookup). >> >> If we pursuit this proposal, is there a chance we can take one step further, >> that instead of storing one physical file for each object, we can allocate a >> big file (tens of GB) and each object only map to a chunk within that big >> file. So that all those big file’s description could be cached to avoid disk >> I/O to open the file. At least we keep it flexible that if someone would >> like to implement in such way, there is a chance to leverage the existing >> framework. >> >> Thanks, >> Guang >> >> On Jul 31, 2014, at 1:25 PM, Sage Weil wrote: >> >>> After the latest set of bug fixes to the FileStore file naming code I am >>> newly inspired to replace it with something less complex. Right now I'm >>> mostly thinking about HDDs, although some of this may map well onto hybrid >>> SSD/HDD as well. 
It may or may not make sense for pure flash. >>> >>> Anyway, here are the main flaws with the overall approach that FileStore >>> uses: >>> >>> - It tries to maintain a direct mapping of object names to file names. >>> This is problematic because of 255 character limits, rados namespaces, pg >>> prefixes, and the pg directory hashing we do to allow efficient split, for >>> starters. It is also problematic because we often want to do things like >>> rename but can't make it happen atomically in combination with the rest of >>> our transaction. >>> >>> - The PG directory hashing (that we do to allow efficient split) can have >>> a big impact on performance, particularly when injesting lots of data. >>> (And when benchmarking.) It's also complex. >>> >>> - We often overwrite or replace entire objects. These are "easy" >>> operations to do safely without doing complete data journaling, but the >>> current design is not conducive to doing anything clever (and it's complex >>> enough that I wouldn't want to add any cleverness on top). >>> >>> - Objects may contain only key/value data, but we still have to create an >>> inode for them and look that up first. This only matters for some >>> workloads (rgw indexes, cephfs directory objects). >>> >>> Instead, I think we should try a hybrid approach that more heavily >>> leverages a key/value db in combination with the file system. The kv db >>> might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just >>> assume it provides transactional key/value storage and efficient range >>> operations. Here's the basic idea: >>> >>> - The mapping from names to object lives in the kv db. The object >>> metadata is in a structure we can call an "onode" to avoid confusing it >>> with the inodes in the backing file system. The mapping is simple >>> ghobject_t -> onode map; there is no PG collection. The PG collection >>> still exist but really only as ranges of those keys. 
We will need to be >>> slightly clever with the coll_t to distinguish between "bare" PGs (that >>> live in this flat mapping) and the other collections (*_temp and >>> metadata), but that should be easy. This makes PG splitting "free" as far >>> as the objects go. >>> >>> - The onodes are relatively small. They will contain the xattrs and >>> basic metadata like object size. They will also iden
Re: bucket index sharding - IO throttle
Hi Yehuda, Here is the new pull request - https://github.com/ceph/ceph/pull/2187 Thanks, Guang On Jul 31, 2014, at 10:40 PM, Guang Yang wrote: > Thanks Yehuda. I will do that (sorry I was occupied by some other stuff > recently but I will try my best to provide a patch as soon as possible). > > Thanks, > Guang > > 在 2014年7月31日,上午1:00,Yehuda Sadeh 写道: > >> Can you send this code through a github pull request (or at least as a >> patch)? It'lll be easier to review and comment. >> >> Thanks, >> Yehuda >> >> On Wed, Jul 30, 2014 at 7:58 AM, Guang Yang wrote: >>> +ceph-devel. >>> >>> Thanks, >>> Guang >>> >>> On Jul 29, 2014, at 10:20 PM, Guang Yang wrote: >>> >>>> Hi Yehuda, >>>> Per you review comment in terms of IO throttling for bucket index >>>> operation, I prototyped the below code (details still need to polish), can >>>> you take a look if that is right way to go? >>>> >>>> Another problem I came across is that >>>> ClsBucketIndexOpCtx::handle_compeltion was not called for the bucket index >>>> init op (below), is there anything I missed obviously here? 
>>>> >>>> Thanks, >>>> Guang >>>> >>>> >>>> class ClsBucketIndexAioThrottler { >>>> protected: >>>> int completed; >>>> int ret_code; >>>> IoCtx& io_ctx; >>>> Mutex lock; >>>> struct LockCond { >>>> Mutex lock; >>>> Cond cond; >>>> LockCond() : lock("LockCond"), cond() {} >>>> } lock_cond; >>>> public: >>>> ClsBucketIndexAioThrottler(IoCtx& _io_ctx) >>>> : completed(0), ret_code(0), io_ctx(_io_ctx), >>>> lock("ClsBucketIndexAioThrottler"), lock_cond() {} >>>> >>>> virtual ~ClsBucketIndexAioThrottler() {} >>>> virtual void do_next() = 0; >>>> virtual bool is_completed () = 0; >>>> >>>> void complete(int ret) { >>>> { >>>>Mutex::Locker l(lock); >>>>if (ret < 0) >>>> ret_code = ret; >>>>++completed; >>>> } >>>> >>>> lock_cond.lock.Lock(); >>>> lock_cond.cond.Signal(); >>>> lock_cond.lock.Unlock(); >>>> } >>>> >>>> int get_ret_code () { >>>> Mutex::Locker l(lock); >>>> return ret_code; >>>> } >>>> >>>> virtual int wait_completion() { >>>> lock_cond.lock.Lock(); >>>> while (1) { >>>>if (is_completed()) { >>>> lock_cond.lock.Unlock(); >>>> return ret_code; >>>>} >>>>lock_cond.cond.Wait(lock_cond.lock); >>>>lock_cond.lock.Lock(); >>>> } >>>> } >>>> }; >>>> >>>> class ClsBucketIndexListAioThrottler : public ClsBucketIndexAioThrottler { >>>> protected: >>>> vector bucket_objects; >>>> vector::iterator iter_pos; >>>> public: >>>> ClsBucketIndexListAioThrottler(IoCtx& _io_ctx, const vector >>>> _bucket_objs) >>>> : ClsBucketIndexAioThrottler(_io_ctx), bucket_objects(_bucket_objs), >>>> iter_pos(bucket_objects.begin()) {} >>>> >>>> virtual bool is_completed() { >>>> Mutex::Locker l(lock); >>>> int sent = 0; >>>> vector::iterator iter = bucket_objects.begin(); >>>> for (; iter != iter_pos; ++iter) ++sent; >>>> >>>> return (sent == completed && >>>> (iter_pos == bucket_objects.end() /*Success*/ || ret_code < 0 >>>> /*Failure*/)); >>>> } >>>> }; >>>> >>>> template >>>> class ClsBucketIndexOpCtx : public ObjectOperationCompletion { >>>> private: >>>> T* data; >>>> // Return code 
of the operation >>>> int* ret_code; >>>> >>>> // The Aio completion object associated with this Op, it should >>>> // be release from within the completion handler >>>> librados::AioCompletion* completion; >>>> ClsBucketIndexAioThrottler* throttler; >>>> public: >>>> ClsBucketIndexOpCtx(T* _data, int* _ret_code, librados::AioCompletion* >>>> _completion, >>>>ClsBucketIndexAioThrottler* _throttler) >&g
Re: KeyFileStore ?
I really like the idea, one scenario keeps bothering us is that there are too many small files which make the file system indexing slow (so that a single read request could take more than 10 disk IOs for path lookup). If we pursuit this proposal, is there a chance we can take one step further, that instead of storing one physical file for each object, we can allocate a big file (tens of GB) and each object only map to a chunk within that big file. So that all those big file’s description could be cached to avoid disk I/O to open the file. At least we keep it flexible that if someone would like to implement in such way, there is a chance to leverage the existing framework. Thanks, Guang On Jul 31, 2014, at 1:25 PM, Sage Weil wrote: > After the latest set of bug fixes to the FileStore file naming code I am > newly inspired to replace it with something less complex. Right now I'm > mostly thinking about HDDs, although some of this may map well onto hybrid > SSD/HDD as well. It may or may not make sense for pure flash. > > Anyway, here are the main flaws with the overall approach that FileStore > uses: > > - It tries to maintain a direct mapping of object names to file names. > This is problematic because of 255 character limits, rados namespaces, pg > prefixes, and the pg directory hashing we do to allow efficient split, for > starters. It is also problematic because we often want to do things like > rename but can't make it happen atomically in combination with the rest of > our transaction. > > - The PG directory hashing (that we do to allow efficient split) can have > a big impact on performance, particularly when injesting lots of data. > (And when benchmarking.) It's also complex. > > - We often overwrite or replace entire objects. These are "easy" > operations to do safely without doing complete data journaling, but the > current design is not conducive to doing anything clever (and it's complex > enough that I wouldn't want to add any cleverness on top). 
> > - Objects may contain only key/value data, but we still have to create an > inode for them and look that up first. This only matters for some > workloads (rgw indexes, cephfs directory objects). > > Instead, I think we should try a hybrid approach that more heavily > leverages a key/value db in combination with the file system. The kv db > might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just > assume it provides transactional key/value storage and efficient range > operations. Here's the basic idea: > > - The mapping from names to object lives in the kv db. The object > metadata is in a structure we can call an "onode" to avoid confusing it > with the inodes in the backing file system. The mapping is simple > ghobject_t -> onode map; there is no PG collection. The PG collection > still exist but really only as ranges of those keys. We will need to be > slightly clever with the coll_t to distinguish between "bare" PGs (that > live in this flat mapping) and the other collections (*_temp and > metadata), but that should be easy. This makes PG splitting "free" as far > as the objects go. > > - The onodes are relatively small. They will contain the xattrs and > basic metadata like object size. They will also identify the file name of > the backing file in the file system (if size > 0). > > - The backing file can be a random, short file name. We can just make a > one or two level deep set of directories, and let the directories get > reasonably big... whatever we decide the backing fs can handle > efficiently. We can also store a file handle in the onode and use the > open by handle API; this should let us go directly from onode (in our kv > db) to the on-disk inode without looking at the directory at all, and fall > back to using the actual file name only if that fails for some reason > (say, someone mucked around with the backing files). 
The backing file > need not have any xattrs on it at all (except perhaps some simple id to > verify it does it fact belong to the referring onode, just as a sanity > check). > > - The name -> onode mapping can live in a disjunct part of the kv > namespace so that the other kv stuff associated with the file (like omap > pairs or big xattrs or whatever) don't blow up those parts of the > db and slow down lookup. > > - We can keep a simple LRU of recent onodes in memory and avoid the kv > lookup for hot objects. > > - Previously complicated operations like rename are now trivial: we just > update the kv db with a transaction. The backing file never gets renamed, > ever, and the other object omap data is keyed by a unique (onode) id, not > the name. > > Initially, for simplicity, we can start with the existing data journaling > behavior. However, I think there are opportunities to improve the > situation there. There is a pending wip-transactions branch in which I > started to rejigger the ObjectStore::Trans
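The flat ghobject-to-onode mapping Sage describes can be sketched in a few lines. This is a toy model under stated assumptions: all names here are hypothetical, `std::map` stands in for the transactional kv db (leveldb/rocksdb/LMDB), and real keys would be an encoded ghobject_t rather than plain strings. The point is that a PG is just a key prefix, so listing it is a range scan and splitting it moves no object data:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Toy "onode": small metadata record living in the kv db, separate from
// the backing filesystem inode.
struct Onode {
  uint64_t size = 0;
  std::map<std::string, std::string> xattrs;
  std::string backing_file;  // short random file name; empty if size == 0
};

// Stand-in for the transactional kv store (leveldb/rocksdb/...).
using KvDb = std::map<std::string, Onode>;

// Key encoding: pg prefix + '/' + object name, so one PG is one
// contiguous key range in the flat namespace.
std::string make_key(const std::string& pg, const std::string& oid) {
  return pg + "/" + oid;
}

// "Listing a PG" is a range scan over its prefix; no per-PG directory
// hashing is needed, and a split is just a new partition of the range.
std::size_t count_pg_objects(const KvDb& db, const std::string& pg) {
  const std::string lo = pg + "/";
  const std::string hi = pg + "0";  // '0' sorts immediately after '/'
  std::size_t n = 0;
  for (auto it = db.lower_bound(lo); it != db.end() && it->first < hi; ++it)
    ++n;
  return n;
}
```

Because the onode carries the backing file name (or a file handle), a read can go kv lookup, then open-by-handle, without ever consulting a directory tree for the object name.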
Re: bucket index sharding - IO throttle
Thanks Yehuda. I will do that (sorry, I was occupied by some other stuff recently, but I will try my best to provide a patch as soon as possible).

Thanks,
Guang

On Jul 31, 2014, at 1:00 AM, Yehuda Sadeh wrote:

> Can you send this code through a github pull request (or at least as a
> patch)? It'll be easier to review and comment.
>
> Thanks,
> Yehuda
>
> On Wed, Jul 30, 2014 at 7:58 AM, Guang Yang wrote:
>> +ceph-devel.
>>
>> Thanks,
>> Guang
>>
>> On Jul 29, 2014, at 10:20 PM, Guang Yang wrote:
>>
>>> Hi Yehuda,
>>> Per your review comment on IO throttling for bucket index operations, I
>>> prototyped the code below (details still need polish); can you take a
>>> look and see whether this is the right way to go?
>>>
>>> Another problem I came across is that
>>> ClsBucketIndexOpCtx::handle_completion was not called for the bucket
>>> index init op (below); is there anything obvious I missed here?
>>>
>>> Thanks,
>>> Guang
>>>
>>>
>>> class ClsBucketIndexAioThrottler {
>>> protected:
>>>   int completed;
>>>   int ret_code;
>>>   IoCtx& io_ctx;
>>>   Mutex lock;
>>>   struct LockCond {
>>>     Mutex lock;
>>>     Cond cond;
>>>     LockCond() : lock("LockCond"), cond() {}
>>>   } lock_cond;
>>> public:
>>>   ClsBucketIndexAioThrottler(IoCtx& _io_ctx)
>>>     : completed(0), ret_code(0), io_ctx(_io_ctx),
>>>       lock("ClsBucketIndexAioThrottler"), lock_cond() {}
>>>
>>>   virtual ~ClsBucketIndexAioThrottler() {}
>>>   virtual void do_next() = 0;
>>>   virtual bool is_completed() = 0;
>>>
>>>   void complete(int ret) {
>>>     {
>>>       Mutex::Locker l(lock);
>>>       if (ret < 0)
>>>         ret_code = ret;
>>>       ++completed;
>>>     }
>>>
>>>     lock_cond.lock.Lock();
>>>     lock_cond.cond.Signal();
>>>     lock_cond.lock.Unlock();
>>>   }
>>>
>>>   int get_ret_code() {
>>>     Mutex::Locker l(lock);
>>>     return ret_code;
>>>   }
>>>
>>>   virtual int wait_completion() {
>>>     lock_cond.lock.Lock();
>>>     while (1) {
>>>       if (is_completed()) {
>>>         lock_cond.lock.Unlock();
>>>         return ret_code;
>>>       }
>>>       lock_cond.cond.Wait(lock_cond.lock);  // Wait() reacquires the lock
>>>     }
>>>   }
>>> };
>>>
>>> class ClsBucketIndexListAioThrottler : public ClsBucketIndexAioThrottler {
>>> protected:
>>>   vector<string> bucket_objects;
>>>   vector<string>::iterator iter_pos;
>>> public:
>>>   ClsBucketIndexListAioThrottler(IoCtx& _io_ctx, const vector<string>& _bucket_objs)
>>>     : ClsBucketIndexAioThrottler(_io_ctx), bucket_objects(_bucket_objs),
>>>       iter_pos(bucket_objects.begin()) {}
>>>
>>>   virtual bool is_completed() {
>>>     Mutex::Locker l(lock);
>>>     int sent = 0;
>>>     vector<string>::iterator iter = bucket_objects.begin();
>>>     for (; iter != iter_pos; ++iter) ++sent;
>>>
>>>     return (sent == completed &&
>>>             (iter_pos == bucket_objects.end() /*Success*/ || ret_code < 0 /*Failure*/));
>>>   }
>>> };
>>>
>>> template <typename T>
>>> class ClsBucketIndexOpCtx : public ObjectOperationCompletion {
>>> private:
>>>   T* data;
>>>   // Return code of the operation
>>>   int* ret_code;
>>>
>>>   // The Aio completion object associated with this op; it should
>>>   // be released from within the completion handler
>>>   librados::AioCompletion* completion;
>>>   ClsBucketIndexAioThrottler* throttler;
>>> public:
>>>   ClsBucketIndexOpCtx(T* _data, int* _ret_code, librados::AioCompletion* _completion,
>>>                       ClsBucketIndexAioThrottler* _throttler)
>>>     : data(_data), ret_code(_ret_code), completion(_completion), throttler(_throttler) {}
>>>   ~ClsBucketIndexOpCtx() {}
>>>
>>>   // The completion callback: fill in the response data
>>>   void handle_completion(int r, bufferlist& outbl) {
>>>     if (r >= 0) {
>>>       if (data) {
>>>         try {
>>>           bufferlist::iterator iter = outbl.begin();
>>>           ::decode((*data), iter);
>>>         } catch (buffer::error& err) {
>>>           r = -EIO;
>>>         }
>>>       }
Re: bucket index sharding - IO throttle
+ceph-devel.

Thanks,
Guang

On Jul 29, 2014, at 10:20 PM, Guang Yang wrote:

> Hi Yehuda,
> Per your review comment on IO throttling for bucket index operations, I
> prototyped the code below (details still need polish); can you take a
> look and see whether this is the right way to go?
>
> Another problem I came across is that ClsBucketIndexOpCtx::handle_completion
> was not called for the bucket index init op (below); is there anything
> obvious I missed here?
>
> Thanks,
> Guang
>
>
> class ClsBucketIndexAioThrottler {
> protected:
>   int completed;
>   int ret_code;
>   IoCtx& io_ctx;
>   Mutex lock;
>   struct LockCond {
>     Mutex lock;
>     Cond cond;
>     LockCond() : lock("LockCond"), cond() {}
>   } lock_cond;
> public:
>   ClsBucketIndexAioThrottler(IoCtx& _io_ctx)
>     : completed(0), ret_code(0), io_ctx(_io_ctx),
>       lock("ClsBucketIndexAioThrottler"), lock_cond() {}
>
>   virtual ~ClsBucketIndexAioThrottler() {}
>   virtual void do_next() = 0;
>   virtual bool is_completed() = 0;
>
>   void complete(int ret) {
>     {
>       Mutex::Locker l(lock);
>       if (ret < 0)
>         ret_code = ret;
>       ++completed;
>     }
>
>     lock_cond.lock.Lock();
>     lock_cond.cond.Signal();
>     lock_cond.lock.Unlock();
>   }
>
>   int get_ret_code() {
>     Mutex::Locker l(lock);
>     return ret_code;
>   }
>
>   virtual int wait_completion() {
>     lock_cond.lock.Lock();
>     while (1) {
>       if (is_completed()) {
>         lock_cond.lock.Unlock();
>         return ret_code;
>       }
>       lock_cond.cond.Wait(lock_cond.lock);  // Wait() reacquires the lock
>     }
>   }
> };
>
> class ClsBucketIndexListAioThrottler : public ClsBucketIndexAioThrottler {
> protected:
>   vector<string> bucket_objects;
>   vector<string>::iterator iter_pos;
> public:
>   ClsBucketIndexListAioThrottler(IoCtx& _io_ctx, const vector<string>& _bucket_objs)
>     : ClsBucketIndexAioThrottler(_io_ctx), bucket_objects(_bucket_objs),
>       iter_pos(bucket_objects.begin()) {}
>
>   virtual bool is_completed() {
>     Mutex::Locker l(lock);
>     int sent = 0;
>     vector<string>::iterator iter = bucket_objects.begin();
>     for (; iter != iter_pos; ++iter) ++sent;
>
>     return (sent == completed &&
>             (iter_pos == bucket_objects.end() /*Success*/ || ret_code < 0 /*Failure*/));
>   }
> };
>
> template <typename T>
> class ClsBucketIndexOpCtx : public ObjectOperationCompletion {
> private:
>   T* data;
>   // Return code of the operation
>   int* ret_code;
>
>   // The Aio completion object associated with this op; it should
>   // be released from within the completion handler
>   librados::AioCompletion* completion;
>   ClsBucketIndexAioThrottler* throttler;
> public:
>   ClsBucketIndexOpCtx(T* _data, int* _ret_code, librados::AioCompletion* _completion,
>                       ClsBucketIndexAioThrottler* _throttler)
>     : data(_data), ret_code(_ret_code), completion(_completion), throttler(_throttler) {}
>   ~ClsBucketIndexOpCtx() {}
>
>   // The completion callback: fill in the response data
>   void handle_completion(int r, bufferlist& outbl) {
>     if (r >= 0) {
>       if (data) {
>         try {
>           bufferlist::iterator iter = outbl.begin();
>           ::decode((*data), iter);
>         } catch (buffer::error& err) {
>           r = -EIO;
>         }
>       }
>       // Do the next request
>     }
>     throttler->do_next();
>     throttler->complete(r);
>     if (completion) {
>       completion->release();
>     }
>   }
> };
>
>
> class ClsBucketIndexInitAioThrottler : public ClsBucketIndexListAioThrottler {
> public:
>   ClsBucketIndexInitAioThrottler(IoCtx& _io_ctx, const vector<string>& _bucket_objs)
>     : ClsBucketIndexListAioThrottler(_io_ctx, _bucket_objs) {}
>
>   virtual void do_next() {
>     string oid;
>     {
>       Mutex::Locker l(lock);
>       if (iter_pos == bucket_objects.end())
>         return;
>       oid = *(iter_pos++);
>     }
>     AioCompletion* c = librados::Rados::aio_create_completion(NULL, NULL, NULL);
>     // Dummy
>     bufferlist in;
>     librados::ObjectWriteOperation op;
>     op.create(true);
>     op.exec("rgw", "bucket_init_index", in, new ClsBucketIndexOpCtx(NULL, NULL, c, this));
>     io_ctx.aio_operate(oid, c, &op, NULL);
>   }
> };
>
>
> int cls_rgw_bucket_index_init_op(librados::IoCtx& io_ctx,
>                                  const vector<string>& bucket_objs, uint32_t max_aio)
> {
>   vector<string>::const_iterator iter = bucket_objs.begin();
>   bufferlist in;
>   ClsBucketIndexAioThrottler* throttler = new ClsBucketIndexInitAioThrottler(io_ctx, bucket_objs);
>   for (; iter != bucket_objs.end() && max_aio-- > 0; ++iter) {
>     throttler->do_next();
>   }
>   throttler->wait_completion();
>   return 0;
> }

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
rgw geo-replication to another data store?
Hi cephers, We are investigating a backup solution for Ceph; in short, we would like a solution to back up a Ceph cluster to another data store (not a Ceph cluster; assume it has a SWIFT API). We would like to have both full backup and incremental backup on top of the full backup. After going through the geo-replication blueprint [1], I am thinking that we can leverage that effort and, instead of replicating the data into another Ceph cluster, make it replicate to another data store. At the same time, I have a couple of questions which need your help: 1) How does the radosgw-agent scale to multiple hosts? Our first investigation shows it only works on a single host, but I would like to confirm. 2) Can we configure the interval for incremental backup, like 1 hour / 1 day / 1 month? [1] https://wiki.ceph.com/Planning/Blueprints/Dumpling/RGW_Geo-Replication_and_Disaster_Recovery Thanks, Guang
Re: EC pool - empty files in OSD from radosgw
Hi Yehuda and Sam, Any suggestions on the issue below? Thanks, Guang On Jul 12, 2014, at 12:43 AM, Guang Yang wrote: > Hi Loic, > I opened an issue about a change brought along with EC pools plus radosgw (http://tracker.ceph.com/issues/8625). In our test cluster we observed a large number of empty files in the OSDs. The root cause is that for a head object from radosgw, a couple of transactions come together, including create 0~0, setxattr, and writefull; since EC brings in the concept of object generations, the create transaction will first create an object, and the following writefull transaction will be taken as an update, renaming the original empty file to a generation and creating/writing a new file. As a result, we observed quite a few empty files. > > There is a bug tracking the effort to remove those files with generations, pending backport to firefly; that could definitely help our use case. However, I am also wondering if there is any room for improvement here so that those empty files would not be generated in the first place (the change might be on the radosgw side). > > Any suggestion is welcome. > > Thanks, > Guang
EC pool - empty files in OSD from radosgw
Hi Loic, I opened an issue about a change brought along with EC pools plus radosgw (http://tracker.ceph.com/issues/8625). In our test cluster we observed a large number of empty files in the OSDs. The root cause is that for a head object from radosgw, a couple of transactions come together, including create 0~0, setxattr, and writefull; since EC brings in the concept of object generations, the create transaction will first create an object, and the following writefull transaction will be taken as an update, renaming the original empty file to a generation and creating/writing a new file. As a result, we observed quite a few empty files. There is a bug tracking the effort to remove those files with generations, pending backport to firefly; that could definitely help our use case. However, I am also wondering if there is any room for improvement here so that those empty files would not be generated in the first place (the change might be on the radosgw side). Any suggestion is welcome. Thanks, Guang
Re: v0.80.2?
Hi Sage, Is it possible to include a fix for this bug - http://tracker.ceph.com/issues/8733 - in the next release, considering the scope of the change and the regression risk? We are finalizing our production launch version and this one is a blocker as we use EC pools. Thanks, Guang On Jul 11, 2014, at 7:31 AM, Sage Weil wrote: > We built v0.80.2 yesterday and pushed it out to the repos, but quickly > discovered a regression in radosgw that prevented reading objects written > with earlier versions. We pulled the packages, fixed the bug, and are > rerunning tests to confirm the fix and ensure there aren't other > upgrade-related issues. We expect to have a v0.80.3 ready tomorrow or > Monday. > > sage
radosgw - bucket index sharding merge back
Hi Yehuda, I am trying to find a way to merge back the bucket index sharding effort, and with more experience working on Ceph, I realized that the original commit was too huge, which made it troublesome to review. I am thinking of breaking it down into multiple small commits and merging back with a number of patches. I have two questions here: 1) Have you looked at the patch, and is there any suggestion I should pay attention to when doing the split? 2) Can we merge back with a series of patches (e.g. several commits per patch)? Any suggestions that I should pay attention to so as to drive this effort to completion? Thanks, Guang
Re: CDS G/H - bucket index sharding
Thanks Yehuda, my comments inline... On Jun 23, 2014, at 10:44 PM, Yehuda Sadeh wrote: > On Mon, Jun 23, 2014 at 4:11 AM, Guang Yang wrote: >> Hello Yehuda, >> I drafted a brief summary of the status of the bucket index sharding >> blueprint and put it here - >> http://pad.ceph.com/p/GH-bucket-index-scalability; it would be nice if you could >> take a look to see if there is anything I missed. I also posted the pull >> request here - https://github.com/ceph/ceph/pull/2013. > > Just one note regarding the blueprint: other BI log operations will > need to use the new schema too (e.g., log trim operations). Yeah, that has been implemented, thanks for pointing it out. > > I was thinking a bit about how to do resizing and dynamic sharding > later on. My thought was that we'd have two bucket prefixes: one for > read and delete operations, and one for read, write and delete > operations. Normally both will point at the same prefix and we'll just > access a single one. But when we're resizing we'll need to use both. > If we're listing objects we'll access both sets of shards and merge > everything. If we're creating an object we'll just create it in the > second one. Removing an object, we'll remove it from both. > The above description is a bit vague, and shouldn't really change what > we do now. Just that the implementation needs to maybe abstract that > bucket access decision nicely so that in the future we could implement > this easily. Considering the tradeoff we have with multiple shards for the bucket index object, we are not likely to create a large number of shards (unless we add something like per-shard listing), thus it might make sense to start with the upper bound directly (e.g. 50); it might be good enough for most use cases. Another direction we may explore is to let the user specify the number of shards (e.g. via user-defined metadata) when he/she has an estimate of the number of objects for the bucket.
As for dynamic bucket resharding, I think there are two options: one with no data migration when changing the number of shards (thus there might be multiple versions of the truth), and another with data migration. The approach mentioned above is the first one; we should be able to implement it with some aggregation at the client side to merge the multiple versions of the truth. > > Sadly I'll be off for this CDS, but I'm sure Josh, Greg, Sage, and > others will be able to help there. > > Thanks, > Yehuda
XFS - number of files in a directory
Hello Cephers, We used to have a Ceph cluster with our data pool set to 3 replicas; we estimated the number of files (given disk size and object size) for each PG to be around 8K, and we disabled folder splitting, which meant all files were located in the root PG folder. Our testing showed good performance with such a setup. Right now we are evaluating erasure coding, which splits each object into a number of chunks and increases the number of files several times over; although XFS claims good support for large directories [1], some testing also showed that we may expect performance degradation for large directories. I would like to hear about your experience with this on your Ceph cluster if you are using XFS. Thanks. [1] http://www.scs.stanford.edu/nyu/02fa/sched/xfs.pdf Thanks, Guang
CDS G/H - bucket index sharding
Hello Yehuda, I drafted a brief summary of the status of the bucket index sharding blueprint and put it here - http://pad.ceph.com/p/GH-bucket-index-scalability; it would be nice if you could take a look to see if there is anything I missed. I also posted the pull request here - https://github.com/ceph/ceph/pull/2013. Thanks, Guang
PG folder splitting proposal
Hi Sage, Would you please help to comment on the proposal in comment 7 of this ticket - http://tracker.ceph.com/issues/7593#note-7 ? As we move to EC pools, the number of files for each PG increases several times over, which makes folder splitting more likely to happen. Do you think it is a worthwhile change to do pre-splitting at pool creation time, with a hint stored as pool metadata? Thanks, Guang
Re: Changes of scrubbing?
On Jun 11, 2014, at 10:02 PM, Gregory Farnum wrote: > On Wed, Jun 11, 2014 at 12:54 AM, Guang Yang wrote: >> On Jun 11, 2014, at 6:33 AM, Gregory Farnum wrote: >> >>> On Tue, May 20, 2014 at 6:44 PM, Guang Yang wrote: >>>> Hi ceph-devel, >>>> Like some users of Ceph, we are using Ceph for a latency sensitive >>>> project, and scrubbing (especially deep-scrubbing) impacts the SLA in a >>>> non-trivial way; as commodity hardware could fail in one way or the other, >>>> I think it is essential to have scrubbing enabled to preserve data >>>> durability. >>>> >>>> Inspired by how the erasure coding backend implements scrubbing[1], I am >>>> wondering if the following changes are valid to somehow reduce the >>>> performance impact of scrubbing: >>>> 1. Store the CRC checksum along with each physical copy of the object on the >>>> filesystem (via xattr or omap?) >>>> 2. For a read request, check the CRC locally and if it mismatches, >>>> redirect the request to a replica and mark the PG as inconsistent. >>> >>> The problem with this is that you need to maintain the CRC across >>> partial overwrites of the object. And the real cost of scrubbing isn't >>> in the network traffic, it's in the disk reads, which you would have >>> to do anyway with this method. :) >> Thanks Greg for the response! >> Partial update is the right concern if that happens frequently. However, the >> major benefit of this proposal is to postpone the CRC check to the READ request >> instead of doing it from within a background job (although we may still need >> to do background checks as deep-scrubbing, we can reduce the frequency >> dramatically). By checking the CRC at read time, inconsistent objects are >> detected (and the PG marked inconsistent), and further we can trigger a repair for the PG. > > Oh, I see. > Still, partial update is in fact the major concern. We have a debug > mechanism called "sloppy crc" or similar that keeps track of them for > full (or sufficiently large?)
writes, but it's not something you can > use on a production cluster because it turns every write into a > read-modify-write cycle, and that's just prohibitively expensive (in > addition to issues with stuff like OSD restarts, I think). This sort of > thing would make sense for the erasure-coded pools; maybe that would > be a better place to start? Yeah, that sounds like a good starting point; let me see if I can spend some time doing a simple POC. Thanks Greg. > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: Changes of scrubbing?
On Jun 11, 2014, at 6:33 AM, Gregory Farnum wrote: > On Tue, May 20, 2014 at 6:44 PM, Guang Yang wrote: >> Hi ceph-devel, >> Like some users of Ceph, we are using Ceph for a latency sensitive project, >> and scrubbing (especially deep-scrubbing) impacts the SLA in a non-trivial >> way; as commodity hardware could fail in one way or the other, I think it is >> essential to have scrubbing enabled to preserve data durability. >> >> Inspired by how the erasure coding backend implements scrubbing[1], I am >> wondering if the following changes are valid to somehow reduce the >> performance impact of scrubbing: >> 1. Store the CRC checksum along with each physical copy of the object on the >> filesystem (via xattr or omap?) >> 2. For a read request, check the CRC locally and if it mismatches, redirect >> the request to a replica and mark the PG as inconsistent. > > The problem with this is that you need to maintain the CRC across > partial overwrites of the object. And the real cost of scrubbing isn't > in the network traffic, it's in the disk reads, which you would have > to do anyway with this method. :) Thanks Greg for the response! Partial update is the right concern if that happens frequently. However, the major benefit of this proposal is to postpone the CRC check to the READ request instead of doing it from within a background job (although we may still need to do background checks as deep-scrubbing, we can reduce the frequency dramatically). By checking the CRC at read time, inconsistent objects are detected (and the PG marked inconsistent), and further we can trigger a repair for the PG. > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com > >> >> This is just a general idea and details (like append) will need to be further >> discussed. >> >> By having such a scheme, we can schedule the scrubbing less aggressively but still >> preserve the durability for reads. >> >> Does this make some sense?
>> >> [1] http://ceph.com/docs/master/dev/osd_internals/erasure_coding/pgbackend/ >> >> Thanks, >> Guang Yang
Re: Radosgw - bucket index
Hi Yehuda, Can you take a high-level look at the code change? Here is the pull request - https://github.com/ceph/ceph/pull/1929. If things look good to you, I will continue the effort and make it more clear/complete by the end of next week. Thanks, Guang On Jun 2, 2014, at 9:37 PM, Guang Yang wrote: > Hi Yehuda and Sage, > Can you help comment on the ticket? I would like to send out a pull > request sometime this week for you to review, but before that, it would be > nice to see your comments in terms of the interface and any other concerns > you may have for this. Thanks. > > Thanks, > Guang > > > On May 30, 2014, at 8:35 AM, Guang Yang wrote: > >> Hi Yehuda, >> I opened an issue here: http://tracker.ceph.com/issues/8473, please help to >> review and comment. >> >> Thanks, >> Guang >> >> On May 19, 2014, at 2:47 PM, Yehuda Sadeh wrote: >> >>> On Sun, May 18, 2014 at 11:18 PM, Guang Yang wrote: >>>> On May 19, 2014, at 7:05 AM, Sage Weil wrote: >>>> >>>>> On Sun, 18 May 2014, Guang wrote: >>>>>>>> radosgw is using the omap key/value API for objects, which is more or >>>>>>>> less >>>>>>>> equivalent to what swift is doing with sqlite. This data passes >>>>>>>> straight >>>>>>>> into leveldb on the backend (or whatever other backend you are using). >>>>>>>> Using something like rocksdb in its place is pretty simple and there are >>>>>>>> unmerged patches to do that; the user would just need to adjust their >>>>>>>> crush map so that the rgw index pool is mapped to a different set of >>>>>>>> OSDs >>>>>>>> with the better k/v backend. >>>>>> Not sure if I missed anything, but the key difference with SWIFT's >>>>>> implementation is that they are using a table for the bucket index, and it >>>>>> actually can be updated in parallel, which makes it more scalable for writes, >>>>>> though at a certain point the sql table would result in performance >>>>>> degradation as well.
>>>>> >>>>> As I understand it the same limitation is present there too: the index is >>>>> in a single sqlite table. >>>>> >>>>>>> My more well-formed opinion is that we need to come up with a good >>>>>>> design. It needs to be flexible enough to be able to grow (and maybe >>>>>>> shrink), and I assume there would be some kind of background operation >>>>>>> that will enable that. I also believe that making it hash based is the >>>>>>> way to go. It looks like the more complicated issue here is >>>>>>> how to handle the transition in which we shard buckets. >>>>>> Yeah I agree. I think the conflicting goals here are, we want a sorted >>>>>> list (so that it enables prefix scans for listing purposes) and we want to >>>>>> shard at the very beginning (the problem we are facing is that parallel >>>>>> writes updating the same bucket index object will need to be >>>>>> serialized). >>>>> >>>>> Given how infrequent container listings are, pre-sharding containers >>>>> across several objects makes some sense. Paying the cost of doing >>>>> listings in parallel across N (where N is not too big) is not a big price >>>>> to pay. However, there will always need to be a way to re-shard further >>>>> when containers/buckets get extremely big. Perhaps a starting point would >>>>> be support for static sharding where the number of shards is specified at >>>>> container/bucket creation time… >>>> Considering the scope of the change, I also think this is a good starting >>>> point to make the bucket index updating more scalable. >>>> Yehuda, >>>> What do you think?
>>> >>> Yehuda
Re: Radosgw - bucket index
Hi Yehuda and Sage, Can you help comment on the ticket? I would like to send out a pull request sometime this week for you to review, but before that, it would be nice to see your comments in terms of the interface and any other concerns you may have for this. Thanks. Thanks, Guang On May 30, 2014, at 8:35 AM, Guang Yang wrote: > Hi Yehuda, > I opened an issue here: http://tracker.ceph.com/issues/8473, please help to > review and comment. > > Thanks, > Guang > > On May 19, 2014, at 2:47 PM, Yehuda Sadeh wrote: > >> On Sun, May 18, 2014 at 11:18 PM, Guang Yang wrote: >>> On May 19, 2014, at 7:05 AM, Sage Weil wrote: >>> >>>> On Sun, 18 May 2014, Guang wrote: >>>>>>> radosgw is using the omap key/value API for objects, which is more or >>>>>>> less >>>>>>> equivalent to what swift is doing with sqlite. This data passes >>>>>>> straight >>>>>>> into leveldb on the backend (or whatever other backend you are using). >>>>>>> Using something like rocksdb in its place is pretty simple and there are >>>>>>> unmerged patches to do that; the user would just need to adjust their >>>>>>> crush map so that the rgw index pool is mapped to a different set of >>>>>>> OSDs >>>>>>> with the better k/v backend. >>>>> Not sure if I missed anything, but the key difference with SWIFT's >>>>> implementation is that they are using a table for the bucket index, and it >>>>> actually can be updated in parallel, which makes it more scalable for writes, >>>>> though at a certain point the sql table would result in performance >>>>> degradation as well. >>>> >>>> As I understand it the same limitation is present there too: the index is >>>> in a single sqlite table. >>>> >>>>>> My more well-formed opinion is that we need to come up with a good >>>>>> design. It needs to be flexible enough to be able to grow (and maybe >>>>>> shrink), and I assume there would be some kind of background operation >>>>>> that will enable that. I also believe that making it hash based is the >>>>>> way to go.
It looks like the more complicated issue here is >>>>>> how to handle the transition in which we shard buckets. >>>>> Yeah I agree. I think the conflicting goals here are, we want a sorted >>>>> list (so that it enables prefix scans for listing purposes) and we want to >>>>> shard at the very beginning (the problem we are facing is that parallel >>>>> writes updating the same bucket index object will need to be >>>>> serialized). >>>> >>>> Given how infrequent container listings are, pre-sharding containers >>>> across several objects makes some sense. Paying the cost of doing >>>> listings in parallel across N (where N is not too big) is not a big price >>>> to pay. However, there will always need to be a way to re-shard further >>>> when containers/buckets get extremely big. Perhaps a starting point would >>>> be support for static sharding where the number of shards is specified at >>>> container/bucket creation time… >>> Considering the scope of the change, I also think this is a good starting >>> point to make the bucket index updating more scalable. >>> Yehuda, >>> What do you think? >> >> Sharding it will help with scaling it up to a certain point. As Sage >> mentioned we can start with a static setting as a first simpler >> approach, and move into a dynamic approach later on. >> >> Yehuda
Re: Radosgw - bucket index
Hi Yehuda, I opened an issue here: http://tracker.ceph.com/issues/8473, please help to review and comment. Thanks, Guang On May 19, 2014, at 2:47 PM, Yehuda Sadeh wrote: > On Sun, May 18, 2014 at 11:18 PM, Guang Yang wrote: >> On May 19, 2014, at 7:05 AM, Sage Weil wrote: >> >>> On Sun, 18 May 2014, Guang wrote: >>>>>> radosgw is using the omap key/value API for objects, which is more or >>>>>> less equivalent to what swift is doing with sqlite. This data passes straight >>>>>> into leveldb on the backend (or whatever other backend you are using). >>>>>> Using something like rocksdb in its place is pretty simple and there are >>>>>> unmerged patches to do that; the user would just need to adjust their >>>>>> crush map so that the rgw index pool is mapped to a different set of OSDs >>>>>> with the better k/v backend. >>>> Not sure if I missed anything, but the key difference with SWIFT's >>>> implementation is that they are using a table for the bucket index, and it >>>> can actually be updated in parallel, which makes it more scalable for writes, >>>> though at a certain point the sql table would suffer performance >>>> degradation as well. >>> >>> As I understand it the same limitation is present there too: the index is >>> in a single sqlite table. >>> >>>>> My more well-formed opinion is that we need to come up with a good >>>>> design. It needs to be flexible enough to be able to grow (and maybe >>>>> shrink), and I assume there would be some kind of background operation >>>>> that will enable that. I also believe that making it hash based is the >>>>> way to go. It looks like the more complicated issue here is >>>>> how to handle the transition in which we shard buckets. >>>> Yeah, I agree. 
I think the conflicting goals here are: we want a sorted >>>> list (so that it enables prefix scans for listing purposes), and we want to >>>> shard from the very beginning (the problem we are facing is that parallel >>>> writes updating the same bucket index object need to be >>>> serialized). >>> >>> Given how infrequent container listings are, pre-sharding containers >>> across several objects makes some sense. Doing >>> listings in parallel across N shards (where N is not too big) is not a big price >>> to pay. However, there will always need to be a way to re-shard further >>> when containers/buckets get extremely big. Perhaps a starting point would >>> be support for static sharding where the number of shards is specified at >>> container/bucket creation time… >> Considering the scope of the change, I also think this is a good starting >> point to make bucket index updates more scalable. >> Yehuda, >> what do you think? > > Sharding will help with scaling up to a certain point. As Sage > mentioned, we can start with a static setting as a first, simpler > approach, and move to a dynamic approach later on. > > Yehuda
Re: erasure code & reliability model
Thanks Koleos! Guang On May 26, 2014, at 7:02 PM, Koleos Fuskus wrote: > Hello Guang, > > Here is the wiki: > https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_modeling > > Koleos > > > > On Monday, May 26, 2014 10:05 AM, Guang Yang wrote: > Hello Loic and Koleos, > Do we have any wiki page documenting the progress and reports of this effort? > We are very interested in it as well. > > Thanks, > Guang > > > > > On May 8, 2014, at 1:19 AM, Loic Dachary wrote: > >> >> Hi, >> >> On 07/05/2014 18:43, Koleos Fuskus wrote: >>> Hi Loic, >>> >>> What do you mean by action plan? If you mean the schedule, it is published with my >>> proposal on the melange site. Indeed, the details of my proposal are private; >>> if that is what you mean, I can add them to the wiki. If you want this part to >>> be public, no problem. >> >> Yes, that's what I meant :-) You could collect snippets from the proposal >> posted on melange and present them on a subpage of >> https://wiki.ceph.com/Development and we can use this as the "home page" of >> your work? >> >>> Actually I am using some of my free time to read the ceph documentation (web >>> and papers). Do you have any specific document to recommend? Maybe we >>> can discuss the durability model again on Friday/Monday. I need some >>> more understanding of the ceph architecture. >> >> I'm connected on irc.oftc.net#ceph-devel today and tomorrow if you'd like to >> chat during this "community bonding" phase (I don't remember what google >> calls this for GSoC participants ;-) >> >> Cheers >> >>> >>> Cheers, >>> Verónica >>> On Wednesday, May 7, 2014 8:12 AM, Loic Dachary wrote: >>> Hi Veronica, >>> >>> I was really happy to hear you're going to work on the erasure code >>> reliability model. Unless I'm mistaken, your action plan was not published. >>> Would you mind adding it to the wiki so I can comment on it? Someone else >>> might be interested in contributing too. 
I've had a discussion yesterday >>> about durability models (internally) and they are not well understood. Your >>> insight would be precious. >>> >>> Cheers >>> >>> -- >>> Loïc Dachary, Artisan Logiciel Libre >>> >>> >> >> -- >> Loïc Dachary, Artisan Logiciel Libre >> > >
Re: erasure code & reliability model
Hello Loic and Koleos, Do we have any wiki page documenting the progress and reports of this effort? We are very interested in it as well. Thanks, Guang On May 8, 2014, at 1:19 AM, Loic Dachary wrote: > > Hi, > > On 07/05/2014 18:43, Koleos Fuskus wrote: >> Hi Loic, >> >> What do you mean by action plan? If you mean the schedule, it is published with my >> proposal on the melange site. Indeed, the details of my proposal are private; if >> that is what you mean, I can add them to the wiki. If you want this part to be >> public, no problem. > > Yes, that's what I meant :-) You could collect snippets from the proposal > posted on melange and present them on a subpage of > https://wiki.ceph.com/Development and we can use this as the "home page" of > your work? > >> Actually I am using some of my free time to read the ceph documentation (web >> and papers). Do you have any specific document to recommend? Maybe we can >> discuss the durability model again on Friday/Monday. I need some more >> understanding of the ceph architecture. > > I'm connected on irc.oftc.net#ceph-devel today and tomorrow if you'd like to > chat during this "community bonding" phase (I don't remember what google calls > this for GSoC participants ;-) > > Cheers > >> >> Cheers, >> Verónica >> On Wednesday, May 7, 2014 8:12 AM, Loic Dachary wrote: >> Hi Veronica, >> >> I was really happy to hear you're going to work on the erasure code >> reliability model. Unless I'm mistaken, your action plan was not published. >> Would you mind adding it to the wiki so I can comment on it? Someone else >> might be interested in contributing too. I've had a discussion yesterday about >> durability models (internally) and they are not well understood. Your insight >> would be precious. 
>> >> Cheers >> >> -- >> Loïc Dachary, Artisan Logiciel Libre >> >> > > -- > Loïc Dachary, Artisan Logiciel Libre >
Re: Questions of KeyValueStore (leveldb) backend
Thanks! On May 26, 2014, at 12:55 PM, Haomai Wang wrote: > On Mon, May 26, 2014 at 9:46 AM, Guang Yang wrote: >> Hello Haomai, >> We are evaluating the key-value store backend that comes along with the Firefly >> release (thanks for implementing it in Ceph); it is very promising for a >> couple of our use cases. After going through the related code change, I have >> a couple of questions that need your help: >> 1. One observation is that objects larger than 1KB are striped >> into multiple chunks (k-v pairs in the leveldb table), with a stripe size of 1KB. >> Is there any particular reason we chose 1KB as the stripe size (I didn't >> find a configuration option to tune this value)? > > 1KB is not a carefully chosen value; it can be made configurable in the near future. > >> >> 2. This is probably a leveldb question: do we expect performance >> degradation as the leveldb instance keeps growing (e.g. to several TB)? > > A Ceph OSD is expected to own a physical disk, normally several > TB (1-4TB). LevelDB can handle that easily, especially since we use it to store > large values (compared to normal application usage). > >> >> Thanks, >> Guang > > > > -- > Best Regards, > > Wheat >
Questions of KeyValueStore (leveldb) backend
Hello Haomai, We are evaluating the key-value store backend that comes along with the Firefly release (thanks for implementing it in Ceph); it is very promising for a couple of our use cases. After going through the related code change, I have a couple of questions that need your help: 1. One observation is that objects larger than 1KB are striped into multiple chunks (k-v pairs in the leveldb table), with a stripe size of 1KB. Is there any particular reason we chose 1KB as the stripe size (I didn't find a configuration option to tune this value)? 2. This is probably a leveldb question: do we expect performance degradation as the leveldb instance keeps growing (e.g. to several TB)? Thanks, Guang
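[Editor's sketch] For readers unfamiliar with the striping asked about above, here is a rough illustration of the idea. The 1KB constant and the chunk key format are assumptions for illustration, not the actual KeyValueStore layout: an object's payload is split into fixed-size stripes, each stored as its own key-value pair.

```python
STRIP_SIZE = 1024  # the 1KB stripe size the question asks about; assumed fixed here

def stripe_object(name, data, strip_size=STRIP_SIZE):
    """Split an object's payload into fixed-size chunks, one k/v pair per
    chunk, so a large value never reaches the backend as one huge value."""
    return {
        f"{name}.{i // strip_size:08d}": data[i:i + strip_size]
        for i in range(0, len(data), strip_size)
    }

chunks = stripe_object("obj1", b"x" * 2500)
print(sorted(chunks))                 # ['obj1.00000000', 'obj1.00000001', 'obj1.00000002']
print(len(chunks["obj1.00000002"]))   # 452 -- the final partial stripe
```

Making the stripe size tunable, as the reply suggests, would just mean threading this constant through from the OSD configuration.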
Changes of scrubbing?
Hi ceph-devel, Like some users of Ceph, we are using Ceph for a latency-sensitive project, and scrubbing (especially deep-scrubbing) impacts the SLA in a non-trivial way. As commodity hardware can fail in one way or another, I think it is essential to keep scrubbing enabled to preserve data durability. Inspired by how the erasure coding backend implements scrubbing [1], I am wondering if the following change is valid as a way to reduce the performance impact of scrubbing: 1. Store the CRC checksum along with each physical copy of the object on the filesystem (via xattr or omap?) 2. For a read request, check the CRC locally; if it mismatches, redirect the request to a replica and mark the PG as inconsistent. This is just a general idea, and the details (like append handling) need further discussion. With this in place, we could schedule scrubbing less aggressively but still preserve durability for reads. Does this make sense? [1] http://ceph.com/docs/master/dev/osd_internals/erasure_coding/pgbackend/ Thanks, Guang Yang
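[Editor's sketch] The two steps proposed above can be sketched roughly as follows. This is a hypothetical Python illustration, not OSD code: the xattr name, the Copy class, and the redirect/mark mechanics are all assumptions made for the example.

```python
import zlib

CRC_XATTR = "user.obj.crc32"  # hypothetical xattr name for the stored checksum

class Copy:
    """One physical replica of an object, with its checksum stored alongside."""
    def __init__(self, data):
        self.data = data
        self.xattrs = {CRC_XATTR: zlib.crc32(data)}  # set at write time

def read_with_crc_check(primary, replicas, mark_inconsistent):
    """Verify the local CRC on every read; on a mismatch, redirect the read
    to a replica and mark the PG inconsistent for a later repair scrub."""
    if zlib.crc32(primary.data) == primary.xattrs[CRC_XATTR]:
        return primary.data
    mark_inconsistent()
    for r in replicas:
        if zlib.crc32(r.data) == r.xattrs[CRC_XATTR]:
            return r.data
    raise IOError("no clean copy available")

good, bad = Copy(b"hello"), Copy(b"hello")
bad.data = b"hellp"          # simulate silent corruption of the primary copy
flags = []
data = read_with_crc_check(bad, [good], lambda: flags.append("inconsistent"))
print(data, flags)  # b'hello' ['inconsistent']
```

The open details the message itself flags (e.g. appends, which would need an incremental CRC update rather than a full recompute) are exactly where a sketch like this stops being adequate.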
Re: Radosgw - bucket index
On May 19, 2014, at 7:05 AM, Sage Weil wrote: > On Sun, 18 May 2014, Guang wrote: >>>> radosgw is using the omap key/value API for objects, which is more or less >>>> equivalent to what swift is doing with sqlite. This data passes straight >>>> into leveldb on the backend (or whatever other backend you are using). >>>> Using something like rocksdb in its place is pretty simple and there are >>>> unmerged patches to do that; the user would just need to adjust their >>>> crush map so that the rgw index pool is mapped to a different set of OSDs >>>> with the better k/v backend. >> Not sure if I missed anything, but the key difference with SWIFT's >> implementation is that they are using a table for the bucket index, and it >> can actually be updated in parallel, which makes it more scalable for writes, >> though at a certain point the sql table would suffer performance >> degradation as well. > > As I understand it the same limitation is present there too: the index is > in a single sqlite table. > >>> My more well-formed opinion is that we need to come up with a good >>> design. It needs to be flexible enough to be able to grow (and maybe >>> shrink), and I assume there would be some kind of background operation >>> that will enable that. I also believe that making it hash based is the >>> way to go. It looks like the more complicated issue here is >>> how to handle the transition in which we shard buckets. >> Yeah, I agree. I think the conflicting goals here are: we want a sorted >> list (so that it enables prefix scans for listing purposes), and we want to >> shard from the very beginning (the problem we are facing is that parallel >> writes updating the same bucket index object need to be >> serialized). > > Given how infrequent container listings are, pre-sharding containers > across several objects makes some sense. Doing > listings in parallel across N shards (where N is not too big) is not a big price > to pay. 
However, there will always need to be a way to re-shard further > when containers/buckets get extremely big. Perhaps a starting point would > be support for static sharding where the number of shards is specified at > container/bucket creation time… Considering the scope of the change, I also think this is a good starting point to make bucket index updates more scalable. Yehuda, what do you think? > > sage >