Re: Long peering - throttle at FileStore::queue_transactions

2016-01-05 Thread Guang Yang
On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil  wrote:
> On Mon, 4 Jan 2016, Guang Yang wrote:
>> Hi Cephers,
>> Happy New Year! I have a question regarding long PG peering..
>>
>> Over the last several days I have been looking into the *long peering*
>> problem when we start an OSD / OSD host. What I observed was that the
>> two peering worker threads were throttled (stuck) when trying to
>> queue new transactions (writing the pg log), so the peering process was
>> dramatically slowed down.
>>
>> The first question that came to mind was: what were the transactions in
>> the queue? The major ones, as I saw, included:
>>
>> - The osd_map and incremental osd_maps. This happens if the OSD had
>> been down for a while (in a large cluster), or when the cluster got
>> upgraded, which left the osd_map epoch the down OSD had far behind
>> the latest osd_map epoch. During OSD boot, it would need to persist
>> all those osd_maps, generating lots of filestore transactions
>> (linear in the epoch gap).
>> > As the PG was not involved in most of those epochs, could we only take and
>> > persist those osd_maps which matter to the PGs on the OSD?
>
> This part should happen before the OSD sends the MOSDBoot message, before
> anyone knows it exists.  There is a tunable threshold that controls how
> recent the map has to be before the OSD tries to boot.  If you're
> seeing this in the real world, we probably just need to adjust that value
> way down to something small(er).
It would queue the transactions and then send out the MOSDBoot, thus
there is still a chance that it could have contention with the peering
OPs (especially on large clusters where lots of activity generates
many osdmap epochs). Any chance we can change *queue_transactions* to
*apply_transactions*, so that we block there waiting for the osdmap to
be persisted? At least we may be able to do that during OSD boot. The
concern is that if the OSD is active, apply_transactions would take
longer while holding the osd_lock..
I couldn't find such a tunable, could you elaborate? Thanks!
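
To illustrate the proposal, here is a minimal self-contained sketch
(simplified, assumed signatures - the real ObjectStore/FileStore API also
takes a sequencer and op tracking, so this is not the actual interface):

#include <list>

struct Transaction {};

struct ObjectStore {
  // Enqueue the transactions and return immediately; the call may still
  // stall on the filestore throttle when the queue is full.
  virtual int queue_transactions(std::list<Transaction*>& tls) = 0;
  // Return only after the transactions have been applied to the store.
  virtual int apply_transactions(std::list<Transaction*>& tls) = 0;
  virtual ~ObjectStore() {}
};

// During boot, persist the osdmap backlog synchronously so those writes
// cannot pile up in the queue and starve the peering worker threads.
int persist_osdmaps_on_boot(ObjectStore *store, std::list<Transaction*>& tls) {
  return store->apply_transactions(tls);
}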
>
> sage
>


Re: OSD data file are OSD logs

2016-01-04 Thread Guang Yang
Thanks Sam for the confirmation.

Thanks,
Guang

On Mon, Jan 4, 2016 at 3:59 PM, Samuel Just  wrote:
> IIRC, you are running giant.  I think that's the log rotate dangling
> fd bug (not fixed in giant since giant is eol).  Fixed upstream
> 8778ab3a1ced7fab07662248af0c773df759653d, firefly backport is
> b8e3f6e190809febf80af66415862e7c7e415214.
> -Sam
>
> On Mon, Jan 4, 2016 at 3:37 PM, Guang Yang  wrote:
>> Hi Cephers,
>> Before I open a tracker, I would like to check whether this is a known issue..
>>
>> On one of our clusters, there was an OSD crash during repair; the
>> crash happened after we issued a PG repair for inconsistent PGs, which
>> failed because the recorded file size (within the xattr) mismatched
>> the actual file size.
>>
>> The mismatch was caused by the fact that the content of the data file
>> is OSD log output; the following is from osd.354 on c003:
>>
>> -rw-r--r-- 1 yahoo root  75168 Jan  3 07:30
>> default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7
>> -bash-4.1$ head
>> "default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7"
>> 2016-01-03 07:30:01.600119 7f7fe2096700 15
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) getattrs
>> 3.171s7_head/a2478171/default.12061.9_8396947527_52ac8b3ec6_o.jpg/head//3/18446744073709551615/7
>> 2016-01-03 07:30:01.604967 7f7fe2096700 10
>> filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, len is 494
>> 2016-01-03 07:30:01.604984 7f7fe2096700 10
>> filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, got 247
>> 2016-01-03 07:30:01.604986 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.idtag'
>> 2016-01-03 07:30:01.604996 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_'
>> 2016-01-03 07:30:01.605007 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> 'snapset'
>> 2016-01-03 07:30:01.605013 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.manifest'
>> 2016-01-03 07:30:01.605026 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> 'hinfo_key'
>> 2016-01-03 07:30:01.605042 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.x-amz-meta-origin'
>> 2016-01-03 07:30:01.605049 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.acl'
>>
>>
>> This only happens on clusters where we turned on verbose logging
>> (debug_osd/filestore=20). We are running ceph v0.87.
>>
>> Thanks,
>> Guang


OSD data file are OSD logs

2016-01-04 Thread Guang Yang
Hi Cephers,
Before I open a tracker, I would like to check whether this is a known issue..

On one of our clusters, there was an OSD crash during repair; the
crash happened after we issued a PG repair for inconsistent PGs, which
failed because the recorded file size (within the xattr) mismatched
the actual file size.

The mismatch was caused by the fact that the content of the data file
is OSD log output; the following is from osd.354 on c003:

-rw-r--r-- 1 yahoo root  75168 Jan  3 07:30
default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7
-bash-4.1$ head
"default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7"
2016-01-03 07:30:01.600119 7f7fe2096700 15
filestore(/home/y/var/lib/ceph/osd/ceph-354) getattrs
3.171s7_head/a2478171/default.12061.9_8396947527_52ac8b3ec6_o.jpg/head//3/18446744073709551615/7
2016-01-03 07:30:01.604967 7f7fe2096700 10
filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, len is 494
2016-01-03 07:30:01.604984 7f7fe2096700 10
filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, got 247
2016-01-03 07:30:01.604986 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'_user.rgw.idtag'
2016-01-03 07:30:01.604996 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_'
2016-01-03 07:30:01.605007 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'snapset'
2016-01-03 07:30:01.605013 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'_user.rgw.manifest'
2016-01-03 07:30:01.605026 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'hinfo_key'
2016-01-03 07:30:01.605042 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'_user.rgw.x-amz-meta-origin'
2016-01-03 07:30:01.605049 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'_user.rgw.acl'


This only happens on clusters where we turned on verbose logging
(debug_osd/filestore=20). We are running ceph v0.87.

Thanks,
Guang


Long peering - throttle at FileStore::queue_transactions

2016-01-04 Thread Guang Yang
Hi Cephers,
Happy New Year! I have a question regarding long PG peering..

Over the last several days I have been looking into the *long peering*
problem when we start an OSD / OSD host. What I observed was that the
two peering worker threads were throttled (stuck) when trying to
queue new transactions (writing the pg log), so the peering process was
dramatically slowed down.

The first question that came to mind was: what were the transactions in
the queue? The major ones, as I saw, included:

- The osd_map and incremental osd_maps. This happens if the OSD had
been down for a while (in a large cluster), or when the cluster got
upgraded, which left the osd_map epoch the down OSD had far behind
the latest osd_map epoch. During OSD boot, it would need to persist
all those osd_maps, generating lots of filestore transactions
(linear in the epoch gap).
> As the PG was not involved in most of those epochs, could we only take and
> persist those osd_maps which matter to the PGs on the OSD?

- There are lots of deletion transactions: as a PG boots, it needs to
merge the PG log from its peers, and for each deletion PG log entry, it
queues the deletion transaction immediately.
> Could we delay queueing those transactions until all PGs on the host are
> peered?
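
A minimal sketch of this second idea (a hypothetical helper, not existing
code): stage the deletions while merging the log, and only queue them once
peering has finished.

#include <list>

struct Transaction {};

struct ObjectStore {
  virtual int queue_transactions(std::list<Transaction*>& tls) = 0;
  virtual ~ObjectStore() {}
};

struct DeferredDeletes {
  std::list<Transaction*> staged;

  // Called while merging the PG log: remember the deletion instead of
  // queueing it immediately.
  void stage(Transaction *t) { staged.push_back(t); }

  // Called once all PGs on the host have peered: queue everything at once.
  int flush(ObjectStore *store) {
    int r = store->queue_transactions(staged);
    staged.clear();
    return r;
  }
};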

Thanks,
Guang


Re: Newly added monitor infinitely sync store

2015-11-16 Thread Guang Yang
On Mon, Nov 16, 2015 at 5:42 PM, Sage Weil  wrote:
> On Mon, 16 Nov 2015, Guang Yang wrote:
>> I spoke to a leveldb expert; it looks like this is a known pattern with
>> LSM-tree data structures - the tail latency for a range scan can be far
>> longer than the avg/median since it might need to mmap several sst files
>> to get the record.
>>
>> Hi Sage,
>> Do you see any harm in increasing the default value for this setting
>> (e.g. to 20 minutes)? Or should I add this advice to the monitor
>> troubleshooting docs?
>
> The timeout is just for a round trip for the sync process, right?  I think
> increasing it a bit (2x or 3x?) is okay, but 20 minutes to do a single
> chunk is a lot.
Yeah, the timeout is for a single round trip (there is a timeout reset
mechanism on both sides).
>
> The underlying problem in your cases is that your store is huge (by ~2
> orders of magnitude), so I'm not sure we should tune against that :)
Ok, let me apply the patches and monitor the db growth.
>
> sage
>
>
>  >
>> Thanks,
>> Guang
>>
>> On Fri, Nov 13, 2015 at 9:07 PM, Guang Yang  wrote:
>> > Thanks Sage! I will definitely try those patches.
>> >
>> > For this one, I finally managed to bring the new monitor in by
>> > increasing the mon_sync_timeout from its default 60 to 6 to make
>> > sure the syncing does not restart and result in an infinite loop..
>> >
>> > On Fri, Nov 13, 2015 at 5:04 PM, Sage Weil  wrote:
>> >> On Fri, 13 Nov 2015, Guang Yang wrote:
>> >>> Thanks Sage!
>> >>>
>> >>> On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil  wrote:
>> >>> > On Fri, 13 Nov 2015, Guang Yang wrote:
>> >>> >> I was wrong the previous analysis, it was not the iterator got reset,
>> >>> >> the problem I can see now, is that during the syncing, a new round of
>> >>> >> election kicked off and thus it needs to probe the newly added
>> >>> >> monitor, however, since it hasn't been synced yet, it will restart the
>> >>> >> syncing from there.
>> >>> >
>> >>> > What version of this?  I think this is something we fixed a while back?
>> >>> This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578); is there
>> >>> a commit I can take a look at?
>> >>
>> >> Hrm, I guess it was way before that.. I'm thinking of
>> >> b8af38b6fc161691d637631d9ce8ab84fb3d27c7 which was pre-firefly.  So I'm
>> >> not sure exactly why an election would be restarting the sync in your
>> >> case..
>> >>
>> >> You mentioned elsewhere that your mon store was very large, though (more
>> >> than 10's of GB), which suggests you might be hitting the
>> >> min_last_epoch_clean problem (which prevents osdmap trimming).. see
>> >> b41408302b6529a7856a3b0a08c35e5fa284882e.  This was backported to hammer
>> >> and firefly but not giant.
>> >>
>> >> sage
>> >>
>>
>>


Re: Newly added monitor infinitely sync store

2015-11-16 Thread Guang Yang
I spoke to a leveldb expert; it looks like this is a known pattern with
LSM-tree data structures - the tail latency for a range scan can be far
longer than the avg/median since it might need to mmap several sst files
to get the record.

Hi Sage,
Do you see any harm in increasing the default value for this setting
(e.g. to 20 minutes)? Or should I add this advice to the monitor
troubleshooting docs?
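
For reference, the override discussed here would look like this in
ceph.conf (a sketch; 20 minutes is 1200 seconds, against a default of 60):

[mon]
    mon sync timeout = 1200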

Thanks,
Guang

On Fri, Nov 13, 2015 at 9:07 PM, Guang Yang  wrote:
> Thanks Sage! I will definitely try those patches.
>
> For this one, I finally managed to bring the new monitor in by
> increasing the mon_sync_timeout from its default 60 to 6 to make
> sure the syncing does not restart and result in an infinite loop..
>
> On Fri, Nov 13, 2015 at 5:04 PM, Sage Weil  wrote:
>> On Fri, 13 Nov 2015, Guang Yang wrote:
>>> Thanks Sage!
>>>
>>> On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil  wrote:
>>> > On Fri, 13 Nov 2015, Guang Yang wrote:
>>> >> I was wrong the previous analysis, it was not the iterator got reset,
>>> >> the problem I can see now, is that during the syncing, a new round of
>>> >> election kicked off and thus it needs to probe the newly added
>>> >> monitor, however, since it hasn't been synced yet, it will restart the
>>> >> syncing from there.
>>> >
>>> > What version of this?  I think this is something we fixed a while back?
>>> This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578); is there
>>> a commit I can take a look at?
>>
>> Hrm, I guess it was way before that.. I'm thinking of
>> b8af38b6fc161691d637631d9ce8ab84fb3d27c7 which was pre-firefly.  So I'm
>> not sure exactly why an election would be restarting the sync in your
>> case..
>>
>> You mentioned elsewhere that your mon store was very large, though (more
>> than 10's of GB), which suggests you might be hitting the
>> min_last_epoch_clean problem (which prevents osdmap trimming).. see
>> b41408302b6529a7856a3b0a08c35e5fa284882e.  This was backported to hammer
>> and firefly but not giant.
>>
>> sage
>>


Re: Newly added monitor infinitely sync store

2015-11-13 Thread Guang Yang
Thanks Sage! I will definitely try those patches.

For this one, I finally managed to bring the new monitor in by
increasing the mon_sync_timeout from its default 60 to 6 to make
sure the syncing does not restart and result in an infinite loop..

On Fri, Nov 13, 2015 at 5:04 PM, Sage Weil  wrote:
> On Fri, 13 Nov 2015, Guang Yang wrote:
>> Thanks Sage!
>>
>> On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil  wrote:
>> > On Fri, 13 Nov 2015, Guang Yang wrote:
>> >> I was wrong the previous analysis, it was not the iterator got reset,
>> >> the problem I can see now, is that during the syncing, a new round of
>> >> election kicked off and thus it needs to probe the newly added
>> >> monitor, however, since it hasn't been synced yet, it will restart the
>> >> syncing from there.
>> >
>> > What version of this?  I think this is something we fixed a while back?
>> This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578); is there
>> a commit I can take a look at?
>
> Hrm, I guess it was way before that.. I'm thinking of
> b8af38b6fc161691d637631d9ce8ab84fb3d27c7 which was pre-firefly.  So I'm
> not sure exactly why an election would be restarting the sync in your
> case..
>
> You mentioned elsewhere that your mon store was very large, though (more
> than 10's of GB), which suggests you might be hitting the
> min_last_epoch_clean problem (which prevents osdmap trimming).. see
> b41408302b6529a7856a3b0a08c35e5fa284882e.  This was backported to hammer
> and firefly but not giant.
>
> sage
>


Re: Newly added monitor infinitely sync store

2015-11-13 Thread Guang Yang
Thanks Sage!

On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil  wrote:
> On Fri, 13 Nov 2015, Guang Yang wrote:
>> I was wrong in my previous analysis; it was not that the iterator got
>> reset. The problem I can see now is that during the syncing, a new round
>> of election kicked off and thus it needs to probe the newly added
>> monitor; however, since it hasn't been synced yet, it will restart the
>> syncing from there.
>
> What version of this?  I think this is something we fixed a while back?
This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578); is there
a commit I can take a look at?

>
>> Hi Sage and Joao,
>> Is there a way to freeze elections via some tunable to let the sync finish?
>
> We can't not do elections when something is asking for one (e.g., mon
> is down).
I see. Is there an operational workaround we could try? From the log, I
found the election was triggered by an accept timeout, so I increased
the timeout value to hopefully avoid elections during syncing - does
that sound like a workaround?
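
For the record, a sketch of that workaround in ceph.conf (the option name
is my assumption based on the accept timeout mentioned above; please
verify it against your version's monitor settings):

[mon]
    mon accept timeout = 60    ; default is 10 seconds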
>
> sage
>
>
>
>>
>> Thanks,
>> Guang
>>
>> On Fri, Nov 13, 2015 at 9:00 AM, Guang Yang  wrote:
>> > Hi Joao,
>> > We have a problem when trying to add new monitors to an unhealthy
>> > cluster, and I would like to ask for your suggestion.
>> >
>> > After adding the new monitor, it started syncing the store and went
>> > into an infinite loop:
>> >
>> > 2015-11-12 21:02:23.499510 7f1e8030e700 10
>> > mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
>> > cookie 4513071120 lc 14697737 bl 929616 bytes last_key
>> > osdmap,full_22530) v2
>> > 2015-11-12 21:02:23.712944 7f1e8030e700 10
>> > mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
>> > cookie 4513071120 lc 14697737 bl 799897 bytes last_key
>> > osdmap,full_3259) v2
>> >
>> >
>> > We talked early in the morning on IRC, and at the time I thought it
>> > was because the osdmap epoch was increasing, which led to this
>> > infinite loop.
>> >
>> > I then set the nobackfill/norecovery flags and the osdmap epoch
>> > froze; however, the problem is still there.
>> >
>> > While the osdmap epoch is 22531, the switch always happened at
>> > osdmap.full_22530 (as shown by the above log).
>> >
>> > Looking at the code on both sides, it looks like this check
>> > (https://github.com/ceph/ceph/blob/master/src/mon/Monitor.cc#L1389)
>> > is always true, and I can confirm from the log that (sp.last_commited <
>> > paxos->get_version()) was false, so presumably the sp.synchronizer
>> > always has a next chunk?
>> >
>> > Does this look familiar to you? Or is there any other troubleshooting
>> > I can try? Thanks very much.
>> >
>> > Thanks,
>> > Guang


Newly added monitor infinitely sync store

2015-11-13 Thread Guang Yang
I was wrong in my previous analysis; it was not that the iterator got
reset. The problem I can see now is that during the syncing, a new round
of election kicked off and thus it needs to probe the newly added
monitor; however, since it hasn't been synced yet, it will restart the
syncing from there.

Hi Sage and Joao,
Is there a way to freeze elections via some tunable to let the sync finish?

Thanks,
Guang

On Fri, Nov 13, 2015 at 9:00 AM, Guang Yang  wrote:
> Hi Joao,
> We have a problem when trying to add new monitors to an unhealthy
> cluster, and I would like to ask for your suggestion.
>
> After adding the new monitor, it started syncing the store and went
> into an infinite loop:
>
> 2015-11-12 21:02:23.499510 7f1e8030e700 10
> mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
> cookie 4513071120 lc 14697737 bl 929616 bytes last_key
> osdmap,full_22530) v2
> 2015-11-12 21:02:23.712944 7f1e8030e700 10
> mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
> cookie 4513071120 lc 14697737 bl 799897 bytes last_key
> osdmap,full_3259) v2
>
>
> We talked early in the morning on IRC, and at the time I thought it
> was because the osdmap epoch was increasing, which led to this
> infinite loop.
>
> I then set the nobackfill/norecovery flags and the osdmap epoch
> froze; however, the problem is still there.
>
> While the osdmap epoch is 22531, the switch always happened at
> osdmap.full_22530 (as shown by the above log).
>
> Looking at the code on both sides, it looks like this check
> (https://github.com/ceph/ceph/blob/master/src/mon/Monitor.cc#L1389)
> is always true, and I can confirm from the log that (sp.last_commited <
> paxos->get_version()) was false, so presumably the sp.synchronizer
> always has a next chunk?
>
> Does this look familiar to you? Or is there any other troubleshooting
> I can try? Thanks very much.
>
> Thanks,
> Guang


[no subject]

2015-11-13 Thread Guang Yang
Hi Joao,
We have a problem when trying to add new monitors to an unhealthy
cluster, and I would like to ask for your suggestion.

After adding the new monitor, it started syncing the store and went
into an infinite loop:

2015-11-12 21:02:23.499510 7f1e8030e700 10
mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
cookie 4513071120 lc 14697737 bl 929616 bytes last_key
osdmap,full_22530) v2
2015-11-12 21:02:23.712944 7f1e8030e700 10
mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
cookie 4513071120 lc 14697737 bl 799897 bytes last_key
osdmap,full_3259) v2


We talked early in the morning on IRC, and at the time I thought it
was because the osdmap epoch was increasing, which led to this
infinite loop.

I then set the nobackfill/norecovery flags and the osdmap epoch
froze; however, the problem is still there.

While the osdmap epoch is 22531, the switch always happened at
osdmap.full_22530 (as shown by the above log).

Looking at the code on both sides, it looks like this check
(https://github.com/ceph/ceph/blob/master/src/mon/Monitor.cc#L1389)
is always true, and I can confirm from the log that (sp.last_commited <
paxos->get_version()) was false, so presumably the sp.synchronizer
always has a next chunk?
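
To make the suspected loop condition concrete, here is a self-contained
paraphrase of that check (a simplification following the linked code, not
the exact source):

#include <cstdint>

typedef uint64_t version_t;

// The provider keeps sending chunks while either condition holds, so a
// synchronizer that always reports another chunk never lets the sync finish.
bool should_send_another_chunk(version_t sp_last_committed,
                               version_t paxos_version,
                               bool synchronizer_has_next_chunk)
{
  return sp_last_committed < paxos_version || synchronizer_has_next_chunk;
}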

Does this look familiar to you? Or is there any other troubleshooting
I can try? Thanks very much.

Thanks,
Guang


Re: Symbolic links like feature on radosgw

2015-11-02 Thread Guang Yang
Hi Yehuda,
We have a user requirement for a symbolic-link-like feature on
radosgw - two object ids pointing to the same object (ideally it could
cross buckets, but same bucket is fine).

The closest feature on Amazon S3 I could find is [1], but it is not
exactly the same; the one from the Amazon S3 API was designed for static
web site hosting.

Is this a valid feature request to put into radosgw? The way I am
thinking of implementing it is like a symbolic link: the link object
just contains a pointer to the original object.

 [1] http://docs.aws.amazon.com/AmazonS3/latest/dev/how-to-page-redirect.html
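
As a starting point, a sketch of what the link object's payload might hold
(a hypothetical structure, not an existing rgw type):

#include <string>

// The link object stores no data of its own, only a pointer to the target;
// a GET on the link would be redirected to the target object.
struct rgw_obj_link {
  std::string target_bucket;  // may equal the link's own bucket
  std::string target_object;  // object id the link resolves to
};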

--
Regards,
Guang


Re: Weekly performance meeting

2014-09-26 Thread Guang Yang

On Sep 26, 2014, at 9:12 PM, Mark Nelson  wrote:

> On 09/25/2014 09:47 PM, Guang Yang wrote:
>> Hi Sage,
>> We are very interested in joining (and contributing effort) as well.
>> Following is a list of issues we have particular interest in:
>>  1> A large number of small files brings performance degradation, mostly
>> due to file system lookup (even worse with EC).
> 
> Have you tried decreasing vfs_cache_pressure to retain dentries and inodes in
> cache?  I've had good luck improving performance for medium-sized IO workloads
> doing this.
Yeah, we changed the setting from its default value of 100 to 20, and it
turned out to improve the dentry/inode cache (we also tried setting it to 1
but got OOM under some traffic patterns). Even with the setting change, given
that the object size is several hundred KB, we still observed lookup misses
which increase latency; this became worse when we turned to EC because:
1) there are more files on each system, and 2) the long tail determines the
latency.
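
For reference, the tunable discussed here is the Linux VM sysctl:

# runtime change
sysctl vm.vfs_cache_pressure=20
# persistent, in /etc/sysctl.conf:
#   vm.vfs_cache_pressure = 20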
> 
>>  2> The messenger uses too many threads, which burdens high-density
>> hardware (where I believe Haomai already has great progress).
> 
> Yes, the biggest thing on my personal wish list has been to move to a hybrid
> threading/event processing model.
> 
>> 
>> Thanks,
>> Guang
>> 
>> On Sep 26, 2014, at 2:27 AM, Sage Weil  wrote:
>> 
>>> Hi everyone,
>>> 
>>> A number of people have approached me about how to get more involved with
>>> the current work on improving performance and how to better coordinate
>>> with other interested parties.  A few meetings have taken place offline
>>> with good results but only a few interested parties were involved.
>>> 
>>> Ideally, we'd like to move as much of this discussion into the public
>>> forums: ceph-devel@vger.kernel.org and #ceph-devel.  That isn't always
>>> sufficient, however.  I'd like to also set up a regular weekly meeting
>>> using google hangouts or bluejeans so that all interested parties can
>>> share progress.  There are a lot of things we can do during the Hammer
>>> cycle to improve things but it will require some coordination of effort.
>>> 
>>> Among other things, we can discuss:
>>> 
>>> - observed performance limitations
>>> - high level strategies for addressing them
>>> - proposed patch sets and their performance impact
>>> - anything else that will move us forward
>>> 
>>> One challenge is timezones: there are developers in the US, China, Europe,
>>> and Israel who may want to join.  As a starting point, how about next
>>> Wednesday, 15:00 UTC?  If I didn't do my tz math wrong, that's
>>> 
>>>  8:00 (PDT, California)
>>> 15:00 (UTC)
>>> 18:00 (IDT, Israel)
>>> 23:00 (CST, China)
>>> 
>>> That is surely not the ideal time for everyone but it can hopefully be a
>>> starting point.
>>> 
>>> I've also created an etherpad for collecting discussion/agenda items at
>>> 
>>> http://pad.ceph.com/p/performance_weekly
>>> 
>>> Is there interest here?  Please let everyone know if you are actively
>>> working in this area and/or would like to join, and update the pad above
>>> with the topics you would like to discuss.
>>> 
>>> Thanks!
>>> sage
>>> 
>> 
> 
> 



Re: Weekly performance meeting

2014-09-25 Thread Guang Yang
Hi Sage,
We are very interested in joining (and contributing effort) as well.
Following is a list of issues we have particular interest in:
 1> A large number of small files brings performance degradation, mostly due
to file system lookup (even worse with EC).
 2> The messenger uses too many threads, which burdens high-density hardware
(where I believe Haomai already has great progress).

Thanks,
Guang

On Sep 26, 2014, at 2:27 AM, Sage Weil  wrote:

> Hi everyone,
> 
> A number of people have approached me about how to get more involved with 
> the current work on improving performance and how to better coordinate 
> with other interested parties.  A few meetings have taken place offline 
> with good results but only a few interested parties were involved.
> 
> Ideally, we'd like to move as much of this discussion into the public
> forums: ceph-devel@vger.kernel.org and #ceph-devel.  That isn't always 
> sufficient, however.  I'd like to also set up a regular weekly meeting 
> using google hangouts or bluejeans so that all interested parties can 
> share progress.  There are a lot of things we can do during the Hammer 
> cycle to improve things but it will require some coordination of effort.
> 
> Among other things, we can discuss:
> 
> - observed performance limitations
> - high level strategies for addressing them
> - proposed patch sets and their performance impact
> - anything else that will move us forward
> 
> One challenge is timezones: there are developers in the US, China, Europe, 
> and Israel who may want to join.  As a starting point, how about next 
> Wednesday, 15:00 UTC?  If I didn't do my tz math wrong, that's
> 
>  8:00 (PDT, California)
> 15:00 (UTC)
> 18:00 (IDT, Israel)
> 23:00 (CST, China)
> 
> That is surely not the ideal time for everyone but it can hopefully be a 
> starting point.
> 
> I've also created an etherpad for collecting discussion/agenda items at
> 
>   http://pad.ceph.com/p/performance_weekly
> 
> Is there interest here?  Please let everyone know if you are actively 
> working in this area and/or would like to join, and update the pad above 
> with the topics you would like to discuss.
> 
> Thanks!
> sage
> 



RGW threads hung - more logs

2014-09-11 Thread Guang Yang
Hi Sage, Sam and Greg,
Regarding the radosgw hang issue we discussed today, I finally got some more
logs showing that the reply message had been received by radosgw but failed
to be dispatched because the dispatcher thread was hung. I put all the logs
into the tracker - http://tracker.ceph.com/issues/9008

While the logs explain what we observed, I failed to find any clue as to why
the dispatcher would need to wait for objecter_bytes throttler budget; did I
miss anything obvious here?

Tracker link - http://tracker.ceph.com/issues/9008

Thanks,
Guang


Issue - 8907

2014-08-28 Thread Guang Yang
Hi Loic,
Can you help take a quick look at this issue -
http://tracker.ceph.com/issues/8907 - was it a design choice due to
consistency concerns?

Thanks,
Guang


Re: OSD suicide after being down/in for one day as it needs to search large amount of objects

2014-08-20 Thread Guang Yang
Thanks Sage. We will provide a patch based on this.

Thanks,
Guang

On Aug 20, 2014, at 11:19 PM, Sage Weil  wrote:

> On Wed, 20 Aug 2014, Guang Yang wrote:
>> Thanks Greg.
>> On Aug 20, 2014, at 6:09 AM, Gregory Farnum  wrote:
>> 
>>> On Mon, Aug 18, 2014 at 11:30 PM, Guang Yang  wrote:
>>>> Hi ceph-devel,
>>>> David (cc'ed) reported a bug (http://tracker.ceph.com/issues/9128) which
>>>> we came across in our test cluster during our failure testing; basically
>>>> the way to reproduce it was to leave one OSD daemon down and in for a day
>>>> while continuing to send write traffic. When the OSD daemon was started
>>>> again, it hit the suicide timeout and killed itself.
>>>> 
>>>> After some analysis (details in the bug), David found that the op thread
>>>> was busy searching for missing objects, and once the volume to search
>>>> increases, the thread is expected to work that long; please refer to the
>>>> bug for detailed logs.
>>> 
>>> Can you talk a little more about what's going on here? At a quick
>>> naive glance, I'm not seeing why leaving an OSD down and in should
>>> require work based on the amount of write traffic. Perhaps if the rest
>>> of the cluster was changing mappings...?
>> We increased the down-to-out time interval from 5 minutes to 2 days to
>> avoid migrating data back and forth (which could increase latency), so
>> that we mark OSDs out manually instead. To achieve this, we are testing
>> some boundary cases that leave the OSD down and in for about 1 day;
>> however, when we try to bring it up again, it always fails because it
>> hits the suicide timeout.
> 
> Looking at the log snippet I see the PG had log range
> 
>   5481'28667,5646'34066
> 
> Which is ~5500 log events.  The default max is 10k.  search_for_missing is 
> basically going to iterate over this list and check if the object is 
> present locally.
> 
> If that's slow enough to trigger a suicide (which it seems to be), the
> fix is simple: as Greg says we just need to make it probe the internal
> heartbeat code to indicate progress.  In most contexts this is done by 
> passing a ThreadPool::TPHandle &handle into each method and then 
> calling handle.reset_tp_timeout() on each iteration.  The same needs to be 
> done for search_for_missing...
> 
> sage
> 
> 
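
To make the suggestion above concrete, here is a self-contained sketch of
the pattern (stand-in types, not the actual Ceph code; the real
search_for_missing takes different arguments):

#include <chrono>
#include <vector>

// Stand-in for ThreadPool::TPHandle with the method name Sage mentions.
struct TPHandle {
  std::chrono::steady_clock::time_point deadline;
  std::chrono::seconds grace{15};
  void reset_tp_timeout() {
    // Push the watchdog deadline forward to signal progress.
    deadline = std::chrono::steady_clock::now() + grace;
  }
};

void search_for_missing(const std::vector<int>& log_entries, TPHandle& handle)
{
  for (int entry : log_entries) {
    handle.reset_tp_timeout();  // reset on every iteration of the long scan
    (void)entry;                // ... check whether the object exists locally
  }
}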



Re: OSD suicide after being down/in for one day as it needs to search large amount of objects

2014-08-20 Thread Guang Yang
Thanks Greg.
On Aug 20, 2014, at 6:09 AM, Gregory Farnum  wrote:

> On Mon, Aug 18, 2014 at 11:30 PM, Guang Yang  wrote:
>> Hi ceph-devel,
>> David (cc'ed) reported a bug (http://tracker.ceph.com/issues/9128) which we
>> came across in our test cluster during our failure testing; basically the
>> way to reproduce it was to leave one OSD daemon down and in for a day while
>> continuing to send write traffic. When the OSD daemon was started again, it
>> hit the suicide timeout and killed itself.
>> 
>> After some analysis (details in the bug), David found that the op thread was
>> busy searching for missing objects, and once the volume to search increases,
>> the thread is expected to work that long; please refer to the bug for
>> detailed logs.
> 
> Can you talk a little more about what's going on here? At a quick
> naive glance, I'm not seeing why leaving an OSD down and in should
> require work based on the amount of write traffic. Perhaps if the rest
> of the cluster was changing mappings…?
We increased the down-to-out time interval from 5 minutes to 2 days to avoid
migrating data back and forth (which could increase latency), so that we mark
OSDs out manually instead. To achieve this, we are testing some boundary
cases that leave the OSD down and in for about 1 day; however, when we try to
bring it up again, it always fails because it hits the suicide timeout.
> 
>> 
>> One simple fix is to let the op thread reset the suicide timeout
>> periodically when it is doing long-running work; another fix might be to
>> cut the work into smaller pieces?
> 
> We do both of those things throughout the OSD (although I think the
> first is simpler and more common); search for the accesses to
> cct->get_heartbeat_map()->reset_timeout.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 



OSD suicide after being down/in for one day as it needs to search large amount of objects

2014-08-18 Thread Guang Yang
Hi ceph-devel,
David (cc'ed) reported a bug (http://tracker.ceph.com/issues/9128) which we
came across in our test cluster during our failure testing; basically the way
to reproduce it was to leave one OSD daemon down and in for a day while
continuing to send write traffic. When the OSD daemon was started again, it
hit the suicide timeout and killed itself.

After some analysis (details in the bug), David found that the op thread was
busy searching for missing objects, and once the volume to search increases,
the thread is expected to work that long; please refer to the bug for
detailed logs.

One simple fix is to let the op thread reset the suicide timeout periodically
when it is doing long-running work; another fix might be to cut the work into
smaller pieces?

Any suggestion is welcome.

Thanks,
Guang


Re: assert failure

2014-08-17 Thread Guang Yang
Hi Huamin,
Then it might be a totally different issue than the one I mentioned below;
please file a bug at http://tracker.ceph.com/ with more details (the log
before the daemon crashed).

Thanks,
Guang

On Aug 16, 2014, at 5:36 AM, Huamin Chen  wrote:

> Thanks. I was running a single-node ceph fs cluster on a VM. Each time the VM
> is created, it downloads the latest bits and runs unit tests. There are many
> mounts and unmounts during the tests.
> This issue can be reliably reproduced in one of these tests.
> 
> The test info can be found 
> 
> 
> - Original Message -
> From: "Guang Yang" 
> To: "Huamin Chen" 
> Cc: "Ceph-devel" 
> Sent: Friday, August 15, 2014 2:23:12 PM
> Subject: Re: assert failure
> 
> + ceph-devel.
> 
> Hi Huamin,
> Did you upgrade the entire cluster to v0.80.5? If I remember correctly, if 
> its peer has the old version, it could crash the new version as well.
> 
> Thanks,
> Guang
> 
> On Aug 14, 2014, at 11:21 PM, Huamin Chen  wrote:
> 
>> Bad news, still there ...
>> msg/Pipe.cc: In function 'int Pipe::connect()' thread 7f30c4511700 time 
>> 2014-08-14 15:16:44.659312
>> msg/Pipe.cc: 1080: FAILED assert(m)
>> ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
>> 1: (Pipe::connect()+0x3d0c) [0x7f327552a2ac]
>> 2: (Pipe::writer()+0x9f3) [0x7f327552aff3]
>> 3: (Pipe::Writer::entry()+0xd) [0x7f327553748d]
>> 4: (()+0x79d1) [0x7f32953449d1]
>> 5: (clone()+0x6d) [0x7f3294c89b5d]
>> NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
>> interpret this.
>> terminate called after throwing an instance of 'ceph::FailedAssertion'
>> 
>> Attached please find all related logs
>> 
>> - Original Message -
>> From: "Guang Yang" 
>> To: "Huamin Chen" 
>> Cc: ceph-devel@vger.kernel.org
>> Sent: Wednesday, August 13, 2014 10:39:10 PM
>> Subject: Re: assert failure
>> 
>> Hi Huamin,
>> At least one known issue in 0.80.1 with the same failing pattern has been 
>> fixed in the latest 0.80.4 release of firefly. Here is the tracking ticket - 
>> http://tracker.ceph.com/issues/8232.
>> 
>> Can you compare the log snippets from within the bug and see if they are the 
>> same issue?
>> 
>> Thanks,
>> Guang
>> 
>> On Aug 14, 2014, at 4:29 AM, Huamin Chen  wrote:
>> 
>>> Is the following assert failure a known issue?
>>> 
>>> msg/Pipe.cc: In function 'int Pipe::connect()' thread 7fed3d2dd700 time 
>>> 2014-08-13 16:26:06.039799
>>> msg/Pipe.cc: 1070: FAILED assert(m)
>>> ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>>> 1: (Pipe::connect()+0x390e) [0x7feee89cf99e]
>>> 2: (Pipe::writer()+0x511) [0x7feee89d0fd1]
>>> 3: (Pipe::Writer::entry()+0xd) [0x7feee89d5d0d]
>>> 4: (()+0x7df3) [0x7fef336cadf3]
>>> 5: (clone()+0x6d) [0x7fef32fe63dd]
>>> NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
>>> interpret this.
>>> 
>> 
>> 
> 
> 



Re: assert failure

2014-08-15 Thread Guang Yang
+ ceph-devel.

Hi Huamin,
Did you upgrade the entire cluster to v0.80.5? If I remember correctly, if its 
peer has the old version, it could crash the new version as well.

Thanks,
Guang

On Aug 14, 2014, at 11:21 PM, Huamin Chen  wrote:

> Bad news, still there ...
> msg/Pipe.cc: In function 'int Pipe::connect()' thread 7f30c4511700 time 
> 2014-08-14 15:16:44.659312
> msg/Pipe.cc: 1080: FAILED assert(m)
> ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
> 1: (Pipe::connect()+0x3d0c) [0x7f327552a2ac]
> 2: (Pipe::writer()+0x9f3) [0x7f327552aff3]
> 3: (Pipe::Writer::entry()+0xd) [0x7f327553748d]
> 4: (()+0x79d1) [0x7f32953449d1]
> 5: (clone()+0x6d) [0x7f3294c89b5d]
> NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
> terminate called after throwing an instance of 'ceph::FailedAssertion'
> 
> Attached please find all related logs
> 
> - Original Message -
> From: "Guang Yang" 
> To: "Huamin Chen" 
> Cc: ceph-devel@vger.kernel.org
> Sent: Wednesday, August 13, 2014 10:39:10 PM
> Subject: Re: assert failure
> 
> Hi Huamin,
> At least one known issue in 0.80.1 with the same failing pattern has been 
> fixed in the latest 0.80.4 release of firefly. Here is the tracking ticket - 
> http://tracker.ceph.com/issues/8232.
> 
> Can you compare the log snippets from within the bug and see if they are the 
> same issue?
> 
> Thanks,
> Guang
> 
> On Aug 14, 2014, at 4:29 AM, Huamin Chen  wrote:
> 
>> Is the following assert failure a known issue?
>> 
>> msg/Pipe.cc: In function 'int Pipe::connect()' thread 7fed3d2dd700 time 
>> 2014-08-13 16:26:06.039799
>> msg/Pipe.cc: 1070: FAILED assert(m)
>> ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>> 1: (Pipe::connect()+0x390e) [0x7feee89cf99e]
>> 2: (Pipe::writer()+0x511) [0x7feee89d0fd1]
>> 3: (Pipe::Writer::entry()+0xd) [0x7feee89d5d0d]
>> 4: (()+0x7df3) [0x7fef336cadf3]
>> 5: (clone()+0x6d) [0x7fef32fe63dd]
>> NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
>> interpret this.
>> 
> 
> 



OSD disk replacement best practise

2014-08-14 Thread Guang Yang
Hi cephers,
Most recently I have been drafting the runbooks for OSD disk replacement.
I think the rule of thumb is to reduce data migration (recovery/backfill),
and I thought the following procedure should achieve that purpose:
  1. ceph osd out osd.XXX (mark it out to trigger data migration)
  2. ceph osd rm osd.XXX
  3. ceph auth rm osd.XXX
  4. provision a new OSD which will take XXX as the OSD id and migrate data 
back.

With the above procedure, the crush weight of the host never changes, so we
can limit the data migration to only what is necessary.

Does it make sense?
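
For a hypothetical osd.12, a run of the procedure would look like this
(step 4 depends on your provisioning tooling):

ceph osd out osd.12     # trigger data migration off the disk
# wait for recovery/backfill to finish
ceph osd rm osd.12
ceph auth rm osd.12
# provision the replacement disk so it comes back as osd.12; because the
# host's crush weight never changed, only the necessary PGs migrate back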

Thanks,
Guang


Re: assert failure

2014-08-13 Thread Guang Yang
Hi Huamin,
At least one known issue in 0.80.1 with the same failing pattern has been fixed 
in the latest 0.80.4 release of firefly. Here is the tracking ticket - 
http://tracker.ceph.com/issues/8232.

Can you compare the log snippets from within the bug and see if they are the 
same issue?

Thanks,
Guang

On Aug 14, 2014, at 4:29 AM, Huamin Chen  wrote:

> Is the following assert failure a known issue?
> 
> msg/Pipe.cc: In function 'int Pipe::connect()' thread 7fed3d2dd700 time 
> 2014-08-13 16:26:06.039799
> msg/Pipe.cc: 1070: FAILED assert(m)
> ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
> 1: (Pipe::connect()+0x390e) [0x7feee89cf99e]
> 2: (Pipe::writer()+0x511) [0x7feee89d0fd1]
> 3: (Pipe::Writer::entry()+0xd) [0x7feee89d5d0d]
> 4: (()+0x7df3) [0x7fef336cadf3]
> 5: (clone()+0x6d) [0x7fef32fe63dd]
> NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
> 



Re: bucket index sharding - IO throttle

2014-08-12 Thread Guang Yang
Hi Yehuda,
Can you help review the latest patch with the throttle mechanism you
suggested? Thanks!

Thanks,
Guang
On Aug 4, 2014, at 3:20 PM, Guang Yang  wrote:

> Hi Yehuda,
> Here is the new pull request - https://github.com/ceph/ceph/pull/2187
> 
> Thanks,
> Guang
> On Jul 31, 2014, at 10:40 PM, Guang Yang  wrote:
> 
> Thanks Yehuda. I will do that (sorry, I was occupied by some other stuff
> recently, but I will try my best to provide a patch as soon as possible).
>> 
>> Thanks,
>> Guang
>> 
>> On Jul 31, 2014, at 1:00 AM, Yehuda Sadeh  wrote:
>> 
>>> Can you send this code through a github pull request (or at least as a
>>> patch)? It'll be easier to review and comment.
>>> 
>>> Thanks,
>>> Yehuda
>>> 
>>> On Wed, Jul 30, 2014 at 7:58 AM, Guang Yang  wrote:
>>>> +ceph-devel.
>>>> 
>>>> Thanks,
>>>> Guang
>>>> 
>>>> On Jul 29, 2014, at 10:20 PM, Guang Yang  wrote:
>>>> 
>>>>> Hi Yehuda,
>>>>> Per your review comment regarding IO throttling for bucket index
>>>>> operations, I prototyped the code below (details still need polish);
>>>>> can you take a look and see whether this is the right way to go?
>>>>> 
>>>>> Another problem I came across is that
>>>>> ClsBucketIndexOpCtx::handle_completion was not called for the bucket
>>>>> index init op (below); is there anything obvious I missed here?
>>>>> 
>>>>> Thanks,
>>>>> Guang
>>>>> 
>>>>> 
>>>>> class ClsBucketIndexAioThrottler {
>>>>> protected:
>>>>> int completed;
>>>>> int ret_code;
>>>>> IoCtx& io_ctx;
>>>>> Mutex lock;
>>>>> struct LockCond {
>>>>> Mutex lock;
>>>>> Cond cond;
>>>>> LockCond() : lock("LockCond"), cond() {}
>>>>> } lock_cond;
>>>>> public:
>>>>> ClsBucketIndexAioThrottler(IoCtx& _io_ctx)
>>>>> : completed(0), ret_code(0), io_ctx(_io_ctx),
>>>>> lock("ClsBucketIndexAioThrottler"), lock_cond() {}
>>>>> 
>>>>> virtual ~ClsBucketIndexAioThrottler() {}
>>>>> virtual void do_next() = 0;
>>>>> virtual bool is_completed () = 0;
>>>>> 
>>>>> void complete(int ret) {
>>>>> {
>>>>>   Mutex::Locker l(lock);
>>>>>   if (ret < 0)
>>>>> ret_code = ret;
>>>>>   ++completed;
>>>>> }
>>>>> 
>>>>> lock_cond.lock.Lock();
>>>>> lock_cond.cond.Signal();
>>>>> lock_cond.lock.Unlock();
>>>>> }
>>>>> 
>>>>> int get_ret_code () {
>>>>> Mutex::Locker l(lock);
>>>>> return ret_code;
>>>>> }
>>>>> 
>>>>> virtual int wait_completion() {
>>>>> lock_cond.lock.Lock();
>>>>> while (1) {
>>>>>   if (is_completed()) {
>>>>> lock_cond.lock.Unlock();
>>>>> return ret_code;
>>>>>   }
>>>>>   // Cond::Wait() re-acquires the mutex before returning, so no
>>>>>   // extra Lock() is needed here (a second Lock() would deadlock).
>>>>>   lock_cond.cond.Wait(lock_cond.lock);
>>>>> }
>>>>> }
>>>>> };
>>>>> 
>>>>> class ClsBucketIndexListAioThrottler : public ClsBucketIndexAioThrottler {
>>>>> protected:
>>>>> vector<string> bucket_objects;
>>>>> vector<string>::iterator iter_pos;
>>>>> public:
>>>>> ClsBucketIndexListAioThrottler(IoCtx& _io_ctx, const vector<string>&
>>>>> _bucket_objs)
>>>>> : ClsBucketIndexAioThrottler(_io_ctx), bucket_objects(_bucket_objs),
>>>>> iter_pos(bucket_objects.begin()) {}
>>>>> 
>>>>> virtual bool is_completed() {
>>>>> Mutex::Locker l(lock);
>>>>> int sent = 0;
>>>>> vector<string>::iterator iter = bucket_objects.begin();
>>>>> for (; iter != iter_pos; ++iter) ++sent;
>>>>> 
>>>>> return (sent == completed &&
>>>>> (iter_pos == bucket_objects.end() /*Success*/ || ret_code < 0 
>>>>> /*Failure*/));
>>>>> }
>>>>> };
>>>>> 
>>>>> template <typename T>
>>>>> class ClsBucketIndexOpCtx : public ObjectOperationCompletion {

Re: bucket index sharding - IO throttle

2014-08-06 Thread Guang Yang
Hi Osier,
I doubt the issue is related (the error message is a connection failure);
the patch below is pretty simple (and incomplete). What it does is add a
configuration option to the bucket meta info so that we can configure the
number of shards on a per-bucket basis (again, this is not included in the
patch).

The patch should be completely backward compatible, which means that if you
don't change the number-of-shards configuration, nothing should change for
bucket creation/listing.

My plan is to use this patch as a starting point for review, as the key
building blocks are included in it; once it passes review, I will create a
bunch of follow-up patches to implement the feature completely (these are
mostly done in the previous big patch - https://github.com/ceph/ceph/pull/2013).

I tested the patch locally in my cluster and it looks good for bucket creation.

Thanks,
Guang

On Aug 6, 2014, at 12:38 PM, Osier Yang  wrote:

> 
> On Aug 4, 2014, at 15:20, Guang Yang wrote:
>> Hi Yehuda,
>> Here is the new pull request - https://github.com/ceph/ceph/pull/2187
> 
> I simply applied the patches on top of git master, and the testing shows
> "rest-bench" is completely broken with the 2 patches:
> 
> 
> root@testing-s3gw0:~/s3-tests# /usr/bin/rest-bench
> --api-host=testing-s3gw0 --access-key=93EEF3F5O7VY89Q2GSWC
> --secret="lf2bwxiRf1e9/nrOTCZyN/HgTqCz7XwrB2LDocY1" --protocol=http
> --uri_style=path --bucket=cool0 --seconds=20 --concurrent-ios=50
> --block-size=204800 --show-time write
> host=testing-s3gw0
> 2014-08-06 12:28:56.500235 7f1336645780 -1 did not load config file,
> using default settings.
> ERROR: failed to create bucket: ConnectionFailed
> failed initializing benchmark
> 
> The related debug log entry:
> 
> 2014-08-06 12:29:48.137559 7fea62fcd700 20 state for
> obj=.rgw:.bucket.meta.rest-bench-bucket:default.9738.2 is not atomic,
> not appending atomic test
> 
> After a short time, all the memory was eaten up:
> 
> root@testing-s3gw0:~/s3-tests# /usr/bin/rest-bench
> --api-host=testing-s3gw0 --access-key=93EEF3F5O7VY89Q2GSWC
> --secret="lf2bwxiRf1e9/nrOTCZyN/HgTqCz7XwrB2LDocY1" --protocol=http
> --uri_style=path --seconds=20 --concurrent-ios=50 --block-size=204800
> --show-time write
> -bash: fork: Cannot allocate memory
> root@testing-s3gw0:~/s3-tests# /usr/bin/rest-bench
> --api-host=testing-s3gw0 --access-key=93EEF3F5O7VY89Q2GSWC
> --secret="lf2bwxiRf1e9/nrOTCZyN/HgTqCz7XwrB2LDocY1" --protocol=http
> --uri_style=path --seconds=20 --concurrent-ios=50 --block-size=204800
> --show-time write
> -bash: fork: Cannot allocate memory
> root@testing-s3gw0:~/s3-tests# free
> -bash: fork: Cannot allocate memory
> 
> A few minutes later, the VM was completely unresponsive, and I had to
> destroy it and restart again.
> 
> Guang, how was your testing when creating the patches?
> 
>> 
>> 
>> Thanks,
>> Guang
>> On Jul 31, 2014, at 10:40 PM, Guang Yang  wrote:
>> 
>>> Thanks Yehuda. I will do that (sorry I was occupied by some other stuff 
>>> recently but I will try my best to provide a patch as soon as possible).
>>> 
>>> Thanks,
>>> Guang
>>> 
>>> On Jul 31, 2014, at 1:00 AM, Yehuda Sadeh  wrote:
>>> 
>>>> Can you send this code through a github pull request (or at least as a
>>>> patch)? It'll be easier to review and comment.
>>>> 
>>>> Thanks,
>>>> Yehuda
>>>> 
>>>> On Wed, Jul 30, 2014 at 7:58 AM, Guang Yang  wrote:
>>>>> +ceph-devel.
>>>>> 
>>>>> Thanks,
>>>>> Guang
>>>>> 
>>>>> On Jul 29, 2014, at 10:20 PM, Guang Yang  wrote:
>>>>> 
>>>>>> Hi Yehuda,
>>>>>> Per your review comment regarding IO throttling for bucket index
>>>>>> operations, I prototyped the code below (details still need polish);
>>>>>> can you take a look and see whether this is the right way to go?
>>>>>> 
>>>>>> Another problem I came across is that
>>>>>> ClsBucketIndexOpCtx::handle_completion was not called for the bucket
>>>>>> index init op (below); is there anything obvious I missed here?
>>>>>> 
>>>>>> Thanks,
>>>>>> Guang
>>>>>> 
>>>>>> 
>>>>>> class ClsBucketIndexAioThrottler {
>>>>>> protected:
>>>>>> int completed;
>>>>>> int ret_code;
>>>>>> IoCtx& io_ctx;
>>>>>> Mutex lock;
>>>>>> struct LockCond {

Re: KeyFileStore ?

2014-08-04 Thread Guang Yang
On Aug 2, 2014, at 5:34 AM, Samuel Just  wrote:

> Sage's basic approach sounds about right to me.  I'm fairly skeptical
> about the benefits of packing small objects together within larger
> files, though.  It seems like for very small objects, we would be
> better off stashing the contents opportunistically within the onode.
I really like this idea. For the radosgw + EC use case, there are lots of
small physical files generated (multiple KBs), and when the OSD disk is
filled to a certain ratio, each read of one chunk can incur several disk
I/Os (path lookup and data); putting the data as part of the onode could
boost read performance and, at the same time, decrease the number of
physical files.
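
A sketch of the onode layout this implies (hypothetical fields, not the
actual proposal): small objects carry their payload inline, so a single kv
lookup returns both metadata and data with no file system path lookup at all.

#include <cstdint>
#include <map>
#include <string>

struct onode_t {
  uint64_t size = 0;                           // logical object size
  std::map<std::string, std::string> xattrs;   // object xattrs
  std::string inline_data;                     // non-empty only for small
                                               // objects stashed in the onode
};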
> For somewhat larger objects, it seems like the complexity of
> maintaining information about the larger pack objects would be
> equivalent to what the filesystem would do anyway.
> -Sam
> 
> On Fri, Aug 1, 2014 at 8:08 AM, Guang Yang  wrote:
>> I really like the idea; one scenario that keeps bothering us is that there
>> are too many small files, which makes file system indexing slow (so that a
>> single read request can take more than 10 disk IOs for path lookup).
>> 
>> If we pursue this proposal, is there a chance we can take it one step
>> further: instead of storing one physical file for each object, we allocate
>> a big file (tens of GB) and each object maps to a chunk within that big
>> file? That way all those big files' descriptors could be cached to avoid
>> disk I/O to open the file. At least we keep it flexible so that if someone
>> would like to implement it that way, there is a chance to leverage the
>> existing framework.
>> 
>> Thanks,
>> Guang
>> 
>> On Jul 31, 2014, at 1:25 PM, Sage Weil  wrote:
>> 
>>> After the latest set of bug fixes to the FileStore file naming code I am
>>> newly inspired to replace it with something less complex.  Right now I'm
>>> mostly thinking about HDDs, although some of this may map well onto hybrid
>>> SSD/HDD as well.  It may or may not make sense for pure flash.
>>> 
>>> Anyway, here are the main flaws with the overall approach that FileStore
>>> uses:
>>> 
>>> - It tries to maintain a direct mapping of object names to file names.
>>> This is problematic because of 255 character limits, rados namespaces, pg
>>> prefixes, and the pg directory hashing we do to allow efficient split, for
>>> starters.  It is also problematic because we often want to do things like
>>> rename but can't make it happen atomically in combination with the rest of
>>> our transaction.
>>> 
>>> - The PG directory hashing (that we do to allow efficient split) can have
>>> a big impact on performance, particularly when injesting lots of data.
>>> (And when benchmarking.)  It's also complex.
>>> 
>>> - We often overwrite or replace entire objects.  These are "easy"
>>> operations to do safely without doing complete data journaling, but the
>>> current design is not conducive to doing anything clever (and it's complex
>>> enough that I wouldn't want to add any cleverness on top).
>>> 
>>> - Objects may contain only key/value data, but we still have to create an
>>> inode for them and look that up first.  This only matters for some
>>> workloads (rgw indexes, cephfs directory objects).
>>> 
>>> Instead, I think we should try a hybrid approach that more heavily
>>> leverages a key/value db in combination with the file system.  The kv db
>>> might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just
>>> assume it provides transactional key/value storage and efficient range
>>> operations.  Here's the basic idea:
>>> 
>>> - The mapping from names to object lives in the kv db.  The object
>>> metadata is in a structure we can call an "onode" to avoid confusing it
>>> with the inodes in the backing file system.  The mapping is simple
>>> ghobject_t -> onode map; there is no PG collection.  The PG collection
>>> still exist but really only as ranges of those keys.  We will need to be
>>> slightly clever with the coll_t to distinguish between "bare" PGs (that
>>> live in this flat mapping) and the other collections (*_temp and
>>> metadata), but that should be easy.  This makes PG splitting "free" as far
>>> as the objects go.
>>> 
>>> - The onodes are relatively small.  They will contain the xattrs and
>>> basic metadata like object size.  They will also iden

Re: bucket index sharding - IO throttle

2014-08-04 Thread Guang Yang
Hi Yehuda,
Here is the new pull request - https://github.com/ceph/ceph/pull/2187

Thanks,
Guang
On Jul 31, 2014, at 10:40 PM, Guang Yang  wrote:

> Thanks Yehuda. I will do that (sorry I was occupied by some other stuff 
> recently but I will try my best to provide a patch as soon as possible).
> 
> Thanks,
> Guang
> 
> On Jul 31, 2014, at 1:00 AM, Yehuda Sadeh wrote:
> 
>> Can you send this code through a github pull request (or at least as a
>> patch)? It'll be easier to review and comment.
>> 
>> Thanks,
>> Yehuda
>> 
>> On Wed, Jul 30, 2014 at 7:58 AM, Guang Yang  wrote:
>>> +ceph-devel.
>>> 
>>> Thanks,
>>> Guang
>>> 
>>> On Jul 29, 2014, at 10:20 PM, Guang Yang  wrote:
>>> 
>>>> Hi Yehuda,
>>>> Per your review comment regarding IO throttling for bucket index 
>>>> operations, I prototyped the code below (details still need polish); can 
>>>> you take a look to see whether this is the right way to go?
>>>> 
>>>> Another problem I came across is that 
>>>> ClsBucketIndexOpCtx::handle_completion was not called for the bucket index 
>>>> init op (below); is there anything obvious I missed here?
>>>> 
>>>> Thanks,
>>>> Guang
>>>> 
>>>> 
>>>> class ClsBucketIndexAioThrottler {
>>>> protected:
>>>> int completed;
>>>> int ret_code;
>>>> IoCtx& io_ctx;
>>>> Mutex lock;
>>>> struct LockCond {
>>>>  Mutex lock;
>>>>  Cond cond;
>>>>  LockCond() : lock("LockCond"), cond() {}
>>>> } lock_cond;
>>>> public:
>>>> ClsBucketIndexAioThrottler(IoCtx& _io_ctx)
>>>>  : completed(0), ret_code(0), io_ctx(_io_ctx),
>>>>  lock("ClsBucketIndexAioThrottler"), lock_cond() {}
>>>> 
>>>> virtual ~ClsBucketIndexAioThrottler() {}
>>>> virtual void do_next() = 0;
>>>> virtual bool is_completed () = 0;
>>>> 
>>>> void complete(int ret) {
>>>>  {
>>>>Mutex::Locker l(lock);
>>>>if (ret < 0)
>>>>  ret_code = ret;
>>>>++completed;
>>>>  }
>>>> 
>>>>  lock_cond.lock.Lock();
>>>>  lock_cond.cond.Signal();
>>>>  lock_cond.lock.Unlock();
>>>> }
>>>> 
>>>> int get_ret_code () {
>>>>  Mutex::Locker l(lock);
>>>>  return ret_code;
>>>> }
>>>> 
>>>> virtual int wait_completion() {
>>>>   lock_cond.lock.Lock();
>>>>   while (1) {
>>>>     if (is_completed()) {
>>>>       lock_cond.lock.Unlock();
>>>>       return ret_code;
>>>>     }
>>>>     // Cond::Wait() re-acquires the mutex before returning, so no
>>>>     // explicit re-lock is needed here.
>>>>     lock_cond.cond.Wait(lock_cond.lock);
>>>>   }
>>>> }
>>>> };
>>>> 
>>>> class ClsBucketIndexListAioThrottler : public ClsBucketIndexAioThrottler {
>>>> protected:
>>>> vector<string> bucket_objects;
>>>> vector<string>::iterator iter_pos;
>>>> public:
>>>> ClsBucketIndexListAioThrottler(IoCtx& _io_ctx,
>>>>     const vector<string>& _bucket_objs)
>>>>  : ClsBucketIndexAioThrottler(_io_ctx), bucket_objects(_bucket_objs),
>>>>  iter_pos(bucket_objects.begin()) {}
>>>> 
>>>> virtual bool is_completed() {
>>>>  Mutex::Locker l(lock);
>>>>  int sent = 0;
>>>>  vector<string>::iterator iter = bucket_objects.begin();
>>>>  for (; iter != iter_pos; ++iter) ++sent;
>>>> 
>>>>  return (sent == completed &&
>>>>  (iter_pos == bucket_objects.end() /*Success*/ || ret_code < 0 
>>>> /*Failure*/));
>>>> }
>>>> };
>>>> 
>>>> template <typename T>
>>>> class ClsBucketIndexOpCtx : public ObjectOperationCompletion {
>>>> private:
>>>> T* data;
>>>> // Return code of the operation
>>>> int* ret_code;
>>>> 
>>>> // The AIO completion object associated with this op; it should
>>>> // be released from within the completion handler
>>>> librados::AioCompletion* completion;
>>>> ClsBucketIndexAioThrottler* throttler;
>>>> public:
>>>> ClsBucketIndexOpCtx(T* _data, int* _ret_code, librados::AioCompletion* 
>>>> _completion,
>>>>ClsBucketIndexAioThrottler* _throttler)
>&g

Re: KeyFileStore ?

2014-08-01 Thread Guang Yang
I really like the idea. One scenario that keeps bothering us is that there are 
too many small files, which makes file system indexing slow (so that a single 
read request could take more than 10 disk IOs for path lookup).

If we pursue this proposal, is there a chance we can take one step further: 
instead of storing one physical file for each object, we could allocate a big 
file (tens of GB) and have each object map to a chunk within that big file. 
That way all those big files' descriptors could be cached to avoid disk I/O to 
open the file. At least we should keep it flexible, so that if someone would 
like to implement it that way, there is a chance to leverage the existing 
framework.

Thanks,
Guang

On Jul 31, 2014, at 1:25 PM, Sage Weil  wrote:

> After the latest set of bug fixes to the FileStore file naming code I am 
> newly inspired to replace it with something less complex.  Right now I'm 
> mostly thinking about HDDs, although some of this may map well onto hybrid 
> SSD/HDD as well.  It may or may not make sense for pure flash.
> 
> Anyway, here are the main flaws with the overall approach that FileStore 
> uses:
> 
> - It tries to maintain a direct mapping of object names to file names.  
> This is problematic because of 255 character limits, rados namespaces, pg 
> prefixes, and the pg directory hashing we do to allow efficient split, for 
> starters.  It is also problematic because we often want to do things like 
> rename but can't make it happen atomically in combination with the rest of 
> our transaction.
> 
> - The PG directory hashing (that we do to allow efficient split) can have 
> a big impact on performance, particularly when ingesting lots of data.  
> (And when benchmarking.)  It's also complex.
> 
> - We often overwrite or replace entire objects.  These are "easy" 
> operations to do safely without doing complete data journaling, but the 
> current design is not conducive to doing anything clever (and it's complex 
> enough that I wouldn't want to add any cleverness on top).
> 
> - Objects may contain only key/value data, but we still have to create an 
> inode for them and look that up first.  This only matters for some 
> workloads (rgw indexes, cephfs directory objects).
> 
> Instead, I think we should try a hybrid approach that more heavily 
> leverages a key/value db in combination with the file system.  The kv db 
> might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just 
> assume it provides transactional key/value storage and efficient range 
> operations.  Here's the basic idea:
> 
> - The mapping from names to object lives in the kv db.  The object 
> metadata is in a structure we can call an "onode" to avoid confusing it 
> with the inodes in the backing file system.  The mapping is simple 
> ghobject_t -> onode map; there is no PG collection.  The PG collection 
> still exist but really only as ranges of those keys.  We will need to be 
> slightly clever with the coll_t to distinguish between "bare" PGs (that 
> live in this flat mapping) and the other collections (*_temp and 
> metadata), but that should be easy.  This makes PG splitting "free" as far 
> as the objects go.
> 
> - The onodes are relatively small.  They will contain the xattrs and 
> basic metadata like object size.  They will also identify the file name of 
> the backing file in the file system (if size > 0).
> 
> - The backing file can be a random, short file name.  We can just make a 
> one or two level deep set of directories, and let the directories get 
> reasonably big... whatever we decide the backing fs can handle 
> efficiently.  We can also store a file handle in the onode and use the 
> open by handle API; this should let us go directly from onode (in our kv 
> db) to the on-disk inode without looking at the directory at all, and fall 
> back to using the actual file name only if that fails for some reason 
> (say, someone mucked around with the backing files).  The backing file 
> need not have any xattrs on it at all (except perhaps some simple id to 
> verify it does it fact belong to the referring onode, just as a sanity 
> check).
> 
> - The name -> onode mapping can live in a disjunct part of the kv 
> namespace so that the other kv stuff associated with the file (like omap 
> pairs or big xattrs or whatever) don't blow up those parts of the 
> db and slow down lookup.
> 
> - We can keep a simple LRU of recent onodes in memory and avoid the kv 
> lookup for hot objects.
> 
> - Previously complicated operations like rename are now trivial: we just 
> update the kv db with a transaction.  The backing file never gets renamed, 
> ever, and the other object omap data is keyed by a unique (onode) id, not 
> the name.
> 
> Initially, for simplicity, we can start with the existing data journaling 
> behavior.  However, I think there are opportunities to improve the 
> situation there.  There is a pending wip-transactions branch in which I 
> started to rejigger the ObjectStore::Trans
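As an aside, the open-by-handle idea from the list above can be sketched with
the Linux name_to_handle_at(2)/open_by_handle_at(2) calls (a rough
illustration only, not actual Ceph code; note that open_by_handle_at()
requires CAP_DAC_READ_SEARCH):

#include <fcntl.h>   // name_to_handle_at, open_by_handle_at (_GNU_SOURCE)
#include <cstdlib>
#include <string>
#include <vector>

// Obtain a handle for a backing file; the raw bytes are what would be
// stashed in the onode so the inode can later be opened directly,
// skipping the directory lookup entirely.
std::vector<unsigned char> handle_for(const std::string& path, int* mount_id) {
  struct file_handle* fh =
      (struct file_handle*)malloc(sizeof(*fh) + MAX_HANDLE_SZ);
  fh->handle_bytes = MAX_HANDLE_SZ;
  if (name_to_handle_at(AT_FDCWD, path.c_str(), fh, mount_id, 0) < 0) {
    free(fh);
    return std::vector<unsigned char>();
  }
  std::vector<unsigned char> bytes(
      (unsigned char*)fh,
      (unsigned char*)fh + sizeof(*fh) + fh->handle_bytes);
  free(fh);
  return bytes;
}

// Open directly from the stored handle (mount_fd is any fd on the same
// file system); fall back to the actual file name if the handle went
// stale, e.g. because someone mucked around with the backing files.
int open_backing_file(int mount_fd, std::vector<unsigned char>& stored,
                      const std::string& fallback_path) {
  struct file_handle* fh = (struct file_handle*)stored.data();
  int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
  if (fd < 0)
    fd = open(fallback_path.c_str(), O_RDONLY);
  return fd;
}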

Re: bucket index sharding - IO throttle

2014-07-31 Thread Guang Yang
Thanks Yehuda. I will do that (sorry I was occupied by some other stuff 
recently but I will try my best to provide a patch as soon as possible).

Thanks,
Guang

On Jul 31, 2014, at 1:00 AM, Yehuda Sadeh wrote:

> Can you send this code through a github pull request (or at least as a
> patch)? It'll be easier to review and comment.
> 
> Thanks,
> Yehuda
> 
> On Wed, Jul 30, 2014 at 7:58 AM, Guang Yang  wrote:
>> +ceph-devel.
>> 
>> Thanks,
>> Guang
>> 
>> On Jul 29, 2014, at 10:20 PM, Guang Yang  wrote:
>> 
>>> Hi Yehuda,
>>> Per your review comment regarding IO throttling for bucket index 
>>> operations, I prototyped the code below (details still need polish); can 
>>> you take a look to see whether this is the right way to go?
>>> 
>>> Another problem I came across is that 
>>> ClsBucketIndexOpCtx::handle_completion was not called for the bucket index 
>>> init op (below); is there anything obvious I missed here?
>>> 
>>> Thanks,
>>> Guang
>>> 
>>> 
>>> class ClsBucketIndexAioThrottler {
>>> protected:
>>> int completed;
>>> int ret_code;
>>> IoCtx& io_ctx;
>>> Mutex lock;
>>> struct LockCond {
>>>   Mutex lock;
>>>   Cond cond;
>>>   LockCond() : lock("LockCond"), cond() {}
>>> } lock_cond;
>>> public:
>>> ClsBucketIndexAioThrottler(IoCtx& _io_ctx)
>>>   : completed(0), ret_code(0), io_ctx(_io_ctx),
>>>   lock("ClsBucketIndexAioThrottler"), lock_cond() {}
>>> 
>>> virtual ~ClsBucketIndexAioThrottler() {}
>>> virtual void do_next() = 0;
>>> virtual bool is_completed () = 0;
>>> 
>>> void complete(int ret) {
>>>   {
>>> Mutex::Locker l(lock);
>>> if (ret < 0)
>>>   ret_code = ret;
>>> ++completed;
>>>   }
>>> 
>>>   lock_cond.lock.Lock();
>>>   lock_cond.cond.Signal();
>>>   lock_cond.lock.Unlock();
>>> }
>>> 
>>> int get_ret_code () {
>>>   Mutex::Locker l(lock);
>>>   return ret_code;
>>> }
>>> 
>>> virtual int wait_completion() {
>>>   lock_cond.lock.Lock();
>>>   while (1) {
>>>     if (is_completed()) {
>>>       lock_cond.lock.Unlock();
>>>       return ret_code;
>>>     }
>>>     // Cond::Wait() re-acquires the mutex before returning, so no
>>>     // explicit re-lock is needed here.
>>>     lock_cond.cond.Wait(lock_cond.lock);
>>>   }
>>> }
>>> };
>>> 
>>> class ClsBucketIndexListAioThrottler : public ClsBucketIndexAioThrottler {
>>> protected:
>>> vector<string> bucket_objects;
>>> vector<string>::iterator iter_pos;
>>> public:
>>> ClsBucketIndexListAioThrottler(IoCtx& _io_ctx,
>>>     const vector<string>& _bucket_objs)
>>>   : ClsBucketIndexAioThrottler(_io_ctx), bucket_objects(_bucket_objs),
>>>   iter_pos(bucket_objects.begin()) {}
>>> 
>>> virtual bool is_completed() {
>>>   Mutex::Locker l(lock);
>>>   int sent = 0;
>>>   vector<string>::iterator iter = bucket_objects.begin();
>>>   for (; iter != iter_pos; ++iter) ++sent;
>>> 
>>>   return (sent == completed &&
>>>   (iter_pos == bucket_objects.end() /*Success*/ || ret_code < 0 
>>> /*Failure*/));
>>> }
>>> };
>>> 
>>> template <typename T>
>>> class ClsBucketIndexOpCtx : public ObjectOperationCompletion {
>>> private:
>>> T* data;
>>> // Return code of the operation
>>> int* ret_code;
>>> 
>>> // The AIO completion object associated with this op; it should
>>> // be released from within the completion handler
>>> librados::AioCompletion* completion;
>>> ClsBucketIndexAioThrottler* throttler;
>>> public:
>>> ClsBucketIndexOpCtx(T* _data, int* _ret_code, librados::AioCompletion* 
>>> _completion,
>>> ClsBucketIndexAioThrottler* _throttler)
>>>   : data(_data), ret_code(_ret_code), completion(_completion), 
>>> throttler(_throttler) {}
>>> ~ClsBucketIndexOpCtx() {}
>>> 
>>> // The completion callback, fill the response data
>>> void handle_completion(int r, bufferlist& outbl) {
>>>   if (r >= 0) {
>>> if (data) {
>>>   try {
>>> bufferlist::iterator iter = outbl.begin();
>>> ::decode((*data), iter);
>>>   } catch (buffer::error& err) {
>>> r = -EIO;
>>>   }
>>> }
>&

Re: bucket index sharding - IO throttle

2014-07-30 Thread Guang Yang
+ceph-devel.

Thanks,
Guang

On Jul 29, 2014, at 10:20 PM, Guang Yang  wrote:

> Hi Yehuda,
> Per your review comment regarding IO throttling for bucket index operations, 
> I prototyped the code below (details still need polish); can you take a look 
> to see whether this is the right way to go?
> 
> Another problem I came across is that ClsBucketIndexOpCtx::handle_completion 
> was not called for the bucket index init op (below); is there anything 
> obvious I missed here?
> 
> Thanks,
> Guang
> 
> 
> class ClsBucketIndexAioThrottler {
> protected:
>   int completed;
>   int ret_code;
>   IoCtx& io_ctx;
>   Mutex lock;
>   struct LockCond {
>     Mutex lock;
>     Cond cond;
>     LockCond() : lock("LockCond"), cond() {}
>   } lock_cond;
> public:
>   ClsBucketIndexAioThrottler(IoCtx& _io_ctx)
>     : completed(0), ret_code(0), io_ctx(_io_ctx),
>       lock("ClsBucketIndexAioThrottler"), lock_cond() {}
> 
>   virtual ~ClsBucketIndexAioThrottler() {}
>   virtual void do_next() = 0;
>   virtual bool is_completed() = 0;
> 
>   void complete(int ret) {
>     {
>       Mutex::Locker l(lock);
>       if (ret < 0)
>         ret_code = ret;
>       ++completed;
>     }
> 
>     lock_cond.lock.Lock();
>     lock_cond.cond.Signal();
>     lock_cond.lock.Unlock();
>   }
> 
>   int get_ret_code() {
>     Mutex::Locker l(lock);
>     return ret_code;
>   }
> 
>   virtual int wait_completion() {
>     lock_cond.lock.Lock();
>     while (1) {
>       if (is_completed()) {
>         lock_cond.lock.Unlock();
>         return ret_code;
>       }
>       // Cond::Wait() releases the mutex while sleeping and re-acquires
>       // it before returning, so no explicit re-lock is needed here.
>       lock_cond.cond.Wait(lock_cond.lock);
>     }
>   }
> };
> 
> class ClsBucketIndexListAioThrottler : public ClsBucketIndexAioThrottler {
> protected:
>   vector<string> bucket_objects;
>   vector<string>::iterator iter_pos;
> public:
>   ClsBucketIndexListAioThrottler(IoCtx& _io_ctx,
>       const vector<string>& _bucket_objs)
>     : ClsBucketIndexAioThrottler(_io_ctx), bucket_objects(_bucket_objs),
>       iter_pos(bucket_objects.begin()) {}
> 
>   virtual bool is_completed() {
>     Mutex::Locker l(lock);
>     int sent = 0;
>     vector<string>::iterator iter = bucket_objects.begin();
>     for (; iter != iter_pos; ++iter) ++sent;
> 
>     return (sent == completed &&
>         (iter_pos == bucket_objects.end() /*Success*/ ||
>          ret_code < 0 /*Failure*/));
>   }
> };
> 
> template <typename T>
> class ClsBucketIndexOpCtx : public ObjectOperationCompletion {
> private:
>   T* data;
>   // Return code of the operation
>   int* ret_code;
> 
>   // The AIO completion object associated with this op; it should
>   // be released from within the completion handler
>   librados::AioCompletion* completion;
>   ClsBucketIndexAioThrottler* throttler;
> public:
>   ClsBucketIndexOpCtx(T* _data, int* _ret_code,
>       librados::AioCompletion* _completion,
>       ClsBucketIndexAioThrottler* _throttler)
>     : data(_data), ret_code(_ret_code), completion(_completion),
>       throttler(_throttler) {}
>   ~ClsBucketIndexOpCtx() {}
> 
>   // The completion callback, fill the response data
>   void handle_completion(int r, bufferlist& outbl) {
>     if (r >= 0) {
>       if (data) {
>         try {
>           bufferlist::iterator iter = outbl.begin();
>           ::decode((*data), iter);
>         } catch (buffer::error& err) {
>           r = -EIO;
>         }
>       }
>     }
>     // Launch the next request before marking this one complete
>     throttler->do_next();
>     throttler->complete(r);
>     if (completion) {
>       completion->release();
>     }
>   }
> };
> 
> 
> class ClsBucketIndexInitAioThrottler : public ClsBucketIndexListAioThrottler {
> public:
>   ClsBucketIndexInitAioThrottler(IoCtx& _io_ctx,
>       const vector<string>& _bucket_objs)
>     : ClsBucketIndexListAioThrottler(_io_ctx, _bucket_objs) {}
> 
>   virtual void do_next() {
>     string oid;
>     {
>       Mutex::Locker l(lock);
>       if (iter_pos == bucket_objects.end())
>         return;
>       oid = *(iter_pos++);
>     }
>     AioCompletion* c =
>         librados::Rados::aio_create_completion(NULL, NULL, NULL);
>     // Dummy input buffer; the init op takes no payload
>     bufferlist in;
>     librados::ObjectWriteOperation op;
>     op.create(true);
>     op.exec("rgw", "bucket_init_index", in,
>         new ClsBucketIndexOpCtx<int>(NULL, NULL, c, this));
>     io_ctx.aio_operate(oid, c, &op, NULL);
>   }
> };
> 
> 
> int cls_rgw_bucket_index_init_op(librados::IoCtx& io_ctx,
>     const vector<string>& bucket_objs, uint32_t max_aio)
> {
>   // Prime the window with at most max_aio in-flight operations; each
>   // completion launches the next one via do_next().
>   vector<string>::const_iterator iter = bucket_objs.begin();
>   ClsBucketIndexAioThrottler* throttler =
>       new ClsBucketIndexInitAioThrottler(io_ctx, bucket_objs);
>   for (; iter != bucket_objs.end() && max_aio-- > 0; ++iter) {
>     throttler->do_next();
>   }
>   int r = throttler->wait_completion();
>   delete throttler;
>   return r;
> }
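For reference, the bounded-window idea in the code above can also be shaped
as a submitter-side throttle: the caller blocks before issuing each AIO once
max_inflight ops are outstanding, and each completion callback releases a
slot. A minimal sketch with plain C++11 primitives (illustrative only, not
Ceph's Mutex/Cond code):

#include <condition_variable>
#include <mutex>

// Counting throttle: acquire() blocks while max_inflight ops are in
// flight; the AIO completion callback calls release(); drain() waits
// for everything outstanding to finish.
class SimpleThrottle {
  std::mutex m;
  std::condition_variable cv;
  int inflight = 0;
  const int max_inflight;
 public:
  explicit SimpleThrottle(int max) : max_inflight(max) {}
  void acquire() {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [this] { return inflight < max_inflight; });
    ++inflight;
  }
  void release() {
    std::lock_guard<std::mutex> l(m);
    --inflight;
    cv.notify_all();
  }
  void drain() {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [this] { return inflight == 0; });
  }
};

The submitter loop then becomes: acquire(); aio_operate(...) with a
completion whose callback calls release(); and finally drain() before
returning. This avoids chaining the next request from inside the completion
handler at the cost of blocking the submitter thread.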
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


rgw geo-replication to another data store?

2014-07-17 Thread Guang Yang
Hi cephers,
We are investigating a backup solution for Ceph; in short, we would like a 
solution to back up a Ceph cluster to another data store (not a Ceph cluster; 
assume it has a SWIFT API). We would like to have both full backups and 
incremental backups on top of the full backup.

After going through the geo-replication blueprint [1], I am thinking that we 
can leverage that effort and, instead of replicating the data into another 
Ceph cluster, make it replicate to another data store. At the same time, I 
have a couple of questions which need your help:

1) How does the radosgw-agent scale to multiple hosts? Our first investigation 
shows it only works on a single host, but I would like to confirm.
2) Can we configure the interval for incremental backups, e.g. 1 hour / 1 day 
/ 1 month?

[1] 
https://wiki.ceph.com/Planning/Blueprints/Dumpling/RGW_Geo-Replication_and_Disaster_Recovery

Thanks,
Guang--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: EC pool - empty files in OSD from radosgw

2014-07-14 Thread Guang Yang
Hi Yehuda and Sam,
Any suggestions on the issue below?

Thanks,
Guang

On Jul 12, 2014, at 12:43 AM, Guang Yang  wrote:

> Hi Loic,
> I opened an issue about a change brought along with EC pools plus radosgw 
> (http://tracker.ceph.com/issues/8625). In our test cluster, we observed a 
> large number of empty files on the OSDs, and the root cause is that for the 
> head object from radosgw, a couple of transactions come together, including 
> create 0~0, setxattr, and writefull. As EC brings in the concept of object 
> generations, the create transaction first creates an object, and the 
> following write-full transaction is taken as an update, renaming the 
> original empty file to a generation and creating/writing a new file. As a 
> result, we observed quite a few empty files.
> 
> There is a bug tracking the effort to remove those files with generations, 
> which is pending backport to firefly; that could definitely help our use 
> case. However, I am also wondering if there is any room for improvement here 
> so that those empty files would not be generated in the first place (the 
> change might be on the radosgw side).
> 
> Any suggestions are welcome.
> 
> Thanks,
> Guang

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


EC pool - empty files in OSD from radosgw

2014-07-11 Thread Guang Yang
Hi Loic,
I opened an issue about a change brought along with EC pools plus radosgw 
(http://tracker.ceph.com/issues/8625). In our test cluster, we observed a 
large number of empty files on the OSDs, and the root cause is that for the 
head object from radosgw, a couple of transactions come together, including 
create 0~0, setxattr, and writefull. As EC brings in the concept of object 
generations, the create transaction first creates an object, and the following 
write-full transaction is taken as an update, renaming the original empty file 
to a generation and creating/writing a new file. As a result, we observed 
quite a few empty files.

There is a bug tracking the effort to remove those files with generations, 
which is pending backport to firefly; that could definitely help our use case. 
However, I am also wondering if there is any room for improvement here so that 
those empty files would not be generated in the first place (the change might 
be on the radosgw side).
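For illustration, if the head object could be written without the separate
create transaction, the empty file would never exist to be renamed. A minimal
librados sketch of the idea (whether rgw can actually drop the create depends
on its atomicity requirements, and the attribute name here is a placeholder):

#include <rados/librados.hpp>

// Write a head object as setxattr + write_full in one compound op, with
// no explicit create(); write_full creates the object implicitly.
int write_head(librados::IoCtx& ioctx, const std::string& oid,
               librados::bufferlist& data, librados::bufferlist& attr)
{
  librados::ObjectWriteOperation op;
  op.setxattr("user.rgw.manifest", attr);  // placeholder attribute name
  op.write_full(data);
  return ioctx.operate(oid, &op);
}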

Any suggestions are welcome.

Thanks,
Guang--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: v0.80.2?

2014-07-10 Thread Guang Yang
Hi Sage,
Is it possible to include a fix for this bug - 
http://tracker.ceph.com/issues/8733 - in the next release, considering the 
scope of the change and the regression risk? We are finalizing our production 
launch version, and this one is a blocker as we use EC pools.

Thanks,
Guang

On Jul 11, 2014, at 7:31 AM, Sage Weil  wrote:

> We built v0.80.2 yesterday and pushed it out to the repos, but quickly 
> discovered a regression in radosgw that prevented reading objects written 
> with earlier versions.  We pulled the packages, fixed the bug, and are 
> rerunning tests to confirm the fix and ensure there aren't other 
> upgrade-related issues.  We expect to have a v0.80.3 ready tomorrow or 
> Monday.
> 
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


radosgw - bucket index sharding merge back

2014-07-07 Thread Guang Yang
Hi Yehuda,
I am trying to find a way to merge back the bucket index sharding effort, and 
with more experience working on Ceph, I have realized that the original commit 
was too large, which made it troublesome to review. I am thinking of breaking 
it down into multiple small commits and merging back with a number of patches. 
I have two questions here:

1) Have you looked at the patch, and is there any suggestion I should pay 
attention to when doing the split?
2) Can we merge back with a series of patches (e.g. several commits per patch)?

Any suggestions I should pay attention to so as to drive this effort to 
completion?

Thanks,
Guang--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CDS G/H - bucket index sharding

2014-06-23 Thread Guang Yang
Thanks Yehuda, my comments inline...
On Jun 23, 2014, at 10:44 PM, Yehuda Sadeh  wrote:

> On Mon, Jun 23, 2014 at 4:11 AM, Guang Yang  wrote:
>> Hello Yehuda,
>> I drafted a brief summary of the status of the bucket index sharding 
>> blueprint and put it here - 
>> http://pad.ceph.com/p/GH-bucket-index-scalability; it would be nice if you 
>> could take a look to see if there is anything I missed. I also posted the 
>> pull request here - https://github.com/ceph/ceph/pull/2013.
> 
> Just one note regarding the blueprint, other BI log operations will
> need to use the new schema too (e.g., log trim operations).
Yeah, that has been implemented, thanks for pointing it out.
> 
> I was thinking a bit about how to do resizing and dynamic sharding
> later on. My thought was that we'd have two bucket prefixes: one for
> read and delete operations, and one for read, write and delete
> operations. Normally both will point at the same prefix and we'll just
> access a single one. But when we're resizing we'll need to use both.
> If we're listing objects we'll access both sets of shards and merge
> everything. If we're creating an object we'll just create it in the
> second one. When removing an object, we'll remove it from both.
> The above description is a bit vague, and shouldn't really change what
> we do now. Just that the implementation needs to maybe abstract that
> bucket access decision nicely so that in the future we could implement
> this easily.
Considering the tradeoffs we have with multiple shards per bucket index 
object, we are not likely to create a large number of shards (unless we add 
something like per-shard listing), thus it might make sense to start with the 
upper bound directly (e.g. 50); that should be good enough for most use cases. 
Another direction we may explore is to let the user specify the number of 
shards (e.g. via user-defined metadata) when he/she has an estimate of the 
number of objects for a bucket.

As for dynamic sharding, I think there are two options: one with no data 
migration when changing the number of shards (thus there might be multiple 
versions of the truth), and another with data migration. The approach 
mentioned above is the first one; we should be able to implement it with some 
client-side aggregation over the multiple versions of the truth. A sketch of 
the shard-selection side is below.
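A minimal sketch of that shard selection (the ".dir.<marker>.<shard>" naming
and the hash here are illustrative assumptions, not necessarily what rgw
would ship with):

#include <cstdint>
#include <sstream>
#include <string>

// Map an object name onto one of num_shards index objects; the shard
// count is fixed at bucket creation time in the static scheme.
static uint32_t name_hash(const std::string& s) {  // djb2-style, illustrative
  uint32_t h = 5381;
  for (size_t i = 0; i < s.size(); ++i)
    h = h * 33 + (unsigned char)s[i];
  return h;
}

std::string index_shard_oid(const std::string& bucket_marker,
                            const std::string& object_name,
                            uint32_t num_shards) {
  std::ostringstream oss;
  oss << ".dir." << bucket_marker << "."
      << (name_hash(object_name) % num_shards);
  return oss.str();
}

Listing is the flip side: with num_shards index objects, a bucket listing has
to read from every shard and merge-sort the results on the client side, which
is the aggregation mentioned above.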
> 
> Sadly I'll be off for this CDS, but I'm sure Josh, Greg, Sage, and
> others will be able to help there.
> 
> Thanks,
> Yehuda
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


XFS - number of files in a directory

2014-06-23 Thread Guang Yang
Hello Cephers,
We used to have a Ceph cluster with the data pool set up with 3 replicas. We 
estimated the number of files (given disk size and object size) for each PG 
was around 8K, and we disabled folder splitting, which meant all files were 
located in the root PG folder. Our testing showed good performance with that 
setup.

Right now we are evaluating erasure coding, which splits each object into a 
number of chunks and increases the number of files several times. Although XFS 
claims good support for large directories [1], some testing also showed that 
we may expect performance degradation for large directories.

I would like to hear about your experience with this on your Ceph clusters if 
you are using XFS. Thanks.

[1] http://www.scs.stanford.edu/nyu/02fa/sched/xfs.pdf

Thanks,
Guang--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


CDS G/H - bucket index sharding

2014-06-23 Thread Guang Yang
Hello Yehuda,
I drafted a brief summary of the status of the bucket index sharding blueprint 
and put it here - http://pad.ceph.com/p/GH-bucket-index-scalability; it would 
be nice if you could take a look to see if there is anything I missed. I also 
posted the pull request here - https://github.com/ceph/ceph/pull/2013.

Thanks,
Guang--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


PG folder splitting proposal

2014-06-23 Thread Guang Yang
Hi Sage,
Would you please help comment on the proposal in comment 7 of this ticket - 
http://tracker.ceph.com/issues/7593#note-7 ? As we move to EC pools, the 
number of files in each PG increases several times over, which makes folder 
splitting much more likely to happen. Do you think pre-splitting at pool 
creation time, with a hint stored as pool metadata, is a worthwhile change? A 
rough back-of-the-envelope on the scale involved is below.
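Here is that back-of-the-envelope (all inputs are illustrative assumptions,
not measured values):

#include <cstdio>

int main() {
  double disk_bytes   = 4e12;    // 4 TB data disk
  double fill_ratio   = 0.7;     // target utilization
  double object_bytes = 512e3;   // average rados object size
  int    k            = 8;       // EC data chunks; each file is object/k
  int    pgs_per_osd  = 100;

  // Replicated: one full-size file per object per OSD.
  double files_replicated = disk_bytes * fill_ratio / object_bytes;
  // EC: files are object_bytes/k, so roughly k times as many fit per disk.
  double files_ec = files_replicated * k;

  std::printf("files per PG (replicated): %.0f\n",
              files_replicated / pgs_per_osd);
  std::printf("files per PG (EC, k=%d):   %.0f\n",
              k, files_ec / pgs_per_osd);
  // With these inputs: ~55K vs ~437K files per PG directory tree, which
  // is why splitting (or pre-splitting) matters much more with EC.
  return 0;
}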

Thanks,
Guang--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Changes of scrubbing?

2014-06-14 Thread Guang Yang

On Jun 11, 2014, at 10:02 PM, Gregory Farnum  wrote:

> On Wed, Jun 11, 2014 at 12:54 AM, Guang Yang  wrote:
>> On Jun 11, 2014, at 6:33 AM, Gregory Farnum  wrote:
>> 
>>> On Tue, May 20, 2014 at 6:44 PM, Guang Yang  wrote:
>>>> Hi ceph-devel,
>>>> Like some users of Ceph, we are using Ceph for a latency-sensitive 
>>>> project, and scrubbing (especially deep-scrubbing) impacts the SLA in a 
>>>> non-trivial way. As commodity hardware can fail in one way or another, 
>>>> I think it is essential to have scrubbing enabled to preserve data 
>>>> durability.
>>>> 
>>>> Inspired by how the erasure coding backend implements scrubbing[1], I am 
>>>> wondering whether the following change is valid to somehow reduce the 
>>>> performance impact of scrubbing:
>>>> 1. Store the CRC checksum along with each physical copy of the object on 
>>>> the filesystem (via xattr or omap?)
>>>> 2. For read requests, check the CRC locally; if it mismatches, 
>>>> redirect the request to a replica and mark the PG as inconsistent.
>>> 
>>> The problem with this is that you need to maintain the CRC across
>>> partial overwrites of the object. And the real cost of scrubbing isn't
>>> in the network traffic, it's in the disk reads, which you would have
>>> to do anyway with this method. :)
>> Thanks Greg for the response!
>> Partial updates are the right concern if they happen frequently. However, 
>> the major benefit of this proposal is to postpone the CRC check to the READ 
>> request instead of doing it from within a background job (although we may 
>> still need background checks via deep-scrubbing, we can reduce their 
>> frequency dramatically). By checking the CRC at read time, inconsistent 
>> objects are detected, the PG is marked inconsistent, and we can further 
>> trigger a repair for the PG.
> 
> Oh, I see.
> Still, partial update is in fact the major concern. We have a debug
> mechanism called "sloppy crc" or similar that keeps track of them for
> full (or sufficiently large?) writes, but it's not something you can
> use on a production cluster because it turns every write into a
> read-modify-write cycle, and that's just prohibitively expensive (in
> addition to issues with stuff like OSD restart, I think). This sort of
> thing would make sense for the erasure-coded pools; maybe that would
> be a better place to start?
Yeah, that sounds like a good starting point, let me see if I can spend some 
time doing a simple POC.
Thanks Greg.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Changes of scrubbing?

2014-06-11 Thread Guang Yang
On Jun 11, 2014, at 6:33 AM, Gregory Farnum  wrote:

> On Tue, May 20, 2014 at 6:44 PM, Guang Yang  wrote:
>> Hi ceph-devel,
>> Like some users of Ceph, we are using Ceph for a latency-sensitive project, 
>> and scrubbing (especially deep-scrubbing) impacts the SLA in a non-trivial 
>> way. As commodity hardware can fail in one way or another, I think it is 
>> essential to have scrubbing enabled to preserve data durability.
>> 
>> Inspired by how the erasure coding backend implements scrubbing[1], I am 
>> wondering whether the following change is valid to somehow reduce the 
>> performance impact of scrubbing:
>> 1. Store the CRC checksum along with each physical copy of the object on 
>> the filesystem (via xattr or omap?)
>> 2. For read requests, check the CRC locally; if it mismatches, redirect 
>> the request to a replica and mark the PG as inconsistent.
> 
> The problem with this is that you need to maintain the CRC across
> partial overwrites of the object. And the real cost of scrubbing isn't
> in the network traffic, it's in the disk reads, which you would have
> to do anyway with this method. :)
Thanks Greg for the response!
Partial updates are the right concern if they happen frequently. However, the 
major benefit of this proposal is to postpone the CRC check to the READ 
request instead of doing it from within a background job (although we may 
still need background checks via deep-scrubbing, we can reduce their frequency 
dramatically). By checking the CRC at read time, inconsistent objects are 
detected, the PG is marked inconsistent, and we can further trigger a repair 
for the PG.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
>> 
>> This is just a general idea and details (like append) will need to further 
>> discussed.
>> 
>> By having such, we can schedule the scrubbing less aggresively but still 
>> preserve the durability for read.
>> 
>> Does this make some sense?
>> 
>> [1] http://ceph.com/docs/master/dev/osd_internals/erasure_coding/pgbackend/
>> 
>> Thanks,
>> Guang Yang--
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Radosgw - bucket index

2014-06-06 Thread Guang Yang
Hi Yehuda,
Can you take a look at the code change at a very high level? Here is the pull 
request - https://github.com/ceph/ceph/pull/1929.

If things look good to you, I will continue the effort and make it clearer and 
more complete by the end of next week.

Thanks,
Guang

On Jun 2, 2014, at 9:37 PM, Guang Yang  wrote:

> Hi Yehuda and Sage,
> Could you help comment on the ticket? I would like to send out a pull 
> request some time this week for you to review, but before that, it would be 
> nice to see your comments on the interface and any other concerns you may 
> have. Thanks.
> 
> Thanks,
> Guang
> 
> 
> On May 30, 2014, at 8:35 AM, Guang Yang  wrote:
> 
>> Hi Yehuda,
>> I opened an issue here: http://tracker.ceph.com/issues/8473, please help to 
>> review and comment.
>> 
>> Thanks,
>> Guang
>> 
>> On May 19, 2014, at 2:47 PM, Yehuda Sadeh  wrote:
>> 
>>> On Sun, May 18, 2014 at 11:18 PM, Guang Yang  wrote:
>>>> On May 19, 2014, at 7:05 AM, Sage Weil  wrote:
>>>> 
>>>>> On Sun, 18 May 2014, Guang wrote:
>>>>>>>> radosgw is using the omap key/value API for objects, which is more or 
>>>>>>>> less
>>>>>>>> equivalent to what swift is doing with sqlite.  This data passes 
>>>>>>>> straight
>>>>>>>> into leveldb on the backend (or whatever other backend you are using).
>>>>>>>> Using something like rocksdb in its place is pretty simple and ther are
>>>>>>>> unmerged patches to do that; the user would just need to adjust their
>>>>>>>> crush map so that the rgw index pool is mapped to a different set of 
>>>>>>>> OSDs
>>>>>>>> with the better k/v backend.
>>>>> Not sure if I missed anything, but the key difference with Swift's
>>>>> implementation is that they are using a table for the bucket index, and
>>>>> it actually can be updated in parallel, which makes it more scalable for
>>>>> writes, though at a certain point the SQL table would result in
>>>>> performance degradation as well.
>>>>> 
>>>>> As I understand it the same limitation is present there too: the index is
>>>>> in a single sqlite table.
>>>>> 
>>>>>>> My more well-formed opinion is that we need to come up with a good
>>>>>>> design. It needs to be flexible enough to be able to grow (and maybe
>>>>>>> shrink), and I assume there would be some kind of background operation
>>>>>>> that will enable that. I also believe that making it hash based is the
>>>>>>> way to go. It looks like that the more complicated issue is here is
>>>>>>> how to handle the transition in which we shard buckets.
>>>>> Yeah, I agree. I think the conflicting goals here are: we want a sorted
>>>>> list (so that it enables prefix scans for listing purposes) and we want
>>>>> to shard from the very beginning (the problem we are facing is that
>>>>> parallel writes updating the same bucket index object need to be
>>>>> serialized).
>>>>> 
>>>>> Given how infrequent container listings are, pre-sharding containers
>>>>> across several objects makes some sense.  Paying the cost of doing
>>>>> listings in parallel across N (where N is not too big) is not a big price
>>>>> to pay. However, there will always need to be a way to re-shard further
>>>>> when containers/buckets get extremely big.  Perhaps a starting point would
>>>>> be support for static sharding where the number of shards is specified at
>>>>> container/bucket creation time…
>>>> Considering the scope of the change, I also think this is a good starting 
>>>> point to make bucket index updates more scalable.
>>>> Yehuda,
>>>> What do you think?
>>> 
>>> Sharding it will help with scaling it up to a certain point. As Sage
>>> mentioned we can start with a static setting as a first simpler
>>> approach, and move into a dynamic approach later on.
>>> 
>>> Yehuda
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Radosgw - bucket index

2014-06-02 Thread Guang Yang
Hi Yehuda and Sage,
Could you help comment on the ticket? I would like to send out a pull request 
some time this week for you to review, but before that, it would be nice to 
see your comments on the interface and any other concerns you may have. 
Thanks.

Thanks,
Guang


On May 30, 2014, at 8:35 AM, Guang Yang  wrote:

> Hi Yehuda,
> I opened an issue here: http://tracker.ceph.com/issues/8473, please help to 
> review and comment.
> 
> Thanks,
> Guang
> 
> On May 19, 2014, at 2:47 PM, Yehuda Sadeh  wrote:
> 
>> On Sun, May 18, 2014 at 11:18 PM, Guang Yang  wrote:
>>> On May 19, 2014, at 7:05 AM, Sage Weil  wrote:
>>> 
>>>> On Sun, 18 May 2014, Guang wrote:
>>>>>>> radosgw is using the omap key/value API for objects, which is more or 
>>>>>>> less
>>>>>>> equivalent to what swift is doing with sqlite.  This data passes 
>>>>>>> straight
>>>>>>> into leveldb on the backend (or whatever other backend you are using).
>>>>>>> Using something like rocksdb in its place is pretty simple and ther are
>>>>>>> unmerged patches to do that; the user would just need to adjust their
>>>>>>> crush map so that the rgw index pool is mapped to a different set of 
>>>>>>> OSDs
>>>>>>> with the better k/v backend.
>>>>> Not sure if I missed anything, but the key difference with Swift's
>>>>> implementation is that they are using a table for the bucket index, and
>>>>> it actually can be updated in parallel, which makes it more scalable for
>>>>> writes, though at a certain point the SQL table would result in
>>>>> performance degradation as well.
>>>> 
>>>> As I understand it the same limitation is present there too: the index is
>>>> in a single sqlite table.
>>>> 
>>>>>> My more well-formed opinion is that we need to come up with a good
>>>>>> design. It needs to be flexible enough to be able to grow (and maybe
>>>>>> shrink), and I assume there would be some kind of background operation
>>>>>> that will enable that. I also believe that making it hash based is the
>>>>>> way to go. It looks like that the more complicated issue is here is
>>>>>> how to handle the transition in which we shard buckets.
>>>>> Yeah, I agree. I think the conflicting goals here are: we want a sorted
>>>>> list (so that it enables prefix scans for listing purposes) and we want
>>>>> to shard from the very beginning (the problem we are facing is that
>>>>> parallel writes updating the same bucket index object need to be
>>>>> serialized).
>>>> 
>>>> Given how infrequent container listings are, pre-sharding containers
>>>> across several objects makes some sense.  Paying the cost of doing
>>>> listings in parallel across N (where N is not too big) is not a big price
>>>> to pay. However, there will always need to be a way to re-shard further
>>>> when containers/buckets get extremely big.  Perhaps a starting point would
>>>> be support for static sharding where the number of shards is specified at
>>>> container/bucket creation time…
>>> Considering the scope of the change, I also think this is a good starting 
>>> point to make bucket index updates more scalable.
>>> Yehuda,
>>> What do you think?
>> 
>> Sharding it will help with scaling it up to a certain point. As Sage
>> mentioned we can start with a static setting as a first simpler
>> approach, and move into a dynamic approach later on.
>> 
>> Yehuda
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Radosgw - bucket index

2014-05-29 Thread Guang Yang
Hi Yehuda,
I opened an issue here: http://tracker.ceph.com/issues/8473, please help to 
review and comment.

Thanks,
Guang

On May 19, 2014, at 2:47 PM, Yehuda Sadeh  wrote:

> On Sun, May 18, 2014 at 11:18 PM, Guang Yang  wrote:
>> On May 19, 2014, at 7:05 AM, Sage Weil  wrote:
>> 
>>> On Sun, 18 May 2014, Guang wrote:
>>>>>> radosgw is using the omap key/value API for objects, which is more or 
>>>>>> less
>>>>>> equivalent to what swift is doing with sqlite.  This data passes straight
>>>>>> into leveldb on the backend (or whatever other backend you are using).
>>>>>> Using something like rocksdb in its place is pretty simple and ther are
>>>>>> unmerged patches to do that; the user would just need to adjust their
>>>>>> crush map so that the rgw index pool is mapped to a different set of OSDs
>>>>>> with the better k/v backend.
>>>> Not sure if I missed anything, but the key difference with Swift's
>>>> implementation is that they are using a table for the bucket index, and
>>>> it actually can be updated in parallel, which makes it more scalable for
>>>> writes, though at a certain point the SQL table would result in
>>>> performance degradation as well.
>>> 
>>> As I understand it the same limitation is present there too: the index is
>>> in a single sqlite table.
>>> 
>>>>> My more well-formed opinion is that we need to come up with a good
>>>>> design. It needs to be flexible enough to be able to grow (and maybe
>>>>> shrink), and I assume there would be some kind of background operation
>>>>> that will enable that. I also believe that making it hash based is the
>>>>> way to go. It looks like that the more complicated issue is here is
>>>>> how to handle the transition in which we shard buckets.
>>>> Yeah, I agree. I think the conflicting goals here are: we want a sorted
>>>> list (so that it enables prefix scans for listing purposes) and we want
>>>> to shard from the very beginning (the problem we are facing is that
>>>> parallel writes updating the same bucket index object need to be
>>>> serialized).
>>> 
>>> Given how infrequent container listings are, pre-sharding containers
>>> across several objects makes some sense.  Paying the cost of doing
>>> listings in parallel across N (where N is not too big) is not a big price
>>> to pay. However, there will always need to be a way to re-shard further
>>> when containers/buckets get extremely big.  Perhaps a starting point would
>>> be support for static sharding where the number of shards is specified at
>>> container/bucket creation time…
>> Considering the scope of the change, I also think this is a good starting 
>> point to make bucket index updates more scalable.
>> Yehuda,
>> What do you think?
> 
> Sharding it will help with scaling it up to a certain point. As Sage
> mentioned we can start with a static setting as a first simpler
> approach, and move into a dynamic approach later on.
> 
> Yehuda
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: erasure code & reliability model

2014-05-26 Thread Guang Yang
Thanks Koleos!

Guang
On May 26, 2014, at 7:02 PM, Koleos Fuskus  wrote:

> Hello Guang,
> 
> Here is the wiki: 
> https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_modeling
> 
> Koleos
> 
> 
> 
> On Monday, May 26, 2014 10:05 AM, Guang Yang  wrote:
> Hello Loic and Koleos,
> Do we have a wiki page documenting the progress and reports of this effort? 
> We are very interested in this as well.
> 
> Thanks,
> Guang
> 
> 
> 
> 
> On May 8, 2014, at 1:19 AM, Loic Dachary  wrote:
> 
>> 
>> Hi,
>> 
>> On 07/05/2014 18:43, Koleos Fuskus wrote:> Hi Loic,
>>> 
>>> What do you mean by action plan? If it is the schedule, it is published in 
>>> my proposal on the melange site. Indeed, the details of my proposal are 
>>> private; if that is what you mean, I can add them on the wiki. If you want 
>>> this part to be public, no problem. 
>> 
>> Yes, that's what I meant :-) You could collect snippets from the proposal 
>> posted on melange and present them on a subpage 
>> https://wiki.ceph.com/Development and we can use this as the "home page" of 
>> your work ? 
>> 
>>> Actually I am spending some of my free time reading Ceph documentation (web 
>>> and papers). Do you have any specific documents to recommend? Maybe we 
>>> can discuss the durability model again on Friday/Monday.  I need some 
>>> more understanding of the Ceph architecture.
>> 
>> I'm connected on irc.oftc.net#ceph-devel today and tomorrow if you'd like to 
>> chat during this "community bonding" phase (I don't remember how google 
>> calls this for GSoC participants ;-)
>> 
>> Cheers
>> 
>>> 
>>> Cheers,
>>> Verónica
>>> On Wednesday, May 7, 2014 8:12 AM, Loic Dachary  wrote:
>>> Hi Veronica,
>>> 
>>> I was really happy to hear you're going to work on the erasure code 
>>> reliability model. Unless I'm mistaken your action plan was not published. 
>>> Would you mind adding it to the wiki so I can comment on it ? Someone else 
>>> might be interested to contribute too. I've had a discussion yesterday 
>>> about durability models (internally) and it is not well understood. Your 
>>> insight would be precious.
>>> 
>>> Cheers
>>> 
>>> -- 
>>> Loïc Dachary, Artisan Logiciel Libre
>>> 
>>> 
>> 
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: erasure code & reliability model

2014-05-26 Thread Guang Yang
Hello Loic and Koleos,
Do we have a wiki page documenting the progress and reports of this effort? We 
are very interested in this as well.

Thanks,
Guang

On May 8, 2014, at 1:19 AM, Loic Dachary  wrote:

> 
> Hi,
> 
> On 07/05/2014 18:43, Koleos Fuskus wrote:> Hi Loic,
>> 
>> What do you mean by action plan? If it is the schedule, it is published in 
>> my proposal on the melange site. Indeed, the details of my proposal are 
>> private; if that is what you mean, I can add them on the wiki. If you want 
>> this part to be public, no problem. 
> 
> Yes, that's what I meant :-) You could collect snippets from the proposal 
> posted on melange and present them on a subpage 
> https://wiki.ceph.com/Development and we can use this as the "home page" of 
> your work ? 
> 
>> Actually I am spending some of my free time reading Ceph documentation (web 
>> and papers). Do you have any specific documents to recommend? Maybe we can 
>> discuss the durability model again on Friday/Monday.  I need some more 
>> understanding of the Ceph architecture.
> 
> I'm connected on irc.oftc.net#ceph-devel today and tomorrow if you'd like to 
> chat during this "community bonding" phase (I don't remember how google calls 
> this for GSoC participants ;-)
> 
> Cheers
> 
>> 
>> Cheers,
>> Verónica
>> On Wednesday, May 7, 2014 8:12 AM, Loic Dachary  wrote:
>> Hi Veronica,
>> 
>> I was really happy to hear you're going to work on the erasure code 
>> reliability model. Unless I'm mistaken your action plan was not published. 
>> Would you mind adding it to the wiki so I can comment on it ? Someone else 
>> might be interested to contribute too. I've had a discussion yesterday about 
>> durability models (internally) and it is not well understood. Your insight 
>> would be precious.
>> 
>> Cheers
>> 
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>> 
>> 
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Questions of KeyValueStore (leveldb) backend

2014-05-25 Thread Guang Yang
Thanks!
On May 26, 2014, at 12:55 PM, Haomai Wang  wrote:

> On Mon, May 26, 2014 at 9:46 AM, Guang Yang  wrote:
>> Hello Haomai,
>> We are evaluating the key-value store backend which comes along with the 
>> Firefly release (thanks for implementing it in Ceph); it is very promising 
>> for a couple of our use cases. After going through the related code 
>> changes, I have a couple of questions which need your help:
>>  1. One observation is that objects larger than 1KB are striped into 
>> multiple chunks (key-value pairs in the leveldb table), with each strip 
>> being 1KB. Is there any particular reason we chose 1KB as the strip size (I 
>> didn't find a configuration option to tune this value)?
> 
> 1KB is not a carefully chosen value; it can be made configurable in the near future.
> 
>> 
>>  2. This is probably a leveldb question: do we expect performance 
>> degradation as the leveldb instance keeps growing (e.g. to several TB)?
> 
> A Ceph OSD is expected to own a physical disk, normally several
> TB (1-4 TB). LevelDB can handle that easily, especially as we use it to
> store large values (compared to normal application usage).
> 
>> 
>> Thanks,
>> Guang
> 
> 
> 
> -- 
> Best Regards,
> 
> Wheat
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Questions of KeyValueStore (leveldb) backend

2014-05-25 Thread Guang Yang
Hello Haomai,
We are evaluating the key-value store backend which comes along with the 
Firefly release (thanks for implementing it in Ceph); it is very promising for 
a couple of our use cases. After going through the related code changes, I 
have a couple of questions which need your help:
  1. One observation is that objects larger than 1KB are striped into multiple 
chunks (key-value pairs in the leveldb table), with each strip being 1KB. Is 
there any particular reason we chose 1KB as the strip size (I didn't find a 
configuration option to tune this value)? A sketch of the layout I have in 
mind is below.

  2. This is probably a leveldb question: do we expect performance degradation 
as the leveldb instance keeps growing (e.g. to several TB)?
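Here is that rough sketch of the strip-key layout (illustrative names, not
the exact KeyValueStore schema; strip_size is the tunable being asked about):

#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// An object's data is split into fixed-size strips, each stored under
// its own key in the kv backend; reads and writes then touch only the
// strips that overlap the requested range.
std::vector<std::string> strip_keys(const std::string& object,
                                    uint64_t object_size,
                                    uint64_t strip_size /* e.g. 1024 */)
{
  std::vector<std::string> keys;
  for (uint64_t off = 0; off < object_size; off += strip_size) {
    std::ostringstream oss;
    oss << object << "_DATA_" << (off / strip_size);  // strip index
    keys.push_back(oss.str());
  }
  return keys;
}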

Thanks,
Guang--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Changes of scrubbing?

2014-05-20 Thread Guang Yang
Hi ceph-devel,
Like some users of Ceph, we are using Ceph for a latency-sensitive project, 
and scrubbing (especially deep-scrubbing) impacts the SLA in a non-trivial 
way. As commodity hardware can fail in one way or another, I think it is 
essential to have scrubbing enabled to preserve data durability.

Inspired by how the erasure coding backend implements scrubbing[1], I am 
wondering whether the following change is valid to somehow reduce the 
performance impact of scrubbing:
 1. Store the CRC checksum along with each physical copy of the object on the 
filesystem (via xattr or omap?)
 2. For read requests, check the CRC locally; if it mismatches, redirect the 
request to a replica and mark the PG as inconsistent.

This is just a general idea, and details (like appends) will need to be 
further discussed; a rough sketch of the read path is below.
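A minimal sketch of that read path, assuming a whole-object crc32c kept in an
xattr at write time (the names and the crc32c helper are hypothetical):

#include <cstdint>
#include <string>

struct ObjectCopy {
  std::string data;
  uint32_t stored_crc;  // maintained in an xattr on every full write
};

enum ReadResult { READ_OK, READ_REDIRECT };

// Verify the local copy before returning it; on mismatch the caller
// would redirect the read to a replica and mark the PG inconsistent,
// which can then trigger a repair.
ReadResult checked_read(const ObjectCopy& o, std::string* out,
                        uint32_t (*crc32c)(uint32_t, const char*, size_t))
{
  if (crc32c(0, o.data.data(), o.data.size()) != o.stored_crc)
    return READ_REDIRECT;
  *out = o.data;
  return READ_OK;
}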

With this in place, we can schedule scrubbing less aggressively but still 
preserve durability for reads.

Does this make some sense?

[1] http://ceph.com/docs/master/dev/osd_internals/erasure_coding/pgbackend/ 

Thanks,
Guang Yang--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Radosgw - bucket index

2014-05-18 Thread Guang Yang
On May 19, 2014, at 7:05 AM, Sage Weil  wrote:

> On Sun, 18 May 2014, Guang wrote:
 radosgw is using the omap key/value API for objects, which is more or less
 equivalent to what swift is doing with sqlite.  This data passes straight
 into leveldb on the backend (or whatever other backend you are using).
 Using something like rocksdb in its place is pretty simple and ther are
 unmerged patches to do that; the user would just need to adjust their
 crush map so that the rgw index pool is mapped to a different set of OSDs
 with the better k/v backend.
>> Not sure if I missed anything, but the key difference with Swift's 
>> implementation is that they are using a table for the bucket index, and it 
>> actually can be updated in parallel, which makes it more scalable for 
>> writes, though at a certain point the SQL table would result in performance 
>> degradation as well.
> 
> As I understand it the same limitation is present there too: the index is 
> in a single sqlite table.
> 
>>> My more well-formed opinion is that we need to come up with a good
>>> design. It needs to be flexible enough to be able to grow (and maybe
>>> shrink), and I assume there would be some kind of background operation
>>> that will enable that. I also believe that making it hash based is the
>>> way to go. It looks like that the more complicated issue is here is
>>> how to handle the transition in which we shard buckets.
>> Yeah, I agree. I think the conflicting goals here are: we want a sorted 
>> list (so that it enables prefix scans for listing purposes) and we want to 
>> shard from the very beginning (the problem we are facing is that parallel 
>> writes updating the same bucket index object need to be 
>> serialized).
> 
> Given how infrequent container listings are, pre-sharding containers 
> across several objects makes some sense.  Paying the cost of doing 
> listings in parallel across N (where N is not too big) is not a big price 
> to pay. However, there will always need to be a way to re-shard further 
> when containers/buckets get extremely big.  Perhaps a starting point would 
> be support for static sharding where the number of shards is specified at 
> container/bucket creation time…
Considering the scope of the change, I also think this is a good starting point 
to make bucket index updates more scalable.
Yehuda,
What do you think?
> 
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html