Re: Long peering - throttle at FileStore::queue_transactions

2016-01-05 Thread Guang Yang
On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil <s...@newdream.net> wrote:
> On Mon, 4 Jan 2016, Guang Yang wrote:
>> Hi Cephers,
>> Happy New Year! I have a question regarding the long PG peering.
>>
>> Over the last several days I have been looking into the *long peering*
>> problem when we start an OSD / OSD host. What I observed was that the
>> two peering worker threads were throttled (stuck) when trying to
>> queue new transactions (writing the pg log), thus the peering process was
>> dramatically slowed down.
>>
>> The first question that came to me was: what were the transactions in the
>> queue? The major ones, as I saw, included:
>>
>> - The osd_map and incremental osd_map. This happens if the OSD had
>> been down for a while (in a large cluster), or when the cluster was
>> upgraded, which left the osd_map epoch the down OSD held far behind
>> the latest osd_map epoch. During OSD boot, it would need to
>> persist all those osd_maps and generate lots of filestore transactions
>> (linear with the epoch gap).
>> > As the PGs were not involved in most of those epochs, could we only take and 
>> > persist those osd_maps which matter to the PGs on the OSD?
>
> This part should happen before the OSD sends the MOSDBoot message, before
> anyone knows it exists.  There is a tunable threshold that controls how
> recent the map has to be before the OSD tries to boot.  If you're
> seeing this in the real world, we probably just need to adjust that value
> way down to something small(er).
It queues the transactions and then sends out MOSDBoot, thus
there is still a chance that they could contend with the peering
ops (especially on large clusters where there is lots of activity
generating many osdmap epochs). Any chance we could change
*queue_transactions* to *apply_transactions*, so that we block there
waiting for the osdmap to be persisted? At least we might be able to
do that during OSD booting. The concern is that if the OSD is active,
apply_transaction would take longer while holding the osd_lock..
I couldn't find such a tunable, could you elaborate? Thanks!
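To make the contention concrete, here is an illustrative sketch in plain Python (not FileStore code; all names are invented for the example) of the kind of byte-budget throttle described above: once queued-but-unapplied transactions exhaust the budget, further queue_transaction() callers block until the backend applies work and returns budget, which is how peering threads can stall behind a flood of osdmap writes.

```python
import threading

class TxThrottle:
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.cv = threading.Condition()

    def queue_transaction(self, nbytes):
        # Called by submitters (e.g. a peering thread writing pg log or
        # osdmap entries); blocks while the byte budget is exhausted.
        with self.cv:
            while self.used + nbytes > self.max_bytes:
                self.cv.wait()
            self.used += nbytes

    def apply_finished(self, nbytes):
        # Called by the backend once a transaction hits disk,
        # returning budget and waking blocked submitters.
        with self.cv:
            self.used -= nbytes
            self.cv.notify_all()

t = TxThrottle(max_bytes=100)
t.queue_transaction(80)   # fits within the budget
# A queue_transaction(40) here would block until apply_finished runs:
t.apply_finished(80)
t.queue_transaction(40)   # fits again
print(t.used)             # 40
```

The point of the sketch is only the blocking behaviour: whether the submitter blocks at queue time (as observed) or at apply time (as proposed) changes who pays the latency, not the total work.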
>
> sage
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Long peering - throttle at FileStore::queue_transactions

2016-01-04 Thread Guang Yang
Hi Cephers,
Happy New Year! I have a question regarding the long PG peering.

Over the last several days I have been looking into the *long peering*
problem when we start an OSD / OSD host. What I observed was that the
two peering worker threads were throttled (stuck) when trying to
queue new transactions (writing the pg log), thus the peering process was
dramatically slowed down.

The first question that came to me was: what were the transactions in the
queue? The major ones, as I saw, included:

- The osd_map and incremental osd_map. This happens if the OSD had
been down for a while (in a large cluster), or when the cluster was
upgraded, which left the osd_map epoch the down OSD held far behind
the latest osd_map epoch. During OSD boot, it would need to
persist all those osd_maps and generate lots of filestore transactions
(linear with the epoch gap).
> As the PGs were not involved in most of those epochs, could we only take and 
> persist those osd_maps which matter to the PGs on the OSD?

- Lots of deletion transactions. As a PG boots, it
needs to merge the PG log from its peers, and for each deletion PG log
entry, it needs to queue the deletion transaction immediately.
> Could we delay queueing these transactions until all PGs on the host are 
> peered?
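The deferral idea floated above can be sketched as follows. This is an assumption about one possible design, not existing Ceph code, and every name here is hypothetical: instead of queueing each deletion transaction as PG logs are merged, buffer them and flush only once every PG on the host reports peered.

```python
class DeferredDeletes:
    """Toy model: buffer deletion transactions until all PGs peer."""

    def __init__(self, pg_ids):
        self.waiting = set(pg_ids)   # PGs still peering
        self.pending = []            # buffered deletion transactions
        self.applied = []            # transactions handed to the store

    def queue_delete(self, txn):
        if self.waiting:
            self.pending.append(txn)  # defer while any PG still peers
        else:
            self.applied.append(txn)  # steady state: queue immediately

    def pg_peered(self, pg_id):
        self.waiting.discard(pg_id)
        if not self.waiting:          # last PG done: flush the backlog
            self.applied.extend(self.pending)
            self.pending.clear()

d = DeferredDeletes({"1.a", "1.b"})
d.queue_delete("rm obj1")
d.pg_peered("1.a")
d.queue_delete("rm obj2")
d.pg_peered("1.b")
print(d.applied)   # ['rm obj1', 'rm obj2']
```

The trade-off the sketch makes visible: deferred deletions cost extra disk space until peering completes, in exchange for keeping the transaction queue clear for peering work.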

Thanks,
Guang


Re: OSD data file are OSD logs

2016-01-04 Thread Guang Yang
Thanks Sam for the confirmation.

Thanks,
Guang

On Mon, Jan 4, 2016 at 3:59 PM, Samuel Just <sj...@redhat.com> wrote:
> IIRC, you are running giant.  I think that's the log rotate dangling
> fd bug (not fixed in giant since giant is eol).  Fixed upstream
> 8778ab3a1ced7fab07662248af0c773df759653d, firefly backport is
> b8e3f6e190809febf80af66415862e7c7e415214.
> -Sam
>
> On Mon, Jan 4, 2016 at 3:37 PM, Guang Yang <guan...@gmail.com> wrote:
>> Hi Cephers,
>> Before I open a tracker, I would like to check whether it is a known issue..
>>
>> On one of our clusters, there was an OSD crash during repair. The
>> crash happened after we issued a PG repair for inconsistent PGs, which
>> failed because the recorded file size (within the xattr) mismatched
>> the actual file size.
>>
>> The mismatch was caused by the fact that the content of the data file
>> is OSD log output; the following is from osd.354 on c003:
>>
>> -rw-r--r-- 1 yahoo root  75168 Jan  3 07:30
>> default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7
>> -bash-4.1$ head
>> "default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7"
>> 2016-01-03 07:30:01.600119 7f7fe2096700 15
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) getattrs
>> 3.171s7_head/a2478171/default.12061.9_8396947527_52ac8b3ec6_o.jpg/head//3/18446744073709551615/7
>> 2016-01-03 07:30:01.604967 7f7fe2096700 10
>> filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, len is 494
>> 2016-01-03 07:30:01.604984 7f7fe2096700 10
>> filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, got 247
>> 2016-01-03 07:30:01.604986 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.idtag'
>> 2016-01-03 07:30:01.604996 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_'
>> 2016-01-03 07:30:01.605007 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> 'snapset'
>> 2016-01-03 07:30:01.605013 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.manifest'
>> 2016-01-03 07:30:01.605026 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> 'hinfo_key'
>> 2016-01-03 07:30:01.605042 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.x-amz-meta-origin'
>> 2016-01-03 07:30:01.605049 7f7fe2096700 20
>> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
>> '_user.rgw.acl'
>>
>>
>> This only happens on clusters where we turned on verbose logging
>> (debug_osd/filestore=20). And we are running ceph v0.87.
>>
>> Thanks,
>> Guang


OSD data file are OSD logs

2016-01-04 Thread Guang Yang
Hi Cephers,
Before I open a tracker, I would like to check whether it is a known issue..

On one of our clusters, there was an OSD crash during repair. The
crash happened after we issued a PG repair for inconsistent PGs, which
failed because the recorded file size (within the xattr) mismatched
the actual file size.

The mismatch was caused by the fact that the content of the data file
is OSD log output; the following is from osd.354 on c003:

-rw-r--r-- 1 yahoo root  75168 Jan  3 07:30
default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7
-bash-4.1$ head
"default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7"
2016-01-03 07:30:01.600119 7f7fe2096700 15
filestore(/home/y/var/lib/ceph/osd/ceph-354) getattrs
3.171s7_head/a2478171/default.12061.9_8396947527_52ac8b3ec6_o.jpg/head//3/18446744073709551615/7
2016-01-03 07:30:01.604967 7f7fe2096700 10
filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, len is 494
2016-01-03 07:30:01.604984 7f7fe2096700 10
filestore(/home/y/var/lib/ceph/osd/ceph-354)  -ERANGE, got 247
2016-01-03 07:30:01.604986 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'_user.rgw.idtag'
2016-01-03 07:30:01.604996 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_'
2016-01-03 07:30:01.605007 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'snapset'
2016-01-03 07:30:01.605013 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'_user.rgw.manifest'
2016-01-03 07:30:01.605026 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'hinfo_key'
2016-01-03 07:30:01.605042 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'_user.rgw.x-amz-meta-origin'
2016-01-03 07:30:01.605049 7f7fe2096700 20
filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting
'_user.rgw.acl'


This only happens on clusters where we turned on verbose logging
(debug_osd/filestore=20). And we are running ceph v0.87.
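Sam's reply identifies this as the logrotate dangling-fd bug. The class of bug can be demonstrated in a few lines of plain Python (this is only an illustration of POSIX fd-number reuse, not the actual Ceph defect; file names are invented): if a component holds on to a raw descriptor number after the log fd is closed, the next open() reuses that lowest free number, and stale log writes land inside the newly opened data file.

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
log_path = os.path.join(tmp, "osd.log")
data_path = os.path.join(tmp, "object.data")

log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT)
stale = log_fd               # some component keeps the raw fd *number*
os.close(log_fd)             # log rotation closes the underlying fd

# POSIX open() returns the lowest free descriptor number, so the next
# open reuses the number the stale reference still points at:
data_fd = os.open(data_path, os.O_WRONLY | os.O_CREAT)
assert data_fd == stale

os.write(stale, b"2016-01-03 07:30:01 filestore getattrs ...\n")
os.close(data_fd)
with open(data_path, "rb") as f:
    print(f.read())          # log text ended up inside the data file
```

This also explains why the symptom only shows with debug_osd/filestore=20: without the heavy log traffic, nothing is written through the stale descriptor.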

Thanks,
Guang


Re: Newly added monitor infinitely sync store

2015-11-16 Thread Guang Yang
I spoke to a leveldb expert; it looks like this is a known pattern of
the LSM-tree data structure - the tail latency for a range scan can be far
longer than the avg/median since it may need to mmap several SST files
to get the record.

Hi Sage,
Do you see any harm in increasing the default value for this setting
(e.g. to 20 minutes)? Or should I add the advice to the monitor
troubleshooting docs?
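For reference, the tunable under discussion can be set in ceph.conf. A hedged sketch only: the option name is the mon_sync_timeout named elsewhere in this thread (default 60 seconds), and the 180s value merely illustrates the modest 2-3x bump Sage suggests in his reply, not a tested recommendation:

```ini
[mon]
# Per-round-trip timeout for monitor store sync; default is 60 seconds.
# 180 is purely illustrative (a 3x bump), not a validated value.
mon sync timeout = 180
```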

Thanks,
Guang

On Fri, Nov 13, 2015 at 9:07 PM, Guang Yang <guan...@gmail.com> wrote:
> Thanks Sage! I will definitely try those patches.
>
> For this one, I finally managed to bring the new monitor in by
> increasing the mon_sync_timeout from its default 60 to 6 to make
> sure the syncing does not restart and result in an infinite loop..
>
> On Fri, Nov 13, 2015 at 5:04 PM, Sage Weil <s...@newdream.net> wrote:
>> On Fri, 13 Nov 2015, Guang Yang wrote:
>>> Thanks Sage!
>>>
>>> On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil <s...@newdream.net> wrote:
>>> > On Fri, 13 Nov 2015, Guang Yang wrote:
>>> >> I was wrong in my previous analysis; it was not that the iterator got reset.
>>> >> The problem I can see now is that during syncing, a new round of
>>> >> election kicked off and thus it needed to probe the newly added
>>> >> monitor; however, since it hadn't been synced yet, it restarted the
>>> >> syncing from there.
>>> >
>>> > What version of this?  I think this is something we fixed a while back?
>>> This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578), is there
>>> a commit I can take a look?
>>
>> Hrm, I guess it was way before that.. I'm thinking of
>> b8af38b6fc161691d637631d9ce8ab84fb3d27c7 which was pre-firefly.  So I'm
>> not sure exactly why an election would be restarting the sync in your
>> case..
>>
>> You mentioned elsewhere that your mon store was very large, though (more
>> than 10's of GB), which suggests you might be hitting the
>> min_last_epoch_clean problem (which prevents osdmap trimming).. see
>> b41408302b6529a7856a3b0a08c35e5fa284882e.  This was backported to hammer
>> and firefly but not giant.
>>
>> sage
>>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Newly added monitor infinitely sync store

2015-11-16 Thread Guang Yang
On Mon, Nov 16, 2015 at 5:42 PM, Sage Weil <s...@newdream.net> wrote:
> On Mon, 16 Nov 2015, Guang Yang wrote:
>> I spoke to a leveldb expert; it looks like this is a known pattern of
>> the LSM-tree data structure - the tail latency for a range scan can be far
>> longer than the avg/median since it may need to mmap several SST files
>> to get the record.
>>
>> Hi Sage,
>> Do you see any harm in increasing the default value for this setting
>> (e.g. to 20 minutes)? Or should I add the advice to the monitor
>> troubleshooting docs?
>
> The timeout is just for a round trip for the sync process, right?  I think
> increasing it a bit (2x or 3x?) is okay, but 20 minutes to do a single
> chunk is a lot.
Yeah, the timeout is for a single round trip (there is a timeout reset
mechanism on both sides).
>
> The underlying problem in your cases is that your store is huge (by ~2
> orders of magnitude), so I'm not sure we should tune against that :)
Ok, let me apply the patches and monitor the db growth.
>
> sage
>
>
>> Thanks,
>> Guang
>>
>> On Fri, Nov 13, 2015 at 9:07 PM, Guang Yang <guan...@gmail.com> wrote:
>> > Thanks Sage! I will definitely try those patches.
>> >
>> > For this one, I finally managed to bring the new monitor in by
>> > increasing the mon_sync_timeout from its default 60 to 6 to make
>> > sure the syncing does not restart and result in an infinite loop..
>> >
>> > On Fri, Nov 13, 2015 at 5:04 PM, Sage Weil <s...@newdream.net> wrote:
>> >> On Fri, 13 Nov 2015, Guang Yang wrote:
>> >>> Thanks Sage!
>> >>>
>> >>> On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil <s...@newdream.net> wrote:
>> >>> > On Fri, 13 Nov 2015, Guang Yang wrote:
>> >>> >> I was wrong in my previous analysis; it was not that the iterator got reset.
>> >>> >> The problem I can see now is that during syncing, a new round of
>> >>> >> election kicked off and thus it needed to probe the newly added
>> >>> >> monitor; however, since it hadn't been synced yet, it restarted the
>> >>> >> syncing from there.
>> >>> >
>> >>> > What version of this?  I think this is something we fixed a while back?
>> >>> This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578), is there
>> >>> a commit I can take a look?
>> >>
>> >> Hrm, I guess it was way before that.. I'm thinking of
>> >> b8af38b6fc161691d637631d9ce8ab84fb3d27c7 which was pre-firefly.  So I'm
>> >> not sure exactly why an election would be restarting the sync in your
>> >> case..
>> >>
>> >> You mentioned elsewhere that your mon store was very large, though (more
>> >> than 10's of GB), which suggests you might be hitting the
>> >> min_last_epoch_clean problem (which prevents osdmap trimming).. see
>> >> b41408302b6529a7856a3b0a08c35e5fa284882e.  This was backported to hammer
>> >> and firefly but not giant.
>> >>
>> >> sage
>> >>
>>
>>


[no subject]

2015-11-13 Thread Guang Yang
Hi Joao,
We have a problem when trying to add new monitors to an
unhealthy cluster, on which I would like to ask for your suggestion.

After adding the new monitor, it started syncing the store and went
into an infinite loop:

2015-11-12 21:02:23.499510 7f1e8030e700 10
mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
cookie 4513071120 lc 14697737 bl 929616 bytes last_key
osdmap,full_22530) v2
2015-11-12 21:02:23.712944 7f1e8030e700 10
mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
cookie 4513071120 lc 14697737 bl 799897 bytes last_key
osdmap,full_3259) v2


We talked early in the morning on IRC, and at the time I thought it
was because the osdmap epoch was increasing, which led to this
infinite loop.

I then set the nobackfill/norecovery flags and the osdmap epoch
froze; however, the problem is still there.

While the osdmap epoch is 22531, the switch always happened at
osdmap.full_22530 (as shown by the above log).

Looking at the code on both sides, it looks like this check
(https://github.com/ceph/ceph/blob/master/src/mon/Monitor.cc#L1389)
is always true, and I can confirm from the log that (sp.last_commited <
paxos->get_version()) was false, so presumably the
sp.synchronizer always has a next chunk?
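The suspected non-termination can be shown with a toy model (this is a paraphrase of the idea behind the linked check, not Monitor.cc itself): the sync provider keeps sending chunks while the synchronizer still has keys to walk, so if cluster churn adds store keys at least as fast as chunks drain them, the sync never finishes.

```python
def chunks_to_finish(initial_keys, new_keys_per_chunk, max_chunks=1000):
    """Chunks sent before sync completes, or None if it never completes."""
    sent, remaining = 0, initial_keys
    while remaining > 0:                  # "synchronizer has a next chunk"
        if sent >= max_chunks:
            return None                   # the observed infinite loop
        sent += 1                         # provider ships one chunk
        remaining -= 1
        remaining += new_keys_per_chunk   # churn adds new store keys
    return sent

print(chunks_to_finish(10, 0))   # 10  (quiesced cluster: sync terminates)
print(chunks_to_finish(10, 1))   # None (>=1 new key per chunk: never ends)
```

This matches the observation above: freezing the osdmap epoch removes one source of new keys, but any other still-growing prefix would keep has_next_chunk true.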

Does this look familiar to you? Or is there any other troubleshooting I can try?
Thanks very much.

Thanks,
Guang


Re: Newly added monitor infinitely sync store

2015-11-13 Thread Guang Yang
Thanks Sage! I will definitely try those patches.

For this one, I finally managed to bring the new monitor in by
increasing the mon_sync_timeout from its default 60 to 6 to make
sure the syncing does not restart and result in an infinite loop..

On Fri, Nov 13, 2015 at 5:04 PM, Sage Weil <s...@newdream.net> wrote:
> On Fri, 13 Nov 2015, Guang Yang wrote:
>> Thanks Sage!
>>
>> On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil <s...@newdream.net> wrote:
>> > On Fri, 13 Nov 2015, Guang Yang wrote:
>> >> I was wrong in my previous analysis; it was not that the iterator got reset.
>> >> The problem I can see now is that during syncing, a new round of
>> >> election kicked off and thus it needed to probe the newly added
>> >> monitor; however, since it hadn't been synced yet, it restarted the
>> >> syncing from there.
>> >
>> > What version of this?  I think this is something we fixed a while back?
>> This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578), is there
>> a commit I can take a look?
>
> Hrm, I guess it was way before that.. I'm thinking of
> b8af38b6fc161691d637631d9ce8ab84fb3d27c7 which was pre-firefly.  So I'm
> not sure exactly why an election would be restarting the sync in your
> case..
>
> You mentioned elsewhere that your mon store was very large, though (more
> than 10's of GB), which suggests you might be hitting the
> min_last_epoch_clean problem (which prevents osdmap trimming).. see
> b41408302b6529a7856a3b0a08c35e5fa284882e.  This was backported to hammer
> and firefly but not giant.
>
> sage
>


Newly added monitor infinitely sync store

2015-11-13 Thread Guang Yang
I was wrong in my previous analysis; it was not that the iterator got reset.
The problem I can see now is that during syncing, a new round of
election kicked off and thus it needed to probe the newly added
monitor; however, since it hadn't been synced yet, it restarted the
syncing from there.

Hi Sage and Joao,
Is there a way to freeze the election via some tunable, to let the sync finish?

Thanks,
Guang

On Fri, Nov 13, 2015 at 9:00 AM, Guang Yang <guan...@gmail.com> wrote:
> Hi Joao,
> We have a problem when trying to add new monitors to an
> unhealthy cluster, on which I would like to ask for your suggestion.
>
> After adding the new monitor, it started syncing the store and went
> into an infinite loop:
>
> 2015-11-12 21:02:23.499510 7f1e8030e700 10
> mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
> cookie 4513071120 lc 14697737 bl 929616 bytes last_key
> osdmap,full_22530) v2
> 2015-11-12 21:02:23.712944 7f1e8030e700 10
> mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
> cookie 4513071120 lc 14697737 bl 799897 bytes last_key
> osdmap,full_3259) v2
>
>
> We talked early in the morning on IRC, and at the time I thought it
> was because the osdmap epoch was increasing, which led to this
> infinite loop.
>
> I then set the nobackfill/norecovery flags and the osdmap epoch
> froze; however, the problem is still there.
>
> While the osdmap epoch is 22531, the switch always happened at
> osdmap.full_22530 (as shown by the above log).
>
> Looking at the code on both sides, it looks like this check
> (https://github.com/ceph/ceph/blob/master/src/mon/Monitor.cc#L1389)
> is always true, and I can confirm from the log that (sp.last_commited <
> paxos->get_version()) was false, so presumably the
> sp.synchronizer always has a next chunk?
>
> Does this look familiar to you? Or is there any other troubleshooting I can try?
> Thanks very much.
>
> Thanks,
> Guang


Re: Newly added monitor infinitely sync store

2015-11-13 Thread Guang Yang
Thanks Sage!

On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil <s...@newdream.net> wrote:
> On Fri, 13 Nov 2015, Guang Yang wrote:
>> I was wrong in my previous analysis; it was not that the iterator got reset.
>> The problem I can see now is that during syncing, a new round of
>> election kicked off and thus it needed to probe the newly added
>> monitor; however, since it hadn't been synced yet, it restarted the
>> syncing from there.
>
> What version of this?  I think this is something we fixed a while back?
This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578), is there
a commit I can take a look?

>
>> Hi Sage and Joao,
>> Is there a way to freeze the election by some tunable to let the sync finish?
>
> We can't avoid doing elections when something is asking for one (e.g., a mon
> is down).
I see. Is there an operational workaround we could try? From
the log, I found the election was triggered by the accept timeout, so
I increased the timeout value to hopefully squeeze out elections during
syncing; does that sound like a workaround?
>
> sage
>
>
>
>>
>> Thanks,
>> Guang
>>
>> On Fri, Nov 13, 2015 at 9:00 AM, Guang Yang <guan...@gmail.com> wrote:
>> > Hi Joao,
>> > We have a problem when trying to add new monitors to an
>> > unhealthy cluster, on which I would like to ask for your suggestion.
>> >
>> > After adding the new monitor, it started syncing the store and went
>> > into an infinite loop:
>> >
>> > 2015-11-12 21:02:23.499510 7f1e8030e700 10
>> > mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
>> > cookie 4513071120 lc 14697737 bl 929616 bytes last_key
>> > osdmap,full_22530) v2
>> > 2015-11-12 21:02:23.712944 7f1e8030e700 10
>> > mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
>> > cookie 4513071120 lc 14697737 bl 799897 bytes last_key
>> > osdmap,full_3259) v2
>> >
>> >
>> > We talked early in the morning on IRC, and at the time I thought it
>> > was because the osdmap epoch was increasing, which led to this
>> > infinite loop.
>> >
>> > I then set the nobackfill/norecovery flags and the osdmap epoch
>> > froze; however, the problem is still there.
>> >
>> > While the osdmap epoch is 22531, the switch always happened at
>> > osdmap.full_22530 (as shown by the above log).
>> >
>> > Looking at the code on both sides, it looks like this check
>> > (https://github.com/ceph/ceph/blob/master/src/mon/Monitor.cc#L1389)
>> > is always true, and I can confirm from the log that (sp.last_commited <
>> > paxos->get_version()) was false, so presumably the
>> > sp.synchronizer always has a next chunk?
>> >
>> > Does this look familiar to you? Or is there any other troubleshooting I can try?
>> > Thanks very much.
>> >
>> > Thanks,
>> > Guang


Re: Symbolic links like feature on radosgw

2015-11-02 Thread Guang Yang
Hi Yehuda,
We have a user requirement for a symbolic-link-like feature on
radosgw: two object ids pointing to the same object (ideally it could
cross buckets, but the same bucket is fine).

The closest feature on Amazon S3 I could find is [1], but it is not
exactly the same; the one from the Amazon S3 API was designed for static
web-site hosting.

Is this a valid feature request we could put into radosgw? The way I am
thinking of implementing it is like a symbolic link: the link object just
contains a pointer to the original object.

 [1] http://docs.aws.amazon.com/AmazonS3/latest/dev/how-to-page-redirect.html
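The proposal above can be sketched in a few lines. This is purely illustrative (an assumption about the semantics, not radosgw code; all class and method names are invented): the "link" is a tiny head object whose payload is just the target key, and reads follow the pointer.

```python
class Store:
    """Toy object store illustrating symlink-like link objects."""

    def __init__(self):
        self.objects = {}   # (bucket, key) -> bytes, or a link dict

    def put(self, bucket, key, data):
        self.objects[(bucket, key)] = data

    def link(self, bucket, key, target_bucket, target_key):
        # Symlink-style: store a pointer, not a copy of the data.
        self.objects[(bucket, key)] = {"link": (target_bucket, target_key)}

    def get(self, bucket, key):
        obj = self.objects[(bucket, key)]
        if isinstance(obj, dict) and "link" in obj:
            return self.get(*obj["link"])   # follow the pointer
        return obj

s = Store()
s.put("b1", "orig.jpg", b"JPEG bytes")
s.link("b1", "alias.jpg", "b1", "orig.jpg")
print(s.get("b1", "alias.jpg") == s.get("b1", "orig.jpg"))  # True
```

Note the symlink semantics this implies: overwriting the target is visible through the link, and deleting the target leaves the link dangling, both of which a real design would need a policy for.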

--
Regards,
Guang


Re: Weekly performance meeting

2014-09-26 Thread Guang Yang

On Sep 26, 2014, at 9:12 PM, Mark Nelson mark.nel...@inktank.com wrote:

 On 09/25/2014 09:47 PM, Guang Yang wrote:
 Hi Sage,
 We are very interested to join (and contribute effort) as well. Following 
 are a list of issues we have particular interests:
  1 Large number of small files brings performance degradation, mostly due to 
 file system lookup (even worse with EC).
 
 Have you tried decreasing vfs_cache_pressure to retain dentries and inodes in 
 cache?  I've had good luck improve performance for medium sized IO workloads 
 doing this.
Yeah, we changed the setting from its default value of 100 to 20 and it turned 
out to improve the dentry/inode cache hit rate (we also tried setting it to 1 
but got OOM under some traffic patterns). Even with the setting change, given 
that the object size is several hundred KB, we still observed lookup misses 
which increase latency; this became worse when we turned to EC because: 1) there 
are more files on each system, and 2) the long tail determines the latency.
 
  2 The messenger uses too many threads, which burdens high-density 
 hardware (though I believe Haomai already has great progress here).
 
 Yes, the biggest thing on my personal wish list has been to move to a hybrid 
 threading/event processing model.
 
 
 Thanks,
 Guang
 
 On Sep 26, 2014, at 2:27 AM, Sage Weil sw...@redhat.com wrote:
 
 Hi everyone,
 
 A number of people have approached me about how to get more involved with
 the current work on improving performance and how to better coordinate
 with other interested parties.  A few meetings have taken place offline
 with good results but only a few interested parties were involved.
 
 Ideally, we'd like to move as much of this discussion into the public 
 forums: ceph-devel@vger.kernel.org and #ceph-devel.  That isn't always
 sufficient, however.  I'd like to also set up a regular weekly meeting
 using google hangouts or bluejeans so that all interested parties can
 share progress.  There are a lot of things we can do during the Hammer
 cycle to improve things but it will require some coordination of effort.
 
 Among other things, we can discuss:
 
 - observed performance limitations
 - high level strategies for addressing them
 - proposed patch sets and their performance impact
 - anything else that will move us forward
 
 One challenge is timezones: there are developers in the US, China, Europe,
 and Israel who may want to join.  As a starting point, how about next
 Wednesday, 15:00 UTC?  If I didn't do my tz math wrong, that's
 
  8:00 (PDT, California)
 15:00 (UTC)
 18:00 (IDT, Israel)
 23:00 (CST, China)
 
 That is surely not the ideal time for everyone but it can hopefully be a
 starting point.
 
 I've also created an etherpad for collecting discussion/agenda items at
 
 http://pad.ceph.com/p/performance_weekly
 
 Is there interest here?  Please let everyone know if you are actively
 working in this area and/or would like to join, and update the pad above
 with the topics you would like to discuss.
 
 Thanks!
 sage


Re: Weekly performance meeting

2014-09-25 Thread Guang Yang
Hi Sage,
We are very interested in joining (and contributing effort) as well. Following is 
a list of issues we have particular interest in:
 1 Large number of small files brings performance degradation, mostly due to file 
system lookup (even worse with EC).
 2 The messenger uses too many threads, which burdens high-density 
hardware (though I believe Haomai already has great progress here).

Thanks,
Guang

On Sep 26, 2014, at 2:27 AM, Sage Weil sw...@redhat.com wrote:

 Hi everyone,
 
 A number of people have approached me about how to get more involved with 
 the current work on improving performance and how to better coordinate 
 with other interested parties.  A few meetings have taken place offline 
 with good results but only a few interested parties were involved.
 
 Ideally, we'd like to move as much of this discussion into the public 
 forums: ceph-devel@vger.kernel.org and #ceph-devel.  That isn't always 
 sufficient, however.  I'd like to also set up a regular weekly meeting 
 using google hangouts or bluejeans so that all interested parties can 
 share progress.  There are a lot of things we can do during the Hammer 
 cycle to improve things but it will require some coordination of effort.
 
 Among other things, we can discuss:
 
 - observed performance limitations
 - high level strategies for addressing them
 - proposed patch sets and their performance impact
 - anything else that will move us forward
 
 One challenge is timezones: there are developers in the US, China, Europe, 
 and Israel who may want to join.  As a starting point, how about next 
 Wednesday, 15:00 UTC?  If I didn't do my tz math wrong, that's
 
  8:00 (PDT, California)
 15:00 (UTC)
 18:00 (IDT, Israel)
 23:00 (CST, China)
 
 That is surely not the ideal time for everyone but it can hopefully be a 
 starting point.
 
 I've also created an etherpad for collecting discussion/agenda items at
 
   http://pad.ceph.com/p/performance_weekly
 
 Is there interest here?  Please let everyone know if you are actively 
 working in this area and/or would like to join, and update the pad above 
 with the topics you would like to discuss.
 
 Thanks!
 sage


RGW threads hung - more logs

2014-09-11 Thread Guang Yang
Hi Sage, Sam and Greg,
Regarding the radosgw hang issue we discussed today, I finally got some more 
logs showing that the reply message had been received by radosgw, but failed to 
be dispatched as the dispatcher thread was hung. I put all the logs into the 
tracker - http://tracker.ceph.com/issues/9008

While the logs explain what we observed, I failed to find any clue as to why the 
dispatcher would need to wait for objecter_bytes throttler budget; did I miss 
anything obvious here?

Tracker link - http://tracker.ceph.com/issues/9008

Thanks,
Guang


Issue - 8907

2014-08-28 Thread Guang Yang
Hi Loic,
Can you help take a quick look at this issue - 
http://tracker.ceph.com/issues/8907? Was it a design choice due to a consistency 
concern?

Thanks,
Guang


Re: OSD suicide after being down/in for one day as it needs to search large amount of objects

2014-08-20 Thread Guang Yang
Thanks Greg.
On Aug 20, 2014, at 6:09 AM, Gregory Farnum g...@inktank.com wrote:

 On Mon, Aug 18, 2014 at 11:30 PM, Guang Yang yguan...@outlook.com wrote:
 Hi ceph-devel,
 David (cc’ed) reported a bug (http://tracker.ceph.com/issues/9128) which we 
 came across in our test cluster during our failure testing, basically the 
 way to reproduce it was to leave one OSD daemon down and in for a day, at 
 the same time, keep giving write traffic. When the OSD daemon was started 
 again, it hit suicide timeout and kill itself.
 
 After some analysis (details in the bug), David found that the op thread was 
 busy searching for missing objects and once the volume to search increase, 
 the thread is expected to work that long time, please refer to the bug for 
 detailed logs.
 
 Can you talk a little more about what's going on here? At a quick
 naive glance, I'm not seeing why leaving an OSD down and in should
 require work based on the amount of write traffic. Perhaps if the rest
 of the cluster was changing mappings…?
We increased the down-to-out interval from 5 minutes to 2 days to avoid 
migrating data back and forth (which could increase latency), with the 
intention of marking OSDs out manually. To test that, we exercised some 
boundary cases, leaving the OSD down and in for about a day; however, when we 
tried to bring it up again, it always failed by hitting the suicide timeout.
 
 
 One simple fix is to let the op thread reset the suicide timeout 
 periodically when it is doing long-time work, other fix might be to cut the 
 work into smaller pieces?
 
 We do both of those things throughout the OSD (although I think the
 first is simpler and more common); search for the accesses to
 cct-get_heartbeat_map()-reset_timeout.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 



Re: OSD suicide after being down/in for one day as it needs to search large amount of objects

2014-08-20 Thread Guang Yang
Thanks Sage. We will provide a patch based on this.

Thanks,
Guang

On Aug 20, 2014, at 11:19 PM, Sage Weil sw...@redhat.com wrote:

 On Wed, 20 Aug 2014, Guang Yang wrote:
 Thanks Greg.
 On Aug 20, 2014, at 6:09 AM, Gregory Farnum g...@inktank.com wrote:
 
 On Mon, Aug 18, 2014 at 11:30 PM, Guang Yang yguan...@outlook.com wrote:
 Hi ceph-devel,
 David (cc'ed) reported a bug (http://tracker.ceph.com/issues/9128) which 
 we came across in our test cluster during our failure testing, basically 
 the way to reproduce it was to leave one OSD daemon down and in for a day, 
 at the same time, keep giving write traffic. When the OSD daemon was 
 started again, it hit suicide timeout and kill itself.
 
 After some analysis (details in the bug), David found that the op thread 
 was busy searching for missing objects and once the volume to search 
 increase, the thread is expected to work that long time, please refer to 
 the bug for detailed logs.
 
 Can you talk a little more about what's going on here? At a quick
 naive glance, I'm not seeing why leaving an OSD down and in should
 require work based on the amount of write traffic. Perhaps if the rest
 of the cluster was changing mappings…?
 We increased the down to out time interval from 5 minutes to 2 days to 
 avoid migrating data back and forth which could increase latency, so 
 that we target to mark OSD out manually. To achieve such, we are testing 
 against some boundary cases to let the OSD down and in for like 1 day, 
 however, when we try to bring it up again, it always failed due to hit 
 the suicide timeout.
 
 Looking at the log snippet I see the PG had log range
 
   5481'28667,5646'34066
 
 Which is ~5500 log events.  The default max is 10k.  search_for_missing is 
 basically going to iterate over this list and check if the object is 
 present locally.
 
 If that's slow enough to trigger a suicide (which it seems to be), the 
 fix is simple: as Greg says we just need to make it probe the internal 
 heartbeat code to indicate progress.  In most contexts this is done by 
 passing a ThreadPool::TPHandle handle into each method and then 
 calling handle.reset_tp_timeout() on each iteration.  The same needs to be 
 done for search_for_missing...
 
 sage
 
 



OSD suicide after being down/in for one day as it needs to search large amount of objects

2014-08-19 Thread Guang Yang
Hi ceph-devel,
David (cc’ed) reported a bug (http://tracker.ceph.com/issues/9128) which we 
came across in our test cluster during failure testing. Basically, the way to 
reproduce it was to leave one OSD daemon down and in for a day while keeping 
write traffic going. When the OSD daemon was started again, it hit the suicide 
timeout and killed itself.

After some analysis (details in the bug), David found that the op thread was 
busy searching for missing objects, and as the volume to search grows, the 
thread is expected to work for that long; please refer to the bug for detailed 
logs.

One simple fix is to let the op thread reset the suicide timeout periodically 
when it is doing long-running work; another fix might be to cut the work into 
smaller pieces.

Any suggestion is welcome.

Thanks,
Guang


Re: assert failure

2014-08-17 Thread Guang Yang
Hi Huamin,
Then it might be a totally different issue from the one I mentioned below; 
please file a bug at http://tracker.ceph.com/ with more details (the log 
before the daemon crashed).

Thanks,
Guang

On Aug 16, 2014, at 5:36 AM, Huamin Chen hc...@redhat.com wrote:

 Thanks. I was running a single node ceph fs cluster on a VM. Each time the VM 
 is created, it downloads the latest bits and runs unit tests. There are many 
 mount and unmount during the tests.
 This issue can be reliably reproduced in one of these tests.
 
 The test info can be found 
 
 
 - Original Message -
 From: Guang Yang yguan...@outlook.com
 To: Huamin Chen hc...@redhat.com
 Cc: Ceph-devel ceph-devel@vger.kernel.org
 Sent: Friday, August 15, 2014 2:23:12 PM
 Subject: Re: assert failure
 
 + ceph-devel.
 
 Hi Huamin,
 Did you upgrade the entire cluster to v0.80.5? If I remember correctly, if 
 its peer has the old version, it could crash the new version as well.
 
 Thanks,
 Guang
 
 On Aug 14, 2014, at 11:21 PM, Huamin Chen hc...@redhat.com wrote:
 
 Bad news, still there ...
 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7f30c4511700 time 
 2014-08-14 15:16:44.659312
 msg/Pipe.cc: 1080: FAILED assert(m)
 ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
 1: (Pipe::connect()+0x3d0c) [0x7f327552a2ac]
 2: (Pipe::writer()+0x9f3) [0x7f327552aff3]
 3: (Pipe::Writer::entry()+0xd) [0x7f327553748d]
 4: (()+0x79d1) [0x7f32953449d1]
 5: (clone()+0x6d) [0x7f3294c89b5d]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
 interpret this.
 terminate called after throwing an instance of 'ceph::FailedAssertion'
 
 Attached please find all related logs
 
 - Original Message -
 From: Guang Yang yguan...@outlook.com
 To: Huamin Chen hc...@redhat.com
 Cc: ceph-devel@vger.kernel.org
 Sent: Wednesday, August 13, 2014 10:39:10 PM
 Subject: Re: assert failure
 
 Hi Huamin,
 At least one known issue in 0.80.1 with the same failing pattern has been 
 fixed in the latest 0.80.4 release of firefly. Here is the tracking ticket - 
 http://tracker.ceph.com/issues/8232.
 
 Can you compare the log snippets from within the bug and see if they are the 
 same issue?
 
 Thanks,
 Guang
 
 On Aug 14, 2014, at 4:29 AM, Huamin Chen hc...@redhat.com wrote:
 
 Is the following assert failure a known issue?
 
 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7fed3d2dd700 time 
 2014-08-13 16:26:06.039799
 msg/Pipe.cc: 1070: FAILED assert(m)
 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (Pipe::connect()+0x390e) [0x7feee89cf99e]
 2: (Pipe::writer()+0x511) [0x7feee89d0fd1]
 3: (Pipe::Writer::entry()+0xd) [0x7feee89d5d0d]
 4: (()+0x7df3) [0x7fef336cadf3]
 5: (clone()+0x6d) [0x7fef32fe63dd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
 interpret this.
 
 
 ceph-error-log.tgz
 
 



Re: assert failure

2014-08-15 Thread Guang Yang
+ ceph-devel.

Hi Huamin,
Did you upgrade the entire cluster to v0.80.5? If I remember correctly, if its 
peer has the old version, it could crash the new version as well.

Thanks,
Guang

On Aug 14, 2014, at 11:21 PM, Huamin Chen hc...@redhat.com wrote:

 Bad news, still there ...
 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7f30c4511700 time 
 2014-08-14 15:16:44.659312
 msg/Pipe.cc: 1080: FAILED assert(m)
 ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
 1: (Pipe::connect()+0x3d0c) [0x7f327552a2ac]
 2: (Pipe::writer()+0x9f3) [0x7f327552aff3]
 3: (Pipe::Writer::entry()+0xd) [0x7f327553748d]
 4: (()+0x79d1) [0x7f32953449d1]
 5: (clone()+0x6d) [0x7f3294c89b5d]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
 interpret this.
 terminate called after throwing an instance of 'ceph::FailedAssertion'
 
 Attached please find all related logs
 
 - Original Message -
 From: Guang Yang yguan...@outlook.com
 To: Huamin Chen hc...@redhat.com
 Cc: ceph-devel@vger.kernel.org
 Sent: Wednesday, August 13, 2014 10:39:10 PM
 Subject: Re: assert failure
 
 Hi Huamin,
 At least one known issue in 0.80.1 with the same failing pattern has been 
 fixed in the latest 0.80.4 release of firefly. Here is the tracking ticket - 
 http://tracker.ceph.com/issues/8232.
 
 Can you compare the log snippets from within the bug and see if they are the 
 same issue?
 
 Thanks,
 Guang
 
 On Aug 14, 2014, at 4:29 AM, Huamin Chen hc...@redhat.com wrote:
 
  Is the following assert failure a known issue?
 
 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7fed3d2dd700 time 
 2014-08-13 16:26:06.039799
 msg/Pipe.cc: 1070: FAILED assert(m)
 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (Pipe::connect()+0x390e) [0x7feee89cf99e]
 2: (Pipe::writer()+0x511) [0x7feee89d0fd1]
 3: (Pipe::Writer::entry()+0xd) [0x7feee89d5d0d]
 4: (()+0x7df3) [0x7fef336cadf3]
 5: (clone()+0x6d) [0x7fef32fe63dd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
 interpret this.
 
 
 ceph-error-log.tgz



OSD disk replacement best practise

2014-08-14 Thread Guang Yang
Hi cephers,
Most recently I have been drafting the run books for OSD disk replacement. I 
think the rule of thumb is to reduce data migration (recovery/backfill), and 
the following procedure should achieve that purpose:
  1. ceph osd out osd.XXX (mark it out to trigger data migration)
  2. ceph osd rm osd.XXX
  3. ceph auth rm osd.XXX
  4. provision a new OSD which will take XXX as the OSD id and migrate data 
back.

With the above procedure, the crush weight of the host never changes, so we 
can limit the data migration to only what is necessary.

Does it make sense?

Thanks,
Guang


Re: assert failure

2014-08-13 Thread Guang Yang
Hi Huamin,
At least one known issue in 0.80.1 with the same failing pattern has been fixed 
in the latest 0.80.4 release of firefly. Here is the tracking ticket - 
http://tracker.ceph.com/issues/8232.

Can you compare the log snippets from within the bug and see if they are the 
same issue?

Thanks,
Guang

On Aug 14, 2014, at 4:29 AM, Huamin Chen hc...@redhat.com wrote:

 Is the following assert failure a known issue?
 
 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7fed3d2dd700 time 
 2014-08-13 16:26:06.039799
 msg/Pipe.cc: 1070: FAILED assert(m)
 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (Pipe::connect()+0x390e) [0x7feee89cf99e]
 2: (Pipe::writer()+0x511) [0x7feee89d0fd1]
 3: (Pipe::Writer::entry()+0xd) [0x7feee89d5d0d]
 4: (()+0x7df3) [0x7fef336cadf3]
 5: (clone()+0x6d) [0x7fef32fe63dd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
 interpret this.
 



Re: bucket index sharding - IO throttle

2014-08-12 Thread Guang Yang
Hi Yehuda,
Can you help review the latest patch with the throttle mechanism you 
suggested? Thanks!

Thanks,
Guang
On Aug 4, 2014, at 3:20 PM, Guang Yang yguan...@outlook.com wrote:

 Hi Yehuda,
 Here is the new pull request - https://github.com/ceph/ceph/pull/2187
 
 Thanks,
 Guang
 On Jul 31, 2014, at 10:40 PM, Guang Yang yguan...@outlook.com wrote:
 
 Thanks Yehuda. I will do that (sorry I was occupied by some other stuff 
 recently but I will try my best to provide a patch as soon as possible).
 
 Thanks,
 Guang
 
  On Jul 31, 2014, at 1:00 AM, Yehuda Sadeh yeh...@inktank.com wrote:
 
 Can you send this code through a github pull request (or at least as a
  patch)? It'll be easier to review and comment.
 
 Thanks,
 Yehuda
 
 On Wed, Jul 30, 2014 at 7:58 AM, Guang Yang yguan...@outlook.com wrote:
 +ceph-devel.
 
 Thanks,
 Guang
 
 On Jul 29, 2014, at 10:20 PM, Guang Yang yguan...@outlook.com wrote:
 
 Hi Yehuda,
  Per your review comment regarding IO throttling for bucket index 
  operations, I prototyped the code below (details still need polish); 
  can you take a look at whether this is the right way to go?
  
  Another problem I came across is that 
  ClsBucketIndexOpCtx::handle_completion was not called for the bucket 
  index init op (below); is there anything I missed obviously here?
 
 Thanks,
 Guang
 
 
  class ClsBucketIndexAioThrottler {
  protected:
    int completed;
    int ret_code;
    IoCtx io_ctx;
    Mutex lock;
    struct LockCond {
      Mutex lock;
      Cond cond;
      LockCond() : lock("LockCond"), cond() {}
    } lock_cond;
  public:
    ClsBucketIndexAioThrottler(IoCtx& _io_ctx)
      : completed(0), ret_code(0), io_ctx(_io_ctx),
        lock("ClsBucketIndexAioThrottler"), lock_cond() {}

    virtual ~ClsBucketIndexAioThrottler() {}
    virtual void do_next() = 0;
    virtual bool is_completed() = 0;

    void complete(int ret) {
      {
        Mutex::Locker l(lock);
        if (ret < 0)
          ret_code = ret;
        ++completed;
      }

      lock_cond.lock.Lock();
      lock_cond.cond.Signal();
      lock_cond.lock.Unlock();
    }

    int get_ret_code() {
      Mutex::Locker l(lock);
      return ret_code;
    }

    virtual int wait_completion() {
      lock_cond.lock.Lock();
      while (1) {
        if (is_completed()) {
          lock_cond.lock.Unlock();
          return ret_code;
        }
        lock_cond.cond.Wait(lock_cond.lock);  // Wait() re-acquires the lock
      }
    }
  };

  class ClsBucketIndexListAioThrottler : public ClsBucketIndexAioThrottler {
  protected:
    vector<string> bucket_objects;
    vector<string>::iterator iter_pos;
  public:
    ClsBucketIndexListAioThrottler(IoCtx& _io_ctx, const vector<string>& _bucket_objs)
      : ClsBucketIndexAioThrottler(_io_ctx), bucket_objects(_bucket_objs),
        iter_pos(bucket_objects.begin()) {}

    virtual bool is_completed() {
      Mutex::Locker l(lock);
      int sent = 0;
      vector<string>::iterator iter = bucket_objects.begin();
      for (; iter != iter_pos; ++iter) ++sent;

      return (sent == completed &&
          (iter_pos == bucket_objects.end() /*Success*/ || ret_code < 0 /*Failure*/));
    }
  };

  template<typename T>
  class ClsBucketIndexOpCtx : public ObjectOperationCompletion {
  private:
    T* data;
    // Return code of the operation
    int* ret_code;

    // The Aio completion object associated with this Op; it should
    // be released from within the completion handler
    librados::AioCompletion* completion;
    ClsBucketIndexAioThrottler* throttler;
  public:
    ClsBucketIndexOpCtx(T* _data, int* _ret_code, librados::AioCompletion* _completion,
        ClsBucketIndexAioThrottler* _throttler)
      : data(_data), ret_code(_ret_code), completion(_completion),
        throttler(_throttler) {}
    ~ClsBucketIndexOpCtx() {}

    // The completion callback, fill the response data
    void handle_completion(int r, bufferlist& outbl) {
      if (r >= 0) {
        if (data) {
          try {
            bufferlist::iterator iter = outbl.begin();
            ::decode((*data), iter);
          } catch (buffer::error& err) {
            r = -EIO;
          }
        }
        // Do the next request
      }
      throttler->do_next();
      throttler->complete(r);
      if (completion) {
        completion->release();
      }
    }
  };

  class ClsBucketIndexInitAioThrottler : public ClsBucketIndexListAioThrottler {
  public:
    ClsBucketIndexInitAioThrottler(IoCtx& _io_ctx, const vector<string>& _bucket_objs) :
      ClsBucketIndexListAioThrottler(_io_ctx, _bucket_objs) {}

    virtual void do_next() {
      string oid;
      {
        Mutex::Locker l(lock);
        if (iter_pos == bucket_objects.end())
          return;
        oid = *(iter_pos++);
      }
      AioCompletion* c = librados::Rados::aio_create_completion(NULL, NULL, NULL);
      // Dummy
      bufferlist in;
      librados::ObjectWriteOperation op;
      op.create(true);
      op.exec("rgw", "bucket_init_index", in,
          new ClsBucketIndexOpCtx<int>(NULL, NULL, c, this));
      io_ctx.aio_operate(oid, c, op, NULL);
    }
  };

  int cls_rgw_bucket_index_init_op(librados::IoCtx& io_ctx,
      const vector<string>& bucket_objs, uint32_t max_aio)
  {
    vector<string>::const_iterator iter = bucket_objs.begin();
    bufferlist in;
    ClsBucketIndexAioThrottler* throttler =
        new ClsBucketIndexInitAioThrottler(io_ctx, bucket_objs);
    for (; iter != bucket_objs.end() && max_aio-- > 0; ++iter) {
      throttler->do_next

Re: bucket index sharding - IO throttle

2014-08-06 Thread Guang Yang
Hi Osier,
I doubt the issue is related (the error message indicates a connection 
failure). The patch below is pretty simple (and incomplete); what it does is 
add a configuration field to the bucket meta info so that we can configure the 
number of shards on a per-bucket basis (again, setting it is not included in 
the patch).

The patch should be completely backward compatible, meaning that if you don't 
change the number-of-shards configuration, nothing changes for bucket 
creation/listing.

My plan is to use this patch as a starting point for review, since it includes 
the key building blocks; once it passes review, I will create a series of 
follow-up patches to implement the feature completely (mostly done already in 
the previous big patch - https://github.com/ceph/ceph/pull/2013).

I tested the patch locally in my cluster and it looks good for bucket creation.

Thanks,
Guang

On Aug 6, 2014, at 12:38 PM, Osier Yang os...@yunify.com wrote:

 
 On 2014年08月04日 15:20, Guang Yang wrote:
 Hi Yehuda,
 Here is the new pull request - https://github.com/ceph/ceph/pull/2187
 
 I simply applied the patch on top of git master, and testing shows that
 rest-bench is completely broken with the 2 patches:
 
 
 root@testing-s3gw0:~/s3-tests# /usr/bin/rest-bench
 --api-host=testing-s3gw0 --access-key=93EEF3F5O7VY89Q2GSWC
 --secret=lf2bwxiRf1e9/nrOTCZyN/HgTqCz7XwrB2LDocY1 --protocol=http
 --uri_style=path --bucket=cool0 --seconds=20 --concurrent-ios=50
 --block-size=204800 --show-time write
 host=testing-s3gw0
 2014-08-06 12:28:56.500235 7f1336645780 -1 did not load config file,
 using default settings.
 ERROR: failed to create bucket: ConnectionFailed
 failed initializing benchmark
 
 The related debug log entry:
 
 2014-08-06 12:29:48.137559 7fea62fcd700 20 state for
 obj=.rgw:.bucket.meta.rest-bench-bucket:default.9738.2 is not atomic,
 not appending atomic test
 
 After a short time, all the memory was eaten up:
 
 root@testing-s3gw0:~/s3-tests# /usr/bin/rest-bench
 --api-host=testing-s3gw0 --access-key=93EEF3F5O7VY89Q2GSWC
 --secret=lf2bwxiRf1e9/nrOTCZyN/HgTqCz7XwrB2LDocY1 --protocol=http
 --uri_style=path --seconds=20 --concurrent-ios=50 --block-size=204800
 --show-time write
 -bash: fork: Cannot allocate memory
 root@testing-s3gw0:~/s3-tests# /usr/bin/rest-bench
 --api-host=testing-s3gw0 --access-key=93EEF3F5O7VY89Q2GSWC
 --secret=lf2bwxiRf1e9/nrOTCZyN/HgTqCz7XwrB2LDocY1 --protocol=http
 --uri_style=path --seconds=20 --concurrent-ios=50 --block-size=204800
 --show-time write
 -bash: fork: Cannot allocate memory
 root@testing-s3gw0:~/s3-tests# free
 -bash: fork: Cannot allocate memory
 
 A few minutes later, the VM was completely unresponsive, and I had to
 destroy it and start over.
 
 Guang, how was your testing when you created the patches?
 
 
 
 Thanks,
 Guang
 On Jul 31, 2014, at 10:40 PM, Guang Yang yguan...@outlook.com wrote:
 
 Thanks Yehuda. I will do that (sorry I was occupied by some other stuff 
 recently but I will try my best to provide a patch as soon as possible).
 
 Thanks,
 Guang
 
  On Jul 31, 2014, at 1:00 AM, Yehuda Sadeh yeh...@inktank.com wrote:
 
 Can you send this code through a github pull request (or at least as a
  patch)? It'll be easier to review and comment.
 
 Thanks,
 Yehuda
 
 On Wed, Jul 30, 2014 at 7:58 AM, Guang Yang yguan...@outlook.com wrote:
 +ceph-devel.
 
 Thanks,
 Guang
 
 On Jul 29, 2014, at 10:20 PM, Guang Yang yguan...@outlook.com wrote:
 
 Hi Yehuda,
  Per your review comment regarding IO throttling for bucket index 
  operations, I prototyped the code below (details still need polish); 
  can you take a look at whether this is the right way to go?
  
  Another problem I came across is that 
  ClsBucketIndexOpCtx::handle_completion was not called for the bucket 
  index init op (below); is there anything I missed obviously here?
 
 Thanks,
 Guang
 
 
  class ClsBucketIndexAioThrottler {
  protected:
    int completed;
    int ret_code;
    IoCtx io_ctx;
    Mutex lock;
    struct LockCond {
      Mutex lock;
      Cond cond;
      LockCond() : lock("LockCond"), cond() {}
    } lock_cond;
  public:
    ClsBucketIndexAioThrottler(IoCtx& _io_ctx)
      : completed(0), ret_code(0), io_ctx(_io_ctx),
        lock("ClsBucketIndexAioThrottler"), lock_cond() {}

    virtual ~ClsBucketIndexAioThrottler() {}
    virtual void do_next() = 0;
    virtual bool is_completed() = 0;

    void complete(int ret) {
      {
        Mutex::Locker l(lock);
        if (ret < 0)
          ret_code = ret;
        ++completed;
      }

      lock_cond.lock.Lock();
      lock_cond.cond.Signal();
      lock_cond.lock.Unlock();
    }

    int get_ret_code() {
      Mutex::Locker l(lock);
      return ret_code;
    }

    virtual int wait_completion() {
      lock_cond.lock.Lock();
      while (1) {
        if (is_completed()) {
          lock_cond.lock.Unlock();
          return ret_code;
        }
        lock_cond.cond.Wait(lock_cond.lock);  // Wait() re-acquires the lock
      }
    }
  };

  class ClsBucketIndexListAioThrottler : public ClsBucketIndexAioThrottler {
  protected:
    vector<string> bucket_objects;
    vector<string>::iterator iter_pos;
  public:
    ClsBucketIndexListAioThrottler(IoCtx& _io_ctx, const

Re: bucket index sharding - IO throttle

2014-08-04 Thread Guang Yang
Hi Yehuda,
Here is the new pull request - https://github.com/ceph/ceph/pull/2187

Thanks,
Guang
On Jul 31, 2014, at 10:40 PM, Guang Yang yguan...@outlook.com wrote:

 Thanks Yehuda. I will do that (sorry I was occupied by some other stuff 
 recently but I will try my best to provide a patch as soon as possible).
 
 Thanks,
 Guang
 
  On Jul 31, 2014, at 1:00 AM, Yehuda Sadeh yeh...@inktank.com wrote:
 
 Can you send this code through a github pull request (or at least as a
  patch)? It'll be easier to review and comment.
 
 Thanks,
 Yehuda
 
 On Wed, Jul 30, 2014 at 7:58 AM, Guang Yang yguan...@outlook.com wrote:
 +ceph-devel.
 
 Thanks,
 Guang
 
 On Jul 29, 2014, at 10:20 PM, Guang Yang yguan...@outlook.com wrote:
 
 Hi Yehuda,
  Per your review comment regarding IO throttling for bucket index 
  operations, I prototyped the code below (details still need polish); can 
  you take a look at whether this is the right way to go?
  
  Another problem I came across is that 
  ClsBucketIndexOpCtx::handle_completion was not called for the bucket index 
  init op (below); is there anything I missed obviously here?
 
 Thanks,
 Guang
 
 
  class ClsBucketIndexAioThrottler {
  protected:
    int completed;
    int ret_code;
    IoCtx io_ctx;
    Mutex lock;
    struct LockCond {
      Mutex lock;
      Cond cond;
      LockCond() : lock("LockCond"), cond() {}
    } lock_cond;
  public:
    ClsBucketIndexAioThrottler(IoCtx& _io_ctx)
      : completed(0), ret_code(0), io_ctx(_io_ctx),
        lock("ClsBucketIndexAioThrottler"), lock_cond() {}

    virtual ~ClsBucketIndexAioThrottler() {}
    virtual void do_next() = 0;
    virtual bool is_completed() = 0;

    void complete(int ret) {
      {
        Mutex::Locker l(lock);
        if (ret < 0)
          ret_code = ret;
        ++completed;
      }

      lock_cond.lock.Lock();
      lock_cond.cond.Signal();
      lock_cond.lock.Unlock();
    }

    int get_ret_code() {
      Mutex::Locker l(lock);
      return ret_code;
    }

    virtual int wait_completion() {
      lock_cond.lock.Lock();
      while (1) {
        if (is_completed()) {
          lock_cond.lock.Unlock();
          return ret_code;
        }
        lock_cond.cond.Wait(lock_cond.lock);  // Wait() re-acquires the lock
      }
    }
  };

  class ClsBucketIndexListAioThrottler : public ClsBucketIndexAioThrottler {
  protected:
    vector<string> bucket_objects;
    vector<string>::iterator iter_pos;
  public:
    ClsBucketIndexListAioThrottler(IoCtx& _io_ctx, const vector<string>& _bucket_objs)
      : ClsBucketIndexAioThrottler(_io_ctx), bucket_objects(_bucket_objs),
        iter_pos(bucket_objects.begin()) {}

    virtual bool is_completed() {
      Mutex::Locker l(lock);
      int sent = 0;
      vector<string>::iterator iter = bucket_objects.begin();
      for (; iter != iter_pos; ++iter) ++sent;

      return (sent == completed &&
          (iter_pos == bucket_objects.end() /*Success*/ || ret_code < 0 /*Failure*/));
    }
  };

  template<typename T>
  class ClsBucketIndexOpCtx : public ObjectOperationCompletion {
  private:
    T* data;
    // Return code of the operation
    int* ret_code;

    // The Aio completion object associated with this Op; it should
    // be released from within the completion handler
    librados::AioCompletion* completion;
    ClsBucketIndexAioThrottler* throttler;
  public:
    ClsBucketIndexOpCtx(T* _data, int* _ret_code, librados::AioCompletion* _completion,
        ClsBucketIndexAioThrottler* _throttler)
      : data(_data), ret_code(_ret_code), completion(_completion),
        throttler(_throttler) {}
    ~ClsBucketIndexOpCtx() {}

    // The completion callback, fill the response data
    void handle_completion(int r, bufferlist& outbl) {
      if (r >= 0) {
        if (data) {
          try {
            bufferlist::iterator iter = outbl.begin();
            ::decode((*data), iter);
          } catch (buffer::error& err) {
            r = -EIO;
          }
        }
        // Do the next request
      }
      throttler->do_next();
      throttler->complete(r);
      if (completion) {
        completion->release();
      }
    }
  };

  class ClsBucketIndexInitAioThrottler : public ClsBucketIndexListAioThrottler {
  public:
    ClsBucketIndexInitAioThrottler(IoCtx& _io_ctx, const vector<string>& _bucket_objs) :
      ClsBucketIndexListAioThrottler(_io_ctx, _bucket_objs) {}

    virtual void do_next() {
      string oid;
      {
        Mutex::Locker l(lock);
        if (iter_pos == bucket_objects.end())
          return;
        oid = *(iter_pos++);
      }
      AioCompletion* c = librados::Rados::aio_create_completion(NULL, NULL, NULL);
      // Dummy
      bufferlist in;
      librados::ObjectWriteOperation op;
      op.create(true);
      op.exec("rgw", "bucket_init_index", in,
          new ClsBucketIndexOpCtx<int>(NULL, NULL, c, this));
      io_ctx.aio_operate(oid, c, op, NULL);
    }
  };

  int cls_rgw_bucket_index_init_op(librados::IoCtx& io_ctx,
      const vector<string>& bucket_objs, uint32_t max_aio)
  {
    vector<string>::const_iterator iter = bucket_objs.begin();
    bufferlist in;
    ClsBucketIndexAioThrottler* throttler =
        new ClsBucketIndexInitAioThrottler(io_ctx, bucket_objs);
    for (; iter != bucket_objs.end() && max_aio-- > 0; ++iter) {
      throttler->do_next();
    }
    throttler->wait_completion();
    return 0;
  }
 
 
 

Re: KeyFileStore ?

2014-08-04 Thread Guang Yang
On Aug 2, 2014, at 5:34 AM, Samuel Just sam.j...@inktank.com wrote:

 Sage's basic approach sounds about right to me.  I'm fairly skeptical
 about the benefits of packing small objects together within larger
 files, though.  It seems like for very small objects, we would be
 better off stashing the contents opportunistically within the onode.
I really like this idea. For the radosgw + EC use case, lots of small physical 
files are generated (multiple KBs), and when the OSD disk is filled to a 
certain ratio, each read of one chunk can incur several disk I/Os (path lookup 
plus data). Putting the data inside the onode could boost read performance 
and, at the same time, decrease the number of physical files.
 For somewhat larger objects, it seems like the complexity of
 maintaining information about the larger pack objects would be
 equivalent to what the filesystem would do anyway.
 -Sam
 
 On Fri, Aug 1, 2014 at 8:08 AM, Guang Yang yguan...@outlook.com wrote:
 I really like the idea. One scenario that keeps bothering us is that there 
 are too many small files, which makes file system indexing slow (so that a 
 single read request can take more than 10 disk I/Os just for path lookup).
 
 If we pursue this proposal, is there a chance we can take it one step 
 further: instead of storing one physical file per object, we could allocate 
 a big file (tens of GB) and map each object to a chunk within it, so that 
 the big files' descriptors can be cached and we avoid the disk I/O of 
 opening a file per object. At the very least we could keep it flexible, so 
 that if someone wants to implement it this way, they can leverage the 
 existing framework.
 
 Thanks,
 Guang
 
 On Jul 31, 2014, at 1:25 PM, Sage Weil sw...@redhat.com wrote:
 
 After the latest set of bug fixes to the FileStore file naming code I am
 newly inspired to replace it with something less complex.  Right now I'm
 mostly thinking about HDDs, although some of this may map well onto hybrid
 SSD/HDD as well.  It may or may not make sense for pure flash.
 
 Anyway, here are the main flaws with the overall approach that FileStore
 uses:
 
 - It tries to maintain a direct mapping of object names to file names.
 This is problematic because of 255 character limits, rados namespaces, pg
 prefixes, and the pg directory hashing we do to allow efficient split, for
 starters.  It is also problematic because we often want to do things like
 rename but can't make it happen atomically in combination with the rest of
 our transaction.
 
 - The PG directory hashing (that we do to allow efficient split) can have
 a big impact on performance, particularly when injesting lots of data.
 (And when benchmarking.)  It's also complex.
 
 - We often overwrite or replace entire objects.  These are easy
 operations to do safely without doing complete data journaling, but the
 current design is not conducive to doing anything clever (and it's complex
 enough that I wouldn't want to add any cleverness on top).
 
 - Objects may contain only key/value data, but we still have to create an
 inode for them and look that up first.  This only matters for some
 workloads (rgw indexes, cephfs directory objects).
 
 Instead, I think we should try a hybrid approach that more heavily
 leverages a key/value db in combination with the file system.  The kv db
 might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just
 assume it provides transactional key/value storage and efficient range
 operations.  Here's the basic idea:
 
 - The mapping from names to object lives in the kv db.  The object
 metadata is in a structure we can call an onode to avoid confusing it
 with the inodes in the backing file system.  The mapping is simple
 ghobject_t - onode map; there is no PG collection.  The PG collection
 still exist but really only as ranges of those keys.  We will need to be
 slightly clever with the coll_t to distinguish between bare PGs (that
 live in this flat mapping) and the other collections (*_temp and
 metadata), but that should be easy.  This makes PG splitting free as far
 as the objects go.
 
 - The onodes are relatively small.  They will contain the xattrs and
 basic metadata like object size.  They will also identify the file name of
 the backing file in the file system (if size  0).
 
 - The backing file can be a random, short file name.  We can just make a
 one or two level deep set of directories, and let the directories get
 reasonably big... whatever we decide the backing fs can handle
 efficiently.  We can also store a file handle in the onode and use the
 open by handle API; this should let us go directly from onode (in our kv
 db) to the on-disk inode without looking at the directory at all, and fall
 back to using the actual file name only if that fails for some reason
 (say, someone mucked around with the backing files).  The backing file
 need not have any xattrs on it at all (except perhaps some simple id to
 verify it does it fact belong

Re: KeyFileStore ?

2014-08-01 Thread Guang Yang
I really like the idea. One scenario that keeps bothering us is that there are 
too many small files, which makes file system indexing slow (so that a single 
read request can take more than 10 disk I/Os for the path lookup).

If we pursue this proposal, is there a chance we can take it one step further: 
instead of storing one physical file for each object, we allocate a big file 
(tens of GB) and map each object to a chunk within that big file? That way the 
big files' descriptors could all be cached, avoiding the disk I/O to open a 
file. At the least we should keep it flexible, so that if someone would like to 
implement it that way, there is a chance to leverage the existing framework.

Thanks,
Guang

On Jul 31, 2014, at 1:25 PM, Sage Weil sw...@redhat.com wrote:

 After the latest set of bug fixes to the FileStore file naming code I am 
 newly inspired to replace it with something less complex.  Right now I'm 
 mostly thinking about HDDs, although some of this may map well onto hybrid 
 SSD/HDD as well.  It may or may not make sense for pure flash.
 
 Anyway, here are the main flaws with the overall approach that FileStore 
 uses:
 
 - It tries to maintain a direct mapping of object names to file names.  
 This is problematic because of 255 character limits, rados namespaces, pg 
 prefixes, and the pg directory hashing we do to allow efficient split, for 
 starters.  It is also problematic because we often want to do things like 
 rename but can't make it happen atomically in combination with the rest of 
 our transaction.
 
 - The PG directory hashing (that we do to allow efficient split) can have 
 a big impact on performance, particularly when ingesting lots of data.  
 (And when benchmarking.)  It's also complex.
 
 - We often overwrite or replace entire objects.  These are easy 
 operations to do safely without doing complete data journaling, but the 
 current design is not conducive to doing anything clever (and it's complex 
 enough that I wouldn't want to add any cleverness on top).
 
 - Objects may contain only key/value data, but we still have to create an 
 inode for them and look that up first.  This only matters for some 
 workloads (rgw indexes, cephfs directory objects).
 
 Instead, I think we should try a hybrid approach that more heavily 
 leverages a key/value db in combination with the file system.  The kv db 
 might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just 
 assume it provides transactional key/value storage and efficient range 
 operations.  Here's the basic idea:
 
 - The mapping from names to objects lives in the kv db.  The object 
 metadata is in a structure we can call an onode to avoid confusing it 
 with the inodes in the backing file system.  The mapping is a simple 
 ghobject_t -> onode map; there is no PG collection.  The PG collections 
 still exist, but really only as ranges of those keys.  We will need to be 
 slightly clever with the coll_t to distinguish between bare PGs (that 
 live in this flat mapping) and the other collections (*_temp and 
 metadata), but that should be easy.  This makes PG splitting free as far 
 as the objects go.
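 To make the PG-as-key-range idea concrete, here is a small sketch of how
 onode keys could be laid out so that each PG's objects form one contiguous
 key range. The key layout is hypothetical (not the actual encoding), and it
 assumes the placement hash is stored bit-reversed so the placement bits are
 the most significant part of the key; `bits` is assumed to be in 1..32.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical onode key: pool + fixed-width hex hash + object name.
// Fixed-width hex keeps lexicographic order equal to numeric order, so a
// PG is just a key range and splitting a PG only narrows the range.
std::string make_onode_key(uint64_t pool, uint32_t hash, const std::string& name) {
  char buf[32];
  snprintf(buf, sizeof(buf), "%016llx.%08x.", (unsigned long long)pool, hash);
  return std::string(buf) + name;
}

// Key range holding every object of one PG, for pg_num = 2^bits (1 <= bits <= 32).
// With the hash stored bit-reversed, a PG's objects share the top 'bits' bits.
std::pair<std::string, std::string> pg_key_range(uint64_t pool, uint32_t pg_id,
                                                 unsigned bits) {
  uint32_t lo = pg_id << (32 - bits);
  uint32_t hi = lo | (0xffffffffu >> bits);
  // "\xff" sorts after any printable object name, closing the range
  return std::make_pair(make_onode_key(pool, lo, ""),
                        make_onode_key(pool, hi, "\xff"));
}
```

 A range scan over `[first, second]` then enumerates the PG's objects
 without any per-PG directory structure.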
 
 - The onodes are relatively small.  They will contain the xattrs and 
 basic metadata like object size.  They will also identify the file name of 
 the backing file in the file system (if size > 0).
 
 - The backing file can be a random, short file name.  We can just make a 
 one or two level deep set of directories, and let the directories get 
 reasonably big... whatever we decide the backing fs can handle 
 efficiently.  We can also store a file handle in the onode and use the 
 open by handle API; this should let us go directly from onode (in our kv 
 db) to the on-disk inode without looking at the directory at all, and fall 
 back to using the actual file name only if that fails for some reason 
 (say, someone mucked around with the backing files).  The backing file 
 need not have any xattrs on it at all (except perhaps some simple id to 
 verify it does in fact belong to the referring onode, just as a sanity 
 check).
 
 - The name -> onode mapping can live in a disjunct part of the kv 
 namespace so that the other kv stuff associated with the file (like omap 
 pairs or big xattrs or whatever) don't blow up those parts of the 
 db and slow down lookup.
 
 - We can keep a simple LRU of recent onodes in memory and avoid the kv 
 lookup for hot objects.
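 A minimal sketch of such an onode LRU (`Onode` here is a hypothetical
 stand-in for the real structure; a miss means the caller falls back to the
 kv db):

```cpp
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

struct Onode { uint64_t size; std::string backing_file; };

class OnodeLRU {
  size_t capacity_;
  std::list<std::pair<std::string, Onode>> order_;  // front = most recently used
  std::unordered_map<std::string, std::list<std::pair<std::string, Onode>>::iterator> index_;
public:
  explicit OnodeLRU(size_t cap) : capacity_(cap) {}

  Onode* lookup(const std::string& name) {
    auto it = index_.find(name);
    if (it == index_.end()) return nullptr;            // miss: caller hits the kv db
    order_.splice(order_.begin(), order_, it->second); // bump to front
    return &it->second->second;
  }

  void insert(const std::string& name, const Onode& o) {
    if (Onode* hit = lookup(name)) { *hit = o; return; }
    order_.emplace_front(name, o);
    index_[name] = order_.begin();
    if (order_.size() > capacity_) {                   // evict least recently used
      index_.erase(order_.back().first);
      order_.pop_back();
    }
  }
};
```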
 
 - Previously complicated operations like rename are now trivial: we just 
 update the kv db with a transaction.  The backing file never gets renamed, 
 ever, and the other object omap data is keyed by a unique (onode) id, not 
 the name.
 
 Initially, for simplicity, we can start with the existing data journaling 
 behavior.  However, I think there are opportunities to improve the 
 situation there.  There is a pending wip-transactions branch in which I 
 started to rejigger the ObjectStore::Transaction interface a bit so that 
 you identify objects by handle and 

Re: bucket index sharding - IO throttle

2014-07-31 Thread Guang Yang
Thanks Yehuda. I will do that (sorry I was occupied by some other stuff 
recently but I will try my best to provide a patch as soon as possible).

Thanks,
Guang

On Jul 31, 2014, at 1:00 AM, Yehuda Sadeh yeh...@inktank.com wrote:

 Can you send this code through a github pull request (or at least as a
 patch)? It'll be easier to review and comment on.
 
 Thanks,
 Yehuda
 
 On Wed, Jul 30, 2014 at 7:58 AM, Guang Yang yguan...@outlook.com wrote:
 +ceph-devel.
 
 Thanks,
 Guang
 
 On Jul 29, 2014, at 10:20 PM, Guang Yang yguan...@outlook.com wrote:
 
 Hi Yehuda,
 Per your review comment regarding IO throttling for bucket index 
 operations, I prototyped the code below (details still need polish); can 
 you take a look to see whether this is the right way to go?
 
 Another problem I came across is that 
 ClsBucketIndexOpCtx::handle_completion was not called for the bucket index 
 init op (below); is there anything obvious I missed here?
 
 Thanks,
 Guang
 
 
 class ClsBucketIndexAioThrottler {
 protected:
   int completed;
   int ret_code;
   IoCtx& io_ctx;
   Mutex lock;
   struct LockCond {
     Mutex lock;
     Cond cond;
     LockCond() : lock("LockCond"), cond() {}
   } lock_cond;
 public:
   ClsBucketIndexAioThrottler(IoCtx& _io_ctx)
     : completed(0), ret_code(0), io_ctx(_io_ctx),
     lock("ClsBucketIndexAioThrottler"), lock_cond() {}
 
   virtual ~ClsBucketIndexAioThrottler() {}
   virtual void do_next() = 0;
   virtual bool is_completed() = 0;
 
   void complete(int ret) {
     {
       Mutex::Locker l(lock);
       if (ret < 0)
         ret_code = ret;
       ++completed;
     }
 
     lock_cond.lock.Lock();
     lock_cond.cond.Signal();
     lock_cond.lock.Unlock();
   }
 
   int get_ret_code() {
     Mutex::Locker l(lock);
     return ret_code;
   }
 
   virtual int wait_completion() {
     lock_cond.lock.Lock();
     while (1) {
       if (is_completed()) {
         lock_cond.lock.Unlock();
         return ret_code;
       }
       // Cond::Wait() re-acquires the mutex before returning, so no
       // additional Lock() is needed here
       lock_cond.cond.Wait(lock_cond.lock);
     }
   }
 };
 
 class ClsBucketIndexListAioThrottler : public ClsBucketIndexAioThrottler {
 protected:
   vector<string> bucket_objects;
   vector<string>::iterator iter_pos;
 public:
   ClsBucketIndexListAioThrottler(IoCtx& _io_ctx, const vector<string>& _bucket_objs)
     : ClsBucketIndexAioThrottler(_io_ctx), bucket_objects(_bucket_objs),
     iter_pos(bucket_objects.begin()) {}
 
   virtual bool is_completed() {
     Mutex::Locker l(lock);
     int sent = 0;
     vector<string>::iterator iter = bucket_objects.begin();
     for (; iter != iter_pos; ++iter) ++sent;
 
     return (sent == completed &&
         (iter_pos == bucket_objects.end() /*Success*/ || ret_code < 0 /*Failure*/));
   }
 };
 
 template<typename T>
 class ClsBucketIndexOpCtx : public ObjectOperationCompletion {
 private:
   T* data;
   // Return code of the operation
   int* ret_code;
 
   // The AIO completion object associated with this op; it should
   // be released from within the completion handler
   librados::AioCompletion* completion;
   ClsBucketIndexAioThrottler* throttler;
 public:
   ClsBucketIndexOpCtx(T* _data, int* _ret_code, librados::AioCompletion* _completion,
       ClsBucketIndexAioThrottler* _throttler)
     : data(_data), ret_code(_ret_code), completion(_completion),
       throttler(_throttler) {}
   ~ClsBucketIndexOpCtx() {}
 
   // The completion callback: fill in the response data
   void handle_completion(int r, bufferlist& outbl) {
     if (r >= 0) {
       if (data) {
         try {
           bufferlist::iterator iter = outbl.begin();
           ::decode((*data), iter);
         } catch (buffer::error& err) {
           r = -EIO;
         }
       }
     }
     // Issue the next request
     throttler->do_next();
     throttler->complete(r);
     if (completion) {
       completion->release();
     }
   }
 };
 
 class ClsBucketIndexInitAioThrottler : public ClsBucketIndexListAioThrottler {
 public:
   ClsBucketIndexInitAioThrottler(IoCtx& _io_ctx, const vector<string>& _bucket_objs) :
     ClsBucketIndexListAioThrottler(_io_ctx, _bucket_objs) {}
 
   virtual void do_next() {
     string oid;
     {
       Mutex::Locker l(lock);
       if (iter_pos == bucket_objects.end())
         return;
       oid = *(iter_pos++);
     }
     AioCompletion* c = librados::Rados::aio_create_completion(NULL, NULL, NULL);
     // Dummy input buffer
     bufferlist in;
     librados::ObjectWriteOperation op;
     op.create(true);
     op.exec("rgw", "bucket_init_index", in,
             new ClsBucketIndexOpCtx<int>(NULL, NULL, c, this));
     io_ctx.aio_operate(oid, c, op, NULL);
   }
 };
 
 int cls_rgw_bucket_index_init_op(librados::IoCtx& io_ctx,
     const vector<string>& bucket_objs, uint32_t max_aio)
 {
   vector<string>::const_iterator iter = bucket_objs.begin();
   bufferlist in;
   ClsBucketIndexAioThrottler* throttler =
       new ClsBucketIndexInitAioThrottler(io_ctx, bucket_objs);
   for (; iter != bucket_objs.end() && max_aio-- > 0; ++iter) {
     throttler->do_next();
   }
   throttler->wait_completion();
   return 0;
 }
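 For reference, the throttling pattern the prototype implements (a fixed
 window of in-flight operations, where each completion launches the next one)
 can be sketched in standalone form with standard primitives. `submit` below
 is a hypothetical stand-in for io_ctx.aio_operate; it runs one operation and
 invokes `done` with the result:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <vector>

// Run 'ops' with at most max_aio in flight; returns first error (or 0).
int run_throttled(const std::vector<int>& ops, unsigned max_aio,
                  const std::function<void(int, std::function<void(int)>)>& submit) {
  std::mutex m;
  std::condition_variable cv;
  size_t next = 0, completed = 0;
  int ret_code = 0;

  std::function<void()> do_next = [&] {
    int op;
    {
      std::lock_guard<std::mutex> l(m);
      if (next == ops.size() || ret_code < 0) return;  // done or aborted
      op = ops[next++];
    }
    submit(op, [&](int r) {
      do_next();  // refill the window before accounting this completion
      std::lock_guard<std::mutex> l(m);
      if (r < 0) ret_code = r;
      ++completed;
      cv.notify_one();
    });
  };

  for (unsigned i = 0; i < max_aio; ++i) do_next();  // prime the window

  std::unique_lock<std::mutex> l(m);
  cv.wait(l, [&] { return completed == next; });  // all submitted ops finished
  return ret_code;
}
```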
 
 
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: bucket index sharding - IO throttle

2014-07-30 Thread Guang Yang
+ceph-devel.

Thanks,
Guang

On Jul 29, 2014, at 10:20 PM, Guang Yang yguan...@outlook.com wrote:

 Hi Yehuda,
 Per your review comment regarding IO throttling for bucket index operations, 
 I prototyped the code below (details still need polish); can you take a look 
 to see whether this is the right way to go?
 
 Another problem I came across is that ClsBucketIndexOpCtx::handle_completion 
 was not called for the bucket index init op (below); is there anything 
 obvious I missed here?
 
 Thanks,
 Guang
 
 
 class ClsBucketIndexAioThrottler {
 protected:
   int completed;
   int ret_code;
   IoCtx& io_ctx;
   Mutex lock;
   struct LockCond {
     Mutex lock;
     Cond cond;
     LockCond() : lock("LockCond"), cond() {}
   } lock_cond;
 public:
   ClsBucketIndexAioThrottler(IoCtx& _io_ctx)
     : completed(0), ret_code(0), io_ctx(_io_ctx),
     lock("ClsBucketIndexAioThrottler"), lock_cond() {}
 
   virtual ~ClsBucketIndexAioThrottler() {}
   virtual void do_next() = 0;
   virtual bool is_completed() = 0;
 
   void complete(int ret) {
     {
       Mutex::Locker l(lock);
       if (ret < 0)
         ret_code = ret;
       ++completed;
     }
 
     lock_cond.lock.Lock();
     lock_cond.cond.Signal();
     lock_cond.lock.Unlock();
   }
 
   int get_ret_code() {
     Mutex::Locker l(lock);
     return ret_code;
   }
 
   virtual int wait_completion() {
     lock_cond.lock.Lock();
     while (1) {
       if (is_completed()) {
         lock_cond.lock.Unlock();
         return ret_code;
       }
       // Cond::Wait() re-acquires the mutex before returning, so no
       // additional Lock() is needed here
       lock_cond.cond.Wait(lock_cond.lock);
     }
   }
 };
 
 class ClsBucketIndexListAioThrottler : public ClsBucketIndexAioThrottler {
 protected:
   vector<string> bucket_objects;
   vector<string>::iterator iter_pos;
 public:
   ClsBucketIndexListAioThrottler(IoCtx& _io_ctx, const vector<string>& _bucket_objs)
     : ClsBucketIndexAioThrottler(_io_ctx), bucket_objects(_bucket_objs),
     iter_pos(bucket_objects.begin()) {}
 
   virtual bool is_completed() {
     Mutex::Locker l(lock);
     int sent = 0;
     vector<string>::iterator iter = bucket_objects.begin();
     for (; iter != iter_pos; ++iter) ++sent;
 
     return (sent == completed &&
         (iter_pos == bucket_objects.end() /*Success*/ || ret_code < 0 /*Failure*/));
   }
 };
 
 template<typename T>
 class ClsBucketIndexOpCtx : public ObjectOperationCompletion {
 private:
   T* data;
   // Return code of the operation
   int* ret_code;
 
   // The AIO completion object associated with this op; it should
   // be released from within the completion handler
   librados::AioCompletion* completion;
   ClsBucketIndexAioThrottler* throttler;
 public:
   ClsBucketIndexOpCtx(T* _data, int* _ret_code, librados::AioCompletion* _completion,
       ClsBucketIndexAioThrottler* _throttler)
     : data(_data), ret_code(_ret_code), completion(_completion),
       throttler(_throttler) {}
   ~ClsBucketIndexOpCtx() {}
 
   // The completion callback: fill in the response data
   void handle_completion(int r, bufferlist& outbl) {
     if (r >= 0) {
       if (data) {
         try {
           bufferlist::iterator iter = outbl.begin();
           ::decode((*data), iter);
         } catch (buffer::error& err) {
           r = -EIO;
         }
       }
     }
     // Issue the next request
     throttler->do_next();
     throttler->complete(r);
     if (completion) {
       completion->release();
     }
   }
 };
 
 class ClsBucketIndexInitAioThrottler : public ClsBucketIndexListAioThrottler {
 public:
   ClsBucketIndexInitAioThrottler(IoCtx& _io_ctx, const vector<string>& _bucket_objs) :
     ClsBucketIndexListAioThrottler(_io_ctx, _bucket_objs) {}
 
   virtual void do_next() {
     string oid;
     {
       Mutex::Locker l(lock);
       if (iter_pos == bucket_objects.end())
         return;
       oid = *(iter_pos++);
     }
     AioCompletion* c = librados::Rados::aio_create_completion(NULL, NULL, NULL);
     // Dummy input buffer
     bufferlist in;
     librados::ObjectWriteOperation op;
     op.create(true);
     op.exec("rgw", "bucket_init_index", in,
             new ClsBucketIndexOpCtx<int>(NULL, NULL, c, this));
     io_ctx.aio_operate(oid, c, op, NULL);
   }
 };
 
 int cls_rgw_bucket_index_init_op(librados::IoCtx& io_ctx,
     const vector<string>& bucket_objs, uint32_t max_aio)
 {
   vector<string>::const_iterator iter = bucket_objs.begin();
   bufferlist in;
   ClsBucketIndexAioThrottler* throttler =
       new ClsBucketIndexInitAioThrottler(io_ctx, bucket_objs);
   for (; iter != bucket_objs.end() && max_aio-- > 0; ++iter) {
     throttler->do_next();
   }
   throttler->wait_completion();
   return 0;
 }
 
 



row geo-replication to another data store?

2014-07-17 Thread Guang Yang
Hi cephers,
We are investigating a backup solution for Ceph; in short, we would like a 
solution to back up a Ceph cluster to another data store (not a Ceph cluster; 
assume it has a SWIFT API). We would like to have both full backups and 
incremental backups on top of the full backup.

After going through the geo-replication blueprint [1], I am thinking that we 
can leverage that effort and, instead of replicating the data into another Ceph 
cluster, make it replicate to another data store. At the same time, I have a 
couple of questions which need your help:

1) How does the radosgw-agent scale to multiple hosts? Our first investigation 
shows it only works on a single host, but I would like to confirm.
2) Can we configure the interval for incremental backups, e.g. 1 hour / 1 day 
/ 1 month?

[1] 
https://wiki.ceph.com/Planning/Blueprints/Dumpling/RGW_Geo-Replication_and_Disaster_Recovery

Thanks,
Guang


Re: EC pool - empty files in OSD from radosgw

2014-07-14 Thread Guang Yang
Hi Yehuda and Sam,
Any suggestion on top of the below issue?

Thanks,
Guang

On Jul 12, 2014, at 12:43 AM, Guang Yang yguan...@outlook.com wrote:

 Hi Loic,
 I opened an issue about a change introduced by EC pools plus radosgw 
 (http://tracker.ceph.com/issues/8625). In our test cluster, we observed a 
 large number of empty files in the OSDs. The root cause is that for a head 
 object from radosgw, a couple of transactions come together (create 0~0, 
 setxattr, writefull); as EC pools bring in the concept of object 
 generations, the create transaction first creates an object, and the 
 following writefull transaction is taken as an update, renaming the 
 original empty file to a generation and creating/writing a new file. As a 
 result, we observed quite a few empty files.
 
 There is a bug tracking the effort to remove those files with generations, 
 pending backport to firefly; that would definitely help our use case. 
 However, I am also wondering if there is any room for improvement in such a 
 case so that those empty files are not generated in the first place (the 
 change might be on the radosgw side).
 
 Any suggestion is welcomed.
 
 Thanks,
 Guang



EC pool - empty files in OSD from radosgw

2014-07-11 Thread Guang Yang
Hi Loic,
I opened an issue about a change introduced by EC pools plus radosgw 
(http://tracker.ceph.com/issues/8625). In our test cluster, we observed a large 
number of empty files in the OSDs. The root cause is that for a head object 
from radosgw, a couple of transactions come together (create 0~0, setxattr, 
writefull); as EC pools bring in the concept of object generations, the create 
transaction first creates an object, and the following writefull transaction is 
taken as an update, renaming the original empty file to a generation and 
creating/writing a new file. As a result, we observed quite a few empty files.

There is a bug tracking the effort to remove those files with generations, 
pending backport to firefly; that would definitely help our use case. However, 
I am also wondering if there is any room for improvement in such a case so that 
those empty files are not generated in the first place (the change might be on 
the radosgw side).

Any suggestion is welcomed.

Thanks,
Guang


Re: v0.80.2?

2014-07-10 Thread Guang Yang
Hi Sage,
Is it possible to include a fix for this bug - 
http://tracker.ceph.com/issues/8733 - in the next release, considering the 
scope of the change and the regression risk? We are finalizing our production 
launch version, and this one is a blocker as we use EC pools.

Thanks,
Guang

On Jul 11, 2014, at 7:31 AM, Sage Weil sw...@redhat.com wrote:

 We built v0.80.2 yesterday and pushed it out to the repos, but quickly 
 discovered a regression in radosgw that prevented reading objects written 
 with earlier versions.  We pulled the packages, fixed the bug, and are 
 rerunning tests to confirm the fix and ensure there aren't other 
 upgrade-related issues.  We expect to have a v0.80.3 ready tomorrow or 
 Monday.
 
 sage
 



radosgw - bucket index sharding merge back

2014-07-07 Thread Guang Yang
Hi Yehuda,
I am trying to find a way to merge back the bucket index sharding effort, and 
with more experience working on Ceph, I realized that the original commit was 
too large, which made review difficult. I am thinking of breaking it down into 
multiple small commits and merging back with a number of patches. I have two 
questions here:

1) Have you looked at the patch and any suggestion I should pay attention to 
when doing the split?
2) Can we merge back with a series of patches (e.g. several commits one patch)?

Any suggestion that I should pay attention to so as to drive this effort into 
completion?

Thanks,
Guang


XFS - number of files in a directory

2014-06-23 Thread Guang Yang
Hello Cephers,
We used to have a Ceph cluster with our data pool set up with 3 replicas; we 
estimated the number of files (given disk size and object size) for each PG to 
be around 8K, and we disabled folder splitting, which meant all files were 
located in the root PG folder. Our testing showed good performance with such a 
setup.

Right now we are evaluating erasure coding, which splits each object into a 
number of chunks and increases the number of files several times. Although XFS 
claims good support for large directories [1], some testing also showed that we 
may expect performance degradation with large directories.

I would like to hear about your experience with this on your Ceph cluster if 
you are using XFS. Thanks.

[1] http://www.scs.stanford.edu/nyu/02fa/sched/xfs.pdf

Thanks,
Guang


Re: CDS G/H - bucket index sharding

2014-06-23 Thread Guang Yang
Thanks Yehuda, my comments inline...
On Jun 23, 2014, at 10:44 PM, Yehuda Sadeh yeh...@inktank.com wrote:

 On Mon, Jun 23, 2014 at 4:11 AM, Guang Yang yguan...@outlook.com wrote:
 Hello Yehuda,
 I drafted a brief summary of the status of the bucket index sharding 
 blueprint and put it here - 
 http://pad.ceph.com/p/GH-bucket-index-scalability; it would be nice if you 
 could take a look to see whether there is anything I missed. I also posted 
 the pull request here - https://github.com/ceph/ceph/pull/2013.
 
 Just one note regarding the blueprint, other BI log operations will
 need to use the new schema too (e.g., log trim operations).
Yeah, that has been implemented, thanks for pointing it out.
 
 I was thinking a bit about how to do resizing and dynamic sharding
 later on. My thought was that we'd have two bucket prefixes: one for
 read and delete operations, and one for read, write and delete
 operations. Normally both will point at the same prefix and we'll just
 access a single one. But when we're resizing we'll need to use both.
 If we're listing objects we'll access both sets of shards and merge 
 everything. If we're creating an object we'll just create it in the 
 second one. Removing an object, we'll remove it from both.
 The above description is a bit vague, and shouldn't really change what
 we do now. Just that the implementation needs to maybe abstract that
 bucket access decision nicely so that in the future we could implement
 this easily.
Considering the tradeoffs of having multiple shards for the bucket index 
object, we are not likely to create a large number of shards (unless we add 
something like per-shard listing), thus it might make sense to start with the 
upper bound directly (e.g. 50); that might be good enough for most use cases. 
Another direction we could explore is to let the user specify the number of 
shards (e.g. via user-defined metadata) when he/she has an estimate of the 
number of objects in a bucket.

As for dynamic resharding, I think there are two options: one with no data 
migration when changing the number of shards (so there might be multiple 
versions of the truth), and another with data migration. The approach mentioned 
above is the first one; we should be able to implement it with some aggregation 
on the client side across the multiple versions of the truth.
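The merge step for listing during such a reshard could look roughly like this 
(a toy sketch; a real listing is paginated per shard, and writes go only to 
the new shard set while deletes go to both):

```cpp
#include <set>
#include <string>
#include <vector>

// List from both the old and the new shard sets and merge; an object may
// appear in either, so the union with de-duplication is the bucket listing.
std::vector<std::string> merge_listings(
    const std::vector<std::vector<std::string>>& old_shards,
    const std::vector<std::vector<std::string>>& new_shards) {
  std::set<std::string> merged;  // de-duplicates and sorts, like a bucket listing
  for (const auto& shard : old_shards) merged.insert(shard.begin(), shard.end());
  for (const auto& shard : new_shards) merged.insert(shard.begin(), shard.end());
  return std::vector<std::string>(merged.begin(), merged.end());
}
```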
 
 Sadly I'll be off for this CDS, but I'm sure Josh, Greg, Sage, and
 others will be able to help there.
 
 Thanks,
 Yehuda
 



Re: Changes of scrubbing?

2014-06-14 Thread Guang Yang

On Jun 11, 2014, at 10:02 PM, Gregory Farnum g...@inktank.com wrote:

 On Wed, Jun 11, 2014 at 12:54 AM, Guang Yang yguan...@outlook.com wrote:
 On Jun 11, 2014, at 6:33 AM, Gregory Farnum g...@inktank.com wrote:
 
 On Tue, May 20, 2014 at 6:44 PM, Guang Yang yguan...@outlook.com wrote:
 Hi ceph-devel,
 Like some users of Ceph, we are using Ceph for a latency-sensitive 
 project, and scrubbing (especially deep scrubbing) impacts the SLA in a 
 non-trivial way. As commodity hardware can fail in one way or another, I 
 think it is essential to have scrubbing enabled to preserve data 
 durability.
 
 Inspired by how erasure coding backend implement scrubbing[1], I am 
 wondering if the following changes is valid to somehow reduce the 
 performance impact from scrubbing:
 1. Store the CRC checksum along with each physical copy of the object on 
 filesystem (via xattr or omap?)
 2. For read requests, check the CRC locally and, if it mismatches, 
 redirect the request to a replica and mark the PG as inconsistent.
 
 The problem with this is that you need to maintain the CRC across
 partial overwrites of the object. And the real cost of scrubbing isn't
 in the network traffic, it's in the disk reads, which you would have
 to do anyway with this method. :)
 Thanks Greg for the response!
 Partial update is the right concern if it happens frequently. However, the 
 major benefit of this proposal is to postpone the CRC check to READ time 
 instead of doing it in a background job (although we may still need a 
 background check like deep scrubbing, we can reduce its frequency 
 dramatically). By checking the CRC at read time, inconsistent objects are 
 detected, the PG is marked inconsistent, and further we can trigger a 
 repair for the PG.
 
 Oh, I see.
 Still, partial update is in fact the major concern. We have a debug
 mechanism called sloppy crc or similar that keeps track of them for
 full (or sufficiently large?) writes, but it's not something you can
 use on production cluster because it turns every write into a
 read-modify-write cycle, and that's just prohibitively expensive (in
 addition to issues with stuff like OSD restart, I think). This sort of
 thing would make sense for the erasure-coded pools; maybe that would
 be a better place to start?
Yeah, that sounds like a good starting point, let me see if I can spend some 
time doing a simple POC.
Thanks Greg.
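A rough sketch of what the read-time check could look like (the crc32c below 
is a plain bitwise implementation for illustration, and the read path is a 
hypothetical stand-in for the OSD's):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Bitwise CRC32C (reflected polynomial 0x82f63b78), for illustration only.
uint32_t crc32c(const std::string& data) {
  uint32_t crc = 0xffffffffu;
  for (unsigned char c : data) {
    crc ^= c;
    for (int i = 0; i < 8; ++i)
      crc = (crc >> 1) ^ (0x82f63b78u & (0u - (crc & 1)));
  }
  return crc ^ 0xffffffffu;
}

// A copy of the object plus the checksum stored alongside it (e.g. in an xattr).
struct Replica { std::string data; uint32_t stored_crc; };

// Serve the first copy that verifies; any bad copy flags the PG as
// inconsistent so a repair can be scheduled later.
std::string read_with_crc_check(const std::vector<Replica>& copies,
                                bool& pg_inconsistent) {
  for (const auto& r : copies) {
    if (crc32c(r.data) == r.stored_crc)
      return r.data;           // this copy verifies; serve it
    pg_inconsistent = true;    // bad copy: remember to schedule a repair
  }
  return "";                   // no good copy; repair required
}
```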
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com



Re: Changes of scrubbing?

2014-06-11 Thread Guang Yang
On Jun 11, 2014, at 6:33 AM, Gregory Farnum g...@inktank.com wrote:

 On Tue, May 20, 2014 at 6:44 PM, Guang Yang yguan...@outlook.com wrote:
 Hi ceph-devel,
 Like some users of Ceph, we are using Ceph for a latency-sensitive project, 
 and scrubbing (especially deep scrubbing) impacts the SLA in a non-trivial 
 way. As commodity hardware can fail in one way or another, I think it is 
 essential to have scrubbing enabled to preserve data durability.
 
 Inspired by how erasure coding backend implement scrubbing[1], I am 
 wondering if the following changes is valid to somehow reduce the 
 performance impact from scrubbing:
 1. Store the CRC checksum along with each physical copy of the object on 
 filesystem (via xattr or omap?)
 2. For read requests, check the CRC locally and, if it mismatches, redirect 
 the request to a replica and mark the PG as inconsistent.
 
 The problem with this is that you need to maintain the CRC across
 partial overwrites of the object. And the real cost of scrubbing isn't
 in the network traffic, it's in the disk reads, which you would have
 to do anyway with this method. :)
Thanks Greg for the response!
Partial update is the right concern if it happens frequently. However, the 
major benefit of this proposal is to postpone the CRC check to READ time 
instead of doing it in a background job (although we may still need a 
background check like deep scrubbing, we can reduce its frequency 
dramatically). By checking the CRC at read time, inconsistent objects are 
detected, the PG is marked inconsistent, and further we can trigger a repair 
for the PG.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 
 
 This is just a general idea, and details (like append) will need to be 
 further discussed.
 
 By having this, we can schedule scrubbing less aggressively but still 
 preserve durability for reads.
 
 Does this make some sense?
 
 [1] http://ceph.com/docs/master/dev/osd_internals/erasure_coding/pgbackend/
 
 Thanks,
 Guang Yang--
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 



Re: Radosgw - bucket index

2014-06-06 Thread Guang Yang
Hi Yehuda,
Can you take a look, at a very high level, at the code change? Here is the pull 
request: https://github.com/ceph/ceph/pull/1929.

If things look good to you, I will continue the effort and make it more 
complete by the end of next week.

Thanks,
Guang

On Jun 2, 2014, at 9:37 PM, Guang Yang yguan...@outlook.com wrote:

 Hi Yehuda and Sage,
 Can you help comment on the ticket? I would like to send out a pull 
 request some time this week for you to review; before that, it would be 
 nice to see your comments on the interface and any other concerns 
 you may have. Thanks.
 
 Thanks,
 Guang
 
 
 On May 30, 2014, at 8:35 AM, Guang Yang yguan...@outlook.com wrote:
 
 Hi Yehuda,
 I opened an issue here: http://tracker.ceph.com/issues/8473, please help to 
 review and comment.
 
 Thanks,
 Guang
 
 On May 19, 2014, at 2:47 PM, Yehuda Sadeh yeh...@inktank.com wrote:
 
 On Sun, May 18, 2014 at 11:18 PM, Guang Yang yguan...@outlook.com wrote:
 On May 19, 2014, at 7:05 AM, Sage Weil s...@inktank.com wrote:
 
 On Sun, 18 May 2014, Guang wrote:
 radosgw is using the omap key/value API for objects, which is more or 
 less
 equivalent to what swift is doing with sqlite.  This data passes 
 straight
 into leveldb on the backend (or whatever other backend you are using).
 Using something like rocksdb in its place is pretty simple and there are
 unmerged patches to do that; the user would just need to adjust their
 crush map so that the rgw index pool is mapped to a different set of 
 OSDs
 with the better k/v backend.
 Not sure if I am missing anything, but the key difference with Swift's
 implementation is that they use a table for the bucket index, which
 can actually be updated in parallel, making it more scalable for writes,
 though at a certain point the SQL table would suffer performance
 degradation as well.
 
 As I understand it the same limitation is present there too: the index is
 in a single sqlite table.
 
 My more well-formed opinion is that we need to come up with a good
 design. It needs to be flexible enough to be able to grow (and maybe
 shrink), and I assume there would be some kind of background operation
 that will enable that. I also believe that making it hash based is the
 way to go. It looks like the more complicated issue here is
 how to handle the transition in which we shard buckets.
 Yeah, I agree. I think the conflicting goals here are: we want a sorted
 list (so that it enables prefix scans for listing purposes), and we want to
 shard from the very beginning (the problem we are facing is that parallel
 writes updating the same bucket index object need to be
 serialized).
 
 Given how infrequent container listings are, pre-sharding containers
 across several objects makes some sense.  Paying the cost of doing
 listings in parallel across N (where N is not too big) is not a big price
 to pay. However, there will always need to be a way to re-shard further
 when containers/buckets get extremely big.  Perhaps a starting point would
 be support for static sharding where the number of shards is specified at
 container/bucket creation time…
 Considering the scope of the change, I also think this is a good starting 
 point for making bucket index updates more scalable.
 Yehuda, what do you think?
 
 Sharding it will help with scaling it up to a certain point. As Sage
 mentioned we can start with a static setting as a first simpler
 approach, and move into a dynamic approach later on.
 
 Yehuda
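To sketch what static hash-based sharding could look like (the shard count, hash choice, and names here are all hypothetical, not the actual rgw design): each object name hashes to exactly one of N index shard objects, so parallel writes that land on different shards no longer serialize on a single index object.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical: fixed at bucket creation time

def shard_for(object_name, num_shards=NUM_SHARDS):
    # Stable hash-based placement: a key always maps to the same
    # shard, and keys spread roughly evenly across shards.
    digest = hashlib.md5(object_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Each dict stands in for one bucket-index object's sorted omap.
shards = [{} for _ in range(NUM_SHARDS)]

def index_update(object_name, meta):
    # Only one shard object is touched per update, so updates for
    # objects on different shards need no mutual serialization.
    shards[shard_for(object_name)][object_name] = meta

for name in ("photos/1.jpg", "photos/2.jpg", "logs/2014-05-19"):
    index_update(name, {"size": 0})

assert sum(len(s) for s in shards) == 3
```

Note that growing N later re-hashes every key, which is why the re-sharding transition is the hard part of the dynamic approach.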
 
 



Re: Radosgw - bucket index

2014-06-02 Thread Guang Yang
Hi Yehuda and Sage,
Can you help comment on the ticket? I would like to send out a pull request 
some time this week for you to review; before that, it would be nice to see 
your comments on the interface and any other concerns you may have for 
this. Thanks.

Thanks,
Guang


On May 30, 2014, at 8:35 AM, Guang Yang yguan...@outlook.com wrote:

 Hi Yehuda,
 I opened an issue here: http://tracker.ceph.com/issues/8473, please help to 
 review and comment.
 
 Thanks,
 Guang
 
 On May 19, 2014, at 2:47 PM, Yehuda Sadeh yeh...@inktank.com wrote:
 
 On Sun, May 18, 2014 at 11:18 PM, Guang Yang yguan...@outlook.com wrote:
 On May 19, 2014, at 7:05 AM, Sage Weil s...@inktank.com wrote:
 
 On Sun, 18 May 2014, Guang wrote:
 radosgw is using the omap key/value API for objects, which is more or 
 less
 equivalent to what swift is doing with sqlite.  This data passes 
 straight
 into leveldb on the backend (or whatever other backend you are using).
 Using something like rocksdb in its place is pretty simple and there are
 unmerged patches to do that; the user would just need to adjust their
 crush map so that the rgw index pool is mapped to a different set of 
 OSDs
 with the better k/v backend.
 Not sure if I am missing anything, but the key difference with Swift's
 implementation is that they use a table for the bucket index, which
 can actually be updated in parallel, making it more scalable for writes,
 though at a certain point the SQL table would suffer performance
 degradation as well.
 
 As I understand it the same limitation is present there too: the index is
 in a single sqlite table.
 
 My more well-formed opinion is that we need to come up with a good
 design. It needs to be flexible enough to be able to grow (and maybe
 shrink), and I assume there would be some kind of background operation
 that will enable that. I also believe that making it hash based is the
 way to go. It looks like the more complicated issue here is
 how to handle the transition in which we shard buckets.
 Yeah, I agree. I think the conflicting goals here are: we want a sorted
 list (so that it enables prefix scans for listing purposes), and we want to
 shard from the very beginning (the problem we are facing is that parallel
 writes updating the same bucket index object need to be
 serialized).
 
 Given how infrequent container listings are, pre-sharding containers
 across several objects makes some sense.  Paying the cost of doing
 listings in parallel across N (where N is not too big) is not a big price
 to pay. However, there will always need to be a way to re-shard further
 when containers/buckets get extremely big.  Perhaps a starting point would
 be support for static sharding where the number of shards is specified at
 container/bucket creation time…
 Considering the scope of the change, I also think this is a good starting 
 point for making bucket index updates more scalable.
 Yehuda, what do you think?
 
 Sharding it will help with scaling it up to a certain point. As Sage
 mentioned we can start with a static setting as a first simpler
 approach, and move into a dynamic approach later on.
 
 Yehuda
 



Re: Radosgw - bucket index

2014-05-29 Thread Guang Yang
Hi Yehuda,
I opened an issue here: http://tracker.ceph.com/issues/8473, please help to 
review and comment.

Thanks,
Guang

On May 19, 2014, at 2:47 PM, Yehuda Sadeh yeh...@inktank.com wrote:

 On Sun, May 18, 2014 at 11:18 PM, Guang Yang yguan...@outlook.com wrote:
 On May 19, 2014, at 7:05 AM, Sage Weil s...@inktank.com wrote:
 
 On Sun, 18 May 2014, Guang wrote:
 radosgw is using the omap key/value API for objects, which is more or 
 less
 equivalent to what swift is doing with sqlite.  This data passes straight
 into leveldb on the backend (or whatever other backend you are using).
 Using something like rocksdb in its place is pretty simple and there are
 unmerged patches to do that; the user would just need to adjust their
 crush map so that the rgw index pool is mapped to a different set of OSDs
 with the better k/v backend.
 Not sure if I am missing anything, but the key difference with Swift's
 implementation is that they use a table for the bucket index, which
 can actually be updated in parallel, making it more scalable for writes,
 though at a certain point the SQL table would suffer performance
 degradation as well.
 
 As I understand it the same limitation is present there too: the index is
 in a single sqlite table.
 
 My more well-formed opinion is that we need to come up with a good
 design. It needs to be flexible enough to be able to grow (and maybe
 shrink), and I assume there would be some kind of background operation
 that will enable that. I also believe that making it hash based is the
 way to go. It looks like the more complicated issue here is
 how to handle the transition in which we shard buckets.
 Yeah, I agree. I think the conflicting goals here are: we want a sorted
 list (so that it enables prefix scans for listing purposes), and we want to
 shard from the very beginning (the problem we are facing is that parallel
 writes updating the same bucket index object need to be
 serialized).
 
 Given how infrequent container listings are, pre-sharding containers
 across several objects makes some sense.  Paying the cost of doing
 listings in parallel across N (where N is not too big) is not a big price
 to pay. However, there will always need to be a way to re-shard further
 when containers/buckets get extremely big.  Perhaps a starting point would
 be support for static sharding where the number of shards is specified at
 container/bucket creation time…
 Considering the scope of the change, I also think this is a good starting 
 point for making bucket index updates more scalable.
 Yehuda, what do you think?
 
 Sharding it will help with scaling it up to a certain point. As Sage
 mentioned we can start with a static setting as a first simpler
 approach, and move into a dynamic approach later on.
 
 Yehuda



Re: erasure code reliability model

2014-05-26 Thread Guang Yang
Hello Loic and Koleos,
Do we have a wiki page documenting the progress and results of this effort? We 
are very interested in this as well.

Thanks,
Guang

On May 8, 2014, at 1:19 AM, Loic Dachary l...@dachary.org wrote:

 
 Hi,
 
 On 07/05/2014 18:43, Koleos Fuskus wrote:
 Hi Loic,
 
 What do you mean by action plan? If it is the schedule, it is published in my 
 proposal on the Melange site. The details of my proposal are private; if 
 that is what you mean, I can add them to the wiki. If you want this part to 
 be public, no problem.
 
 Yes, that's what I meant :-) You could collect snippets from the proposal 
 posted on melange and present them on a subpage 
 https://wiki.ceph.com/Development and we can use this as the home page of 
 your work ? 
 
 Actually I am using some of my free time to read the Ceph documentation (web 
 and papers). Do you have any specific document to recommend? Maybe we can 
 discuss the durability model again on Friday/Monday. I need some more 
 understanding of the Ceph architecture.
 
 I'm connected on irc.oftc.net#ceph-devel today and tomorrow if you'd like to 
 chat during this community bonding phase (I don't remember what Google calls 
 this for GSoC participants ;-)
 
 Cheers
 
 
 Cheers,
 Verónica
 On Wednesday, May 7, 2014 8:12 AM, Loic Dachary l...@dachary.org wrote:
 Hi Veronica,
 
 I was really happy to hear you're going to work on the erasure code 
 reliability model. Unless I'm mistaken your action plan was not published. 
 Would you mind adding it to the wiki so I can comment on it ? Someone else 
 might be interested to contribute too. I've had a discussion yesterday about 
 durability models (internally) and it is not well understood. Your insight 
 would be precious.
 
 Cheers
 
 -- 
 Loïc Dachary, Artisan Logiciel Libre
 
 
 
 -- 
 Loïc Dachary, Artisan Logiciel Libre
 



Re: erasure code reliability model

2014-05-26 Thread Guang Yang
Thanks Koleos!

Guang
On May 26, 2014, at 7:02 PM, Koleos Fuskus koleosfus...@yahoo.com wrote:

 Hello Guang,
 
 Here is the wiki: 
 https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_modeling
 
 Koleos
 
 
 
 On Monday, May 26, 2014 10:05 AM, Guang Yang yguan...@outlook.com wrote:
 Hello Loic and Koleos,
 Do we have a wiki page documenting the progress and results of this effort? 
 We are very interested in this as well.
 
 Thanks,
 Guang
 
 
 
 
 On May 8, 2014, at 1:19 AM, Loic Dachary l...@dachary.org wrote:
 
 
 Hi,
 
 On 07/05/2014 18:43, Koleos Fuskus wrote:
 Hi Loic,
 
 What do you mean by action plan? If it is the schedule, it is published in 
 my proposal on the Melange site. The details of my proposal are private; if 
 that is what you mean, I can add them to the wiki. If you want this part to 
 be public, no problem.
 
 Yes, that's what I meant :-) You could collect snippets from the proposal 
 posted on melange and present them on a subpage 
 https://wiki.ceph.com/Development and we can use this as the home page of 
 your work ? 
 
 Actually I am using some of my free time to read the Ceph documentation (web 
 and papers). Do you have any specific document to recommend? Maybe we 
 can discuss the durability model again on Friday/Monday. I need some more 
 understanding of the Ceph architecture.
 
 I'm connected on irc.oftc.net#ceph-devel today and tomorrow if you'd like to 
 chat during this community bonding phase (I don't remember what Google 
 calls this for GSoC participants ;-)
 
 Cheers
 
 
 Cheers,
 Verónica
 On Wednesday, May 7, 2014 8:12 AM, Loic Dachary l...@dachary.org wrote:
 Hi Veronica,
 
 I was really happy to hear you're going to work on the erasure code 
 reliability model. Unless I'm mistaken your action plan was not published. 
 Would you mind adding it to the wiki so I can comment on it ? Someone else 
 might be interested to contribute too. I've had a discussion yesterday 
 about durability models (internally) and it is not well understood. Your 
 insight would be precious.
 
 Cheers
 
 -- 
 Loïc Dachary, Artisan Logiciel Libre
 
 
 
 -- 
 Loïc Dachary, Artisan Logiciel Libre
 
 
 



Questions of KeyValueStore (leveldb) backend

2014-05-25 Thread Guang Yang
Hello Haomai,
We are evaluating the key-value store backend that comes with the Firefly 
release (thanks for implementing it in Ceph); it is very promising for a couple 
of our use cases. After going through the related code changes, I have a couple 
of questions that need your help:
  1. One observation is that objects larger than 1KB are striped into multiple 
chunks (key-value pairs in the leveldb table), with a strip size of 1KB. Is 
there any particular reason we chose 1KB as the strip size (and I didn’t find 
a configuration option to tune this value)?

  2. This is probably a leveldb question: do we expect performance degradation 
as the leveldb instance keeps growing (e.g. to several TB)?
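For reference, the striping behavior in question 1 can be modeled roughly like this (a simplified sketch; the strip-key format is made up for illustration, not the backend's actual naming scheme):

```python
STRIP_SIZE = 1024  # the 1 KB strip size observed; assumed fixed for now

def stripe(object_name, data, strip_size=STRIP_SIZE):
    """Split an object's value into fixed-size strips, each stored
    under its own key, so small reads/writes touch only a few strips."""
    return {
        "%s.%08d" % (object_name, i // strip_size): data[i:i + strip_size]
        for i in range(0, len(data), strip_size)
    }

kv = stripe("obj", b"x" * 2500)
assert len(kv) == 3                      # 1024 + 1024 + 452 bytes
assert len(kv["obj.00000002"]) == 452    # last strip holds the remainder
```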

Thanks,
Guang


Re: Questions of KeyValueStore (leveldb) backend

2014-05-25 Thread Guang Yang
Thanks!
On May 26, 2014, at 12:55 PM, Haomai Wang haomaiw...@gmail.com wrote:

 On Mon, May 26, 2014 at 9:46 AM, Guang Yang yguan...@outlook.com wrote:
 Hello Haomai,
 We are evaluating the key-value store backend that comes with the Firefly 
 release (thanks for implementing it in Ceph); it is very promising for a 
 couple of our use cases. After going through the related code changes, I have 
 a couple of questions that need your help:
  1. One observation is that objects larger than 1KB are striped into multiple 
 chunks (key-value pairs in the leveldb table), with a strip size of 1KB. Is 
 there any particular reason we chose 1KB as the strip size (and I didn’t 
 find a configuration option to tune this value)?
 
 1KB is not a carefully chosen value; it will be made configurable in the near future.
 
 
  2. This is probably a leveldb question: do we expect performance 
 degradation as the leveldb instance keeps growing (e.g. to several TB)?
 
 A Ceph OSD is expected to own a physical disk, normally several 
 TB (1-4 TB). LevelDB can handle that easily, especially since we use it to 
 store large values (compared to typical application usage).
 
 
 Thanks,
 Guang
 
 
 
 -- 
 Best Regards,
 
 Wheat
 



Re: Radosgw - bucket index

2014-05-19 Thread Guang Yang
On May 19, 2014, at 7:05 AM, Sage Weil s...@inktank.com wrote:

 On Sun, 18 May 2014, Guang wrote:
 radosgw is using the omap key/value API for objects, which is more or less
 equivalent to what swift is doing with sqlite.  This data passes straight
 into leveldb on the backend (or whatever other backend you are using).
 Using something like rocksdb in its place is pretty simple and there are
 unmerged patches to do that; the user would just need to adjust their
 crush map so that the rgw index pool is mapped to a different set of OSDs
 with the better k/v backend.
 Not sure if I am missing anything, but the key difference with Swift's 
 implementation is that they use a table for the bucket index, which 
 can actually be updated in parallel, making it more scalable for writes, 
 though at a certain point the SQL table would suffer performance 
 degradation as well.
 
 As I understand it the same limitation is present there too: the index is 
 in a single sqlite table.
 
 My more well-formed opinion is that we need to come up with a good
 design. It needs to be flexible enough to be able to grow (and maybe
 shrink), and I assume there would be some kind of background operation
 that will enable that. I also believe that making it hash based is the
 way to go. It looks like the more complicated issue here is
 how to handle the transition in which we shard buckets.
 Yeah, I agree. I think the conflicting goals here are: we want a sorted 
 list (so that it enables prefix scans for listing purposes), and we want to 
 shard from the very beginning (the problem we are facing is that parallel 
 writes updating the same bucket index object need to be 
 serialized).
 
 Given how infrequent container listings are, pre-sharding containers 
 across several objects makes some sense.  Paying the cost of doing 
 listings in parallel across N (where N is not too big) is not a big price 
 to pay. However, there will always need to be a way to re-shard further 
 when containers/buckets get extremely big.  Perhaps a starting point would 
 be support for static sharding where the number of shards is specified at 
 container/bucket creation time…
Considering the scope of the change, I also think this is a good starting point 
for making bucket index updates more scalable.
Yehuda, what do you think?
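On the sorted-listing side of the trade-off: if each shard keeps its own sorted key set (as a sorted omap scan would return it), a bucket listing can k-way merge the N shard listings and still support prefix scans. A minimal sketch, with made-up shard contents:

```python
import heapq

# Hypothetical sorted per-shard key listings (each list is already
# sorted, as a sorted omap scan would return it).
shards = [
    ["apple", "mango"],   # shard 0
    ["banana", "pear"],   # shard 1
    ["cherry", "kiwi"],   # shard 2
]

def list_bucket(shards, prefix=""):
    """Merge N sorted shard listings into one globally sorted stream,
    preserving the prefix-scan property of a single sorted index."""
    return [k for k in heapq.merge(*shards) if k.startswith(prefix)]

assert list_bucket(shards) == ["apple", "banana", "cherry",
                               "kiwi", "mango", "pear"]
assert list_bucket(shards, prefix="p") == ["pear"]
```

For small N the extra cost of a listing is just N parallel range scans plus the merge, which matches the "not a big price to pay" observation above.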
 
 sage
 
