Re: wip-memstore and wip-objectstore

2014-07-22 Thread Haomai Wang
Hi sage,

The fix is https://github.com/ceph/ceph/pull/2136. :-)

On Wed, Jul 23, 2014 at 12:48 AM, Haomai Wang  wrote:
> Thanks, I will dive into it and fix it next.
>
> On Tue, Jul 22, 2014 at 11:49 PM, Sage Weil  wrote:
>> Hi Haomai,
>>
>> Hmm, one other thing: I'm testing the fix in wip-8701 and it is tripping
>> over the KeyValueStore test.  This
>>
>>  ./ceph_test_objectstore 
>> --gtest_filter=ObjectStore/StoreTest.BigRGWObjectName/1
>>
>> fails with
>>
>>  0> 2014-07-22 08:45:25.640932 7fe617fff700 -1 *** Caught signal 
>> (Segmentation fault) **
>>  in thread 7fe617fff700
>>
>>  ceph version 0.82-649-gc5732e4 (c5732e4aefbd80f29b766756478d79808f0245d7)
>>  1: (ceph::BackTrace::BackTrace(int)+0x2d) [0xa09a7b]
>>  2: ./ceph_test_objectstore() [0xb145b6]
>>  3: (()+0x10340) [0x7fe621313340]
>>  4: (std::pair::pair(unsigned long const&, 
>> ghobject_t const&)+0x18) [0xadb710]
>>  5: (std::map, 
>> std::allocator > 
>> >::operator[](unsigned long const&)+0x109) [0xad957f]
>>  6: (RandomCache> std::tr1::shared_ptr > 
>> >::trim_cache(unsigned long)+0xc2) [0xad7218]
>>  7: (RandomCache> std::tr1::shared_ptr > >::add(ghobject_t, 
>> std::pair 
>> >)+0x70) [0xad53a2]
>>  8: (StripObjectMap::lookup_strip_header(coll_t const&, ghobject_t const&, 
>> std::tr1::shared_ptr*)+0x4d3) [0xab1fb1]
>>  9: (KeyValueStore::BufferTransaction::lookup_cached_header(coll_t const&, 
>> ghobject_t const&, std::tr1::shared_ptr*, 
>> bool)+0x1dc) [0xab32c0]
>>  10: (KeyValueStore::_remove(coll_t, ghobject_t const&, 
>> KeyValueStore::BufferTransaction&)+0x188) [0xac0472]
>>  11: (KeyValueStore::_do_transaction(ObjectStore::Transaction&, 
>> KeyValueStore::BufferTransaction&, ThreadPool::TPHandle*)+0x632) [0xabba0c]
>>  12: (KeyValueStore::_do_transactions(std::list> std::allocator >&, unsigned long, 
>> ThreadPool::TPHandle*)+0x138) [0xabb2ee]
>>  13: (KeyValueStore::_do_op(KeyValueStore::OpSequencer*, 
>> ThreadPool::TPHandle&)+0x1f5) [0xabac11]
>>  14: (KeyValueStore::OpWQ::_process(KeyValueStore::OpSequencer*, 
>> ThreadPool::TPHandle&)+0x2f) [0xad3f23]
>>  15: 
>> (ThreadPool::WorkQueue::_void_process(void*, 
>> ThreadPool::TPHandle&)+0x33) [0xadf645]
>>  16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x734) [0xb28f7c]
>>  17: (ThreadPool::WorkThread::entry()+0x23) [0xb2d031]
>>  18: (Thread::entry_wrapper()+0x79) [0xb21647]
>>  19: (Thread::_entry_func(void*)+0x18) [0xb215c4]
>>  20: (()+0x8182) [0x7fe62130b182]
>>  21: (clone()+0x6d) [0x7fe61fc8330d]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
>> interpret this.
>>
>> It's a new test (long file names and collection_move) that addresses an
>> issue with the FileStore, but KeyValueStore doesn't seem to like it
>> either...
>>
>> Thanks!
>> sage
>>
>>
>> On Tue, 22 Jul 2014, Sage Weil wrote:
>>
>>> Hi Haomai,
>>>
>>> Do you mind looking at wip-memstore at
>>>
>>>   https://github.com/ceph/ceph/pull/2125
>>>
>>> A couple minor fixes and then we can enable it in ceph_test_objectstore.
>>>
>>> Also, I would love any feedback on wip-objectstore
>>>
>>>   https://github.com/ceph/ceph/pull/2124
>>>
>>> That one is RFC at this point.  I'm trying to simplify the ObjectStore
>>> interface as much as possible.
>>>
>>> Thanks!
>>> sage
>>>
>>>
>
>
>
> --
> Best Regards,
>
> Wheat



-- 
Best Regards,

Wheat


Re: Adding a delay when restarting all OSDs on a host

2014-07-22 Thread Gregory Farnum
On Tue, Jul 22, 2014 at 6:19 AM, Wido den Hollander  wrote:
> Hi,
>
> Currently on Ubuntu with Upstart when you invoke a restart like this:
>
> $ sudo restart ceph-osd-all
>
> It will restart all OSDs at once, which can increase the load on the system
> quite a bit.
>
> It's better to restart all OSDs by restarting them one by one:
>
> $ sudo ceph restart ceph-osd id=X
>
> But you then have to figure out all the IDs by doing a find in
> /var/lib/ceph/osd and that's more manual work.
>
> I'm thinking of patching the init scripts which allows something like this:
>
> $ sudo restart ceph-osd-all delay=180
>
> It then waits 180 seconds between each OSD restart, making the process even
> smoother.
>
> I know there are currently sysvinit, upstart and systemd scripts, so it has
> to be implemented on various places, but how does the general idea sound?

That sounds like a good idea to me. I presume you mean to actually
delay the restarts, not just stagger when they're started back up, so that
the daemons all remain alive (that's what it sounds like to me here, just
wanted to clarify).
-Greg
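
For reference, a rough sketch of what the delayed restart could look like as a
shell loop (assuming upstart-managed OSDs with data directories under
/var/lib/ceph/osd/ceph-*; this is only an illustration, not the actual init
script change):

  #!/bin/sh
  # Restart each OSD on this host one at a time, sleeping between restarts.
  # DELAY stands in for the proposed "delay=..." argument.
  DELAY=${DELAY:-180}

  for dir in /var/lib/ceph/osd/ceph-*; do
      [ -d "$dir" ] || continue
      id=${dir##*/ceph-}              # OSD id from the directory name
      echo "restarting ceph-osd id=$id"
      restart ceph-osd id="$id" || start ceph-osd id="$id"
      sleep "$DELAY"                  # let the OSD come back up and peer
  done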


RE: Cache tiering read-proxy mode

2014-07-22 Thread Alex Elsayed
Sage Weil wrote:

> [Adding ceph-devel]
> 
> On Mon, 21 Jul 2014, Wang, Zhiqiang wrote:
>> Sage,
>> 
>> I agree with you that promotion on the 2nd read could improve cache
>> tiering's performance for some kinds of workloads. The general idea here
>> is to implement some kinds of policies in the cache tier to measure the
>> warmness of the data. If the cache tier is aware of the data warmness,
>> it could even initiate data movement between the cache tier and the base
>> tier. This means data could be prefetched into the cache tier before
>> reading or writing. But I think this is something we could do in the
>> future.
> 
> Yeah. I suspect it will be challenging to put this sort of prefetching
> intelligence directly into the OSDs, though.  It could possibly be done by
> an external agent, maybe, or could be driven by explicit hints from
> clients ("I will probably access this data soon").
> 
>> The 'promotion on 2nd read' policy is straightforward. Sure, it will
>> benefit some kinds of workload, but not all. If it is implemented as a
>> cache tier option, the user needs to decide whether to turn it on or not. But
>> I'm afraid most users won't be aware of this option. This increases
>> the difficulty of using cache tiering.
> 
> I suspect the 2nd read behavior will be something we'll want to do by
> default...  but yeah, there will be a new pool option (or options) that
> controls the behavior.
> 
>> One question for the implementation of 'promotion on 2nd read': what do
>> we do for the 1st read? Does the cache tier read the object from the base
>> tier without replicating it, or just redirect the client?
> 
> For the first read, we just redirect the client.  Then on the second read,
> we call promote_object().  See maybe_handle_cache() in ReplicatedPG.cc.
> We can pretty easily tell the difference by checking the in-memory HitSet
> for a match.
> 
> Perhaps the option in the pool would be something like
> min_read_recency_for_promote?  If we measure "recency" as "(avg) seconds
> since last access" (loosely), 0 would mean it would promote on first read,
> and anything <= the HitSet interval would mean promote if the object is in
> the current HitSet.  Anything greater than that would mean we'd need to keep
> additional
> previous HitSets in RAM.
> 
> ...which leads us to a separate question of how to describe access
> frequency vs recency.  We keep N HitSets, each covering a time period of T
> seconds.  Normally we only keep the most recent HitSet in memory, unless
> the agent is active (flushing data).  So what I described above is
> checking how recently the last access was (within how many multiples of T
> seconds).  Additionally, though, we could describe the frequency of
> access: was the object accessed at least once in each of the N intervals of T
> seconds?  Or some fraction of them?  That is probably best described as
> "temperature"?  I'm not too fond of the term "recency," though I can't
> think of anything better right now.
> 
> Anyway, for the read promote behavior, recency is probably sufficient, but
> for the tiering agent flush/evict behavior temperature might be a good
> thing to consider...
> 
> sage

It might be worth looking at the MQ (Multi-Queue) caching policy[1], which 
was explicitly designed for second-level caches (which applies here) - the 
client is very likely to be doing caching, whether they use CephFS 
(FSCache), RBD (client caching), or RADOS (application-level); that causes 
some interesting changes in terms of the statistical behavior of the second-
level cache.

[1] 
https://www.usenix.org/legacy/event/usenix01/full_papers/zhou/zhou_html/node9.html
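
To make the recency discussion above concrete, here is a rough sketch (not the
actual ReplicatedPG/HitSet code; it counts HitSet intervals rather than seconds,
and all names are illustrative) of how a min_read_recency_for_promote check
against in-memory HitSets could be structured:

  #include <deque>
  #include <iostream>
  #include <string>
  #include <unordered_set>

  // Simplified stand-in for Ceph's HitSet: one per interval of T seconds,
  // answering "was this object hit during the interval this set covers?"
  struct ToyHitSet {
    std::unordered_set<std::string> hits;
    bool contains(const std::string& oid) const { return hits.count(oid) > 0; }
  };

  // hitsets.front() is the current interval, older intervals follow.
  // recency == 0 -> promote on the first read
  // recency == 1 -> promote only if the object is in the current HitSet
  // recency == N -> promote if it was hit in any of the N most recent HitSets
  //                 (which is why N > 1 requires keeping extra HitSets in RAM)
  bool should_promote_on_read(const std::deque<ToyHitSet>& hitsets,
                              const std::string& oid, unsigned recency) {
    if (recency == 0)
      return true;
    unsigned checked = 0;
    for (const auto& hs : hitsets) {
      if (checked++ >= recency)
        break;
      if (hs.contains(oid))
        return true;
    }
    return false;  // not warm enough: just redirect the read to the base tier
  }

  int main() {
    std::deque<ToyHitSet> hitsets(2);
    hitsets[0].hits.insert("hot_object");   // hit during the current interval
    std::cout << should_promote_on_read(hitsets, "hot_object", 1) << " "
              << should_promote_on_read(hitsets, "cold_object", 1) << "\n";  // 1 0
    return 0;
  }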



Re: Forcing Ceph into mapping all objects to a single PG

2014-07-22 Thread Alex Elsayed
Gah, typed "fletcher4" when I meant "rjenkins" - still, the same applies.



Re: Forcing Ceph into mapping all objects to a single PG

2014-07-22 Thread Alex Elsayed
Gregory Farnum wrote:

> On Mon, Jul 21, 2014 at 3:27 PM, Daniel Hofmann  wrote:
>> Preamble: you might want to read the decently formatted version of this
>> mail at:
>>> https://gist.github.com/daniel-j-h/2daae2237bb21596c97d

>> ---
>>
>> Ceph's object mapping depends on the rjenkins hash function.
>> It's possible to force Ceph into mapping all objects to a single PG.
>>
>> Please discuss!
> 
> Yes, this is an attack vector. It functions against...well, any system
> using hash-based placement.

Sort of. How well it functions is a function (heh) of how easy it is to find
a preimage against the hash (a collision only gives you a pair; you need
preimages to get beyond that).

With fletcher4, preimages aren't particularly difficult to find. By using a
more robust hash[1], preimages become more computationally expensive,
since you need to brute-force each value rather than taking advantage of
a weakness in the algorithm.

This doesn't buy a huge amount, since the brute-force effort per iteration is
still bounded by the number of PGs, but it does help - and it means that as
PGs are split, resistance to the attack increases as well.

> RGW mangles names on its own, although the mangling is deterministic
> enough that an attacker could perhaps manipulate it into mangling them
> onto the same PG. (Within the constraints, though, it'd be pretty
> difficult.)
> RBD names objects in a way that users can't really control, so I guess
> it's safe, sort of? (But users of rbd will still have write permission
> to some class of objects in which they may be able to find an attack.)
> 
> The real issue though, is that any user with permission to write to
> *any* set of objects directly in the cluster will be able to exploit
> this regardless of what barriers we erect. Deterministic placement, in
> that anybody directly accessing the cluster can compute data
> locations, is central to Ceph's design. We could add "salts" or
> something to try and prevent attackers from *outside* the direct set
> (eg, users of RGW) exploiting it directly, but anybody who can read or
> write from the cluster would need to be able to read the salt in order
> to compute locations themselves.

Actually, doing (say) per-pool salts does help in a notable way: even 
someone who can write to two pools can't reuse the computation of colliding 
values across pools. It forces them to expend the work factor for each pool 
they attack, rather than being able to amortize.

> So I'm pretty sure this attack vector
> is:
> 1) Inherent to all hash-placement systems,
> 2) not something we can reasonably defend against *anyway*.

I'd agree that in the absolute sense it's inherent and insoluble, but that 
doesn't imply that _mitigations_ are worthless.

A more drastic option would be to look at how the sfq network scheduler 
handles it - it hashes flows onto a fixed number of queues, and gets around 
collisions by periodically perturbing the salt (resulting in a _stochastic_ 
avoidance of clumping). It'd definitely require some research to find a way 
to do this such that it doesn't cause huge data movement, but it might be 
worth thinking about for the longer term.

[1] I'm thinking along the lines of SipHash, not any heavy-weight 
cryptographic hash; however with network latencies on the table those might 
not be too bad regardless
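
To illustrate the work-factor argument, a toy sketch (deliberately not Ceph's
real placement code; std::hash stands in for rjenkins, and the salt is
hypothetical) of why collecting same-PG names is a brute force bounded by
pg_num, and why a per-pool salt forces that work to be redone per pool:

  #include <cstdint>
  #include <functional>
  #include <iostream>
  #include <string>
  #include <vector>

  // Toy placement: hash the (salted) object name and take it modulo pg_num.
  static uint32_t toy_pg(const std::string& salt, const std::string& oid,
                         uint32_t pg_num) {
    return static_cast<uint32_t>(std::hash<std::string>{}(salt + oid)) % pg_num;
  }

  int main() {
    const uint32_t pg_num = 64;
    const uint32_t target_pg = 0;
    const std::string salt = "pool-a-salt";   // hypothetical per-pool salt

    // Each candidate name lands on target_pg with probability 1/pg_num, so the
    // expected cost per colliding name is O(pg_num) hash evaluations.
    std::vector<std::string> colliding;
    for (uint64_t i = 0; colliding.size() < 5; ++i) {
      std::string name = "obj-" + std::to_string(i);
      if (toy_pg(salt, name, pg_num) == target_pg)
        colliding.push_back(name);
    }
    for (const auto& n : colliding)
      std::cout << n << " -> pg " << target_pg << "\n";

    // With a different salt (say "pool-b-salt") these names scatter again, so
    // the brute force has to be repeated for every pool that is attacked.
    return 0;
  }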



Re: Disabling CRUSH for erasure code and doing custom placement

2014-07-22 Thread Shayan Saeed
Another question along the same lines. For erasure-coded files, same as
replicated files, the request goes through the primary member. Isn't
it possible to send the request to any of the members and get the
file? While this might have kept things neater on the development side
and might have made some sense for a replicated system, it makes
availability and load balancing worse for erasure-coded files. I see a
lot of requests coming in for a specific object, which sometimes makes the
primary OSD hosting it go down, and then all the requests
have to wait until another OSD comes up and the repair is done.  For load
balancing purposes, is there a way to make the requests go to someone
else without hindrance and get the object without waiting for the repair?

Thanks,
Shayan
Regards,
Shayan Saeed


On Tue, Jul 15, 2014 at 1:18 PM, Gregory Farnum  wrote:
> One of Ceph's design tentpoles is *avoiding* a central metadata lookup
> table. The Ceph MDS maintains a filesystem hierarchy but doesn't
> really handle the sort of thing you're talking about, either. If you
> want some kind of lookup, you'll need to build it yourself — although
> you could make use of some RADOS features to do it, if you really
> wanted to. (For instance, depending on scale you could keep an index
> of objects in an omap somewhere.)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
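
As a concrete illustration of the omap-index idea mentioned above, a minimal
librados C++ sketch (the pool name, index object name, and keys are made up,
and error handling is omitted):

  #include <iostream>
  #include <map>
  #include <set>
  #include <string>
  #include <rados/librados.hpp>

  int main() {
    librados::Rados cluster;
    cluster.init("admin");             // connect as client.admin; adjust to taste
    cluster.conf_read_file(NULL);      // read the default ceph.conf
    cluster.connect();

    librados::IoCtx ioctx;
    cluster.ioctx_create("index-pool", ioctx);   // hypothetical pool for the index

    // Record that "myobject" lives in pool "cold-ec-pool" on one index object.
    librados::ObjectWriteOperation wr;
    std::map<std::string, librados::bufferlist> kv;
    kv["myobject"].append(std::string("cold-ec-pool"));
    wr.omap_set(kv);
    ioctx.operate("object_index", &wr);

    // Lookup: one round trip to the index object instead of probing every pool.
    librados::ObjectReadOperation rd;
    std::set<std::string> keys;
    keys.insert("myobject");
    std::map<std::string, librados::bufferlist> vals;
    int rval = 0;
    rd.omap_get_vals_by_keys(keys, &vals, &rval);
    librados::bufferlist unused;
    ioctx.operate("object_index", &rd, &unused);

    if (vals.count("myobject")) {
      librados::bufferlist& bl = vals["myobject"];
      std::cout << "myobject -> " << std::string(bl.c_str(), bl.length()) << std::endl;
    }
    cluster.shutdown();
    return 0;
  }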
>
>
> On Tue, Jul 15, 2014 at 10:11 AM, Shayan Saeed  
> wrote:
>> Well, I did end up putting the data in different pools for custom
>> placement. However, I ran into trouble during retrieval. The messy way
>> is to query every pool to check where the data is stored. This
>> requires many round trips to machines in far-off racks. Is it
>> possible this information is contained within a centralized sort of
>> metadata server? I understand that for a simple object store the MDS is not
>> used, but is there a way to utilize it for faster querying?
>>
>> Regards,
>> Shayan Saeed
>>
>>
>> On Tue, Jun 24, 2014 at 11:37 AM, Gregory Farnum  wrote:
>>> On Tue, Jun 24, 2014 at 8:29 AM, Shayan Saeed  
>>> wrote:
 Hi,

 The CRUSH placement algorithm works really nicely with replication. However,
 with erasure code, my cluster has some issues which require making
 changes that I cannot specify with CRUSH maps.
 Sometimes, depending on the type of data, I would like to place them
 on different OSDs but in the same pool.
>>>
>>> Why do you want to keep the data in the same pool?
>>>

 I realize that to disable the CRUSH placement algorithm and replace
 it with my own custom algorithm, such as a random placement algorithm or any
 other, I have to make changes in the source code. I want to ask if
 there is an easy way to do this without going into every code file,
 looking for where the mapping from objects to PGs is done, and changing
 that. Is there some configuration option which disables CRUSH and
 points to my own placement algorithm file for doing custom placement?
>>>
>>> What you're asking for really doesn't sound feasible, but the thing
>>> that comes closest would probably be resurrecting the "pg preferred"
>>> mechanisms in CRUSH and the Ceph codebase. You'll have to go back
>>> through the git history to find it, but once upon a time we supported
>>> a mechanism that let you specify a specific OSD you wanted a
>>> particular object to live on, and then it would place the remaining
>>> replicas using CRUSH.
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>

 Let me know the neatest way to go about it. I'd appreciate any
 help I can get.

 Regards,
 Shayan Saeed
 Research Assistant, Systems Research Lab
 University of Illinois Urbana-Champaign


Re: [PATCH] rbd: Use kmem_cache_free

2014-07-22 Thread Ilya Dryomov
On Tue, Jul 22, 2014 at 10:11 PM, Himangi Saraogi  wrote:
> Free memory allocated using kmem_cache_zalloc using kmem_cache_free
> rather than kfree.
>
> The Coccinelle semantic patch that makes this change is as follows:
>
> // 
> @@
> expression x,E,c;
> @@
>
>  x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...)
>  ... when != x = E
>  when != &x
> ?-kfree(x)
> +kmem_cache_free(c,x)
> // 
>
> Signed-off-by: Himangi Saraogi 
> Acked-by: Julia Lawall 
> ---
>  drivers/block/rbd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index b2c98c1..8381c54 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -1158,7 +1158,7 @@ static const char *rbd_segment_name(struct rbd_device 
> *rbd_dev, u64 offset)
> if (ret < 0 || ret > CEPH_MAX_OID_NAME_LEN) {
> pr_err("error formatting segment name for #%llu (%d)\n",
> segment, ret);
> -   kfree(name);
> +   kmem_cache_free(rbd_segment_name_cache, name);
> name = NULL;
> }

We seem to have a helper for this, rbd_segment_name_free().  Care to
resend?

Thanks,

Ilya
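
For reference, the resent hunk would presumably just call that helper (a sketch
of the suggestion, not the actual v2 patch):

        if (ret < 0 || ret > CEPH_MAX_OID_NAME_LEN) {
                pr_err("error formatting segment name for #%llu (%d)\n",
                       segment, ret);
                rbd_segment_name_free(name);    /* expected to wrap kmem_cache_free() */
                name = NULL;
        }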


[PATCH] rbd: Use kmem_cache_free

2014-07-22 Thread Himangi Saraogi
Free memory allocated using kmem_cache_zalloc using kmem_cache_free
rather than kfree.

The Coccinelle semantic patch that makes this change is as follows:

// 
@@
expression x,E,c;
@@

 x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...)
 ... when != x = E
 when != &x
?-kfree(x)
+kmem_cache_free(c,x)
// 

Signed-off-by: Himangi Saraogi 
Acked-by: Julia Lawall 
---
 drivers/block/rbd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index b2c98c1..8381c54 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1158,7 +1158,7 @@ static const char *rbd_segment_name(struct rbd_device 
*rbd_dev, u64 offset)
if (ret < 0 || ret > CEPH_MAX_OID_NAME_LEN) {
pr_err("error formatting segment name for #%llu (%d)\n",
segment, ret);
-   kfree(name);
+   kmem_cache_free(rbd_segment_name_cache, name);
name = NULL;
}
 
-- 
1.9.1
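
For readers unfamiliar with Coccinelle: a semantic patch like the one above is
normally applied with spatch, roughly like this (the .cocci file name is made
up, and the exact flags depend on the Coccinelle version):

  spatch --sp-file use-kmem-cache-free.cocci --in-place drivers/block/rbd.c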



Re: wip-memstore and wip-objectstore

2014-07-22 Thread Haomai Wang
Thanks, I will dive into it and fix it next.

On Tue, Jul 22, 2014 at 11:49 PM, Sage Weil  wrote:
> Hi Haomai,
>
> Hmm, one other thing: I'm testing the fix in wip-8701 and it is tripping
> over the KeyValueStore test.  This
>
>  ./ceph_test_objectstore 
> --gtest_filter=ObjectStore/StoreTest.BigRGWObjectName/1
>
> fails with
>
>  0> 2014-07-22 08:45:25.640932 7fe617fff700 -1 *** Caught signal 
> (Segmentation fault) **
>  in thread 7fe617fff700
>
>  ceph version 0.82-649-gc5732e4 (c5732e4aefbd80f29b766756478d79808f0245d7)
>  1: (ceph::BackTrace::BackTrace(int)+0x2d) [0xa09a7b]
>  2: ./ceph_test_objectstore() [0xb145b6]
>  3: (()+0x10340) [0x7fe621313340]
>  4: (std::pair::pair(unsigned long const&, 
> ghobject_t const&)+0x18) [0xadb710]
>  5: (std::map, 
> std::allocator > 
> >::operator[](unsigned long const&)+0x109) [0xad957f]
>  6: (RandomCache std::tr1::shared_ptr > 
> >::trim_cache(unsigned long)+0xc2) [0xad7218]
>  7: (RandomCache std::tr1::shared_ptr > >::add(ghobject_t, 
> std::pair 
> >)+0x70) [0xad53a2]
>  8: (StripObjectMap::lookup_strip_header(coll_t const&, ghobject_t const&, 
> std::tr1::shared_ptr*)+0x4d3) [0xab1fb1]
>  9: (KeyValueStore::BufferTransaction::lookup_cached_header(coll_t const&, 
> ghobject_t const&, std::tr1::shared_ptr*, 
> bool)+0x1dc) [0xab32c0]
>  10: (KeyValueStore::_remove(coll_t, ghobject_t const&, 
> KeyValueStore::BufferTransaction&)+0x188) [0xac0472]
>  11: (KeyValueStore::_do_transaction(ObjectStore::Transaction&, 
> KeyValueStore::BufferTransaction&, ThreadPool::TPHandle*)+0x632) [0xabba0c]
>  12: (KeyValueStore::_do_transactions(std::list std::allocator >&, unsigned long, 
> ThreadPool::TPHandle*)+0x138) [0xabb2ee]
>  13: (KeyValueStore::_do_op(KeyValueStore::OpSequencer*, 
> ThreadPool::TPHandle&)+0x1f5) [0xabac11]
>  14: (KeyValueStore::OpWQ::_process(KeyValueStore::OpSequencer*, 
> ThreadPool::TPHandle&)+0x2f) [0xad3f23]
>  15: (ThreadPool::WorkQueue::_void_process(void*, 
> ThreadPool::TPHandle&)+0x33) [0xadf645]
>  16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x734) [0xb28f7c]
>  17: (ThreadPool::WorkThread::entry()+0x23) [0xb2d031]
>  18: (Thread::entry_wrapper()+0x79) [0xb21647]
>  19: (Thread::_entry_func(void*)+0x18) [0xb215c4]
>  20: (()+0x8182) [0x7fe62130b182]
>  21: (clone()+0x6d) [0x7fe61fc8330d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
>
> It's a new test (long file names and collection_move) that addresses an
> issue with the FileStore, but KeyValueStore doesn't seem to like it
> either...
>
> Thanks!
> sage
>
>
> On Tue, 22 Jul 2014, Sage Weil wrote:
>
>> Hi Haomai,
>>
>> Do you mind looking at wip-memstore at
>>
>>   https://github.com/ceph/ceph/pull/2125
>>
>> A couple minor fixes and then we can enable it in ceph_test_objectstore.
>>
>> Also, I would love any feedback on wip-objectstore
>>
>>   https://github.com/ceph/ceph/pull/2124
>>
>> That one is RFC at this point.  I'm trying to simplify the ObjectStore
>> interface as much as possible.
>>
>> Thanks!
>> sage
>>
>>



-- 
Best Regards,

Wheat


Re: wip-memstore and wip-objectstore

2014-07-22 Thread Sage Weil
Hi Haomai,

Hmm, one other thing: I'm testing the fix in wip-8701 and it is tripping 
over the KeyValueStore test.  This

 ./ceph_test_objectstore --gtest_filter=ObjectStore/StoreTest.BigRGWObjectName/1

fails with

 0> 2014-07-22 08:45:25.640932 7fe617fff700 -1 *** Caught signal 
(Segmentation fault) **
 in thread 7fe617fff700

 ceph version 0.82-649-gc5732e4 (c5732e4aefbd80f29b766756478d79808f0245d7)
 1: (ceph::BackTrace::BackTrace(int)+0x2d) [0xa09a7b]
 2: ./ceph_test_objectstore() [0xb145b6]
 3: (()+0x10340) [0x7fe621313340]
 4: (std::pair::pair(unsigned long const&, 
ghobject_t const&)+0x18) [0xadb710]
 5: (std::map, 
std::allocator > 
>::operator[](unsigned long const&)+0x109) [0xad957f]
 6: (RandomCache > 
>::trim_cache(unsigned long)+0xc2) [0xad7218]
 7: (RandomCache > >::add(ghobject_t, 
std::pair 
>)+0x70) [0xad53a2]
 8: (StripObjectMap::lookup_strip_header(coll_t const&, ghobject_t const&, 
std::tr1::shared_ptr*)+0x4d3) [0xab1fb1]
 9: (KeyValueStore::BufferTransaction::lookup_cached_header(coll_t const&, 
ghobject_t const&, std::tr1::shared_ptr*, 
bool)+0x1dc) [0xab32c0]
 10: (KeyValueStore::_remove(coll_t, ghobject_t const&, 
KeyValueStore::BufferTransaction&)+0x188) [0xac0472]
 11: (KeyValueStore::_do_transaction(ObjectStore::Transaction&, 
KeyValueStore::BufferTransaction&, ThreadPool::TPHandle*)+0x632) [0xabba0c]
 12: (KeyValueStore::_do_transactions(std::list >&, unsigned long, 
ThreadPool::TPHandle*)+0x138) [0xabb2ee]
 13: (KeyValueStore::_do_op(KeyValueStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x1f5) [0xabac11]
 14: (KeyValueStore::OpWQ::_process(KeyValueStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x2f) [0xad3f23]
 15: (ThreadPool::WorkQueue::_void_process(void*, 
ThreadPool::TPHandle&)+0x33) [0xadf645]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x734) [0xb28f7c]
 17: (ThreadPool::WorkThread::entry()+0x23) [0xb2d031]
 18: (Thread::entry_wrapper()+0x79) [0xb21647]
 19: (Thread::_entry_func(void*)+0x18) [0xb215c4]
 20: (()+0x8182) [0x7fe62130b182]
 21: (clone()+0x6d) [0x7fe61fc8330d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

It's a new test (long file names and collection_move) that addresses an 
issue with the FileStore, but KeyValueStore doesn't seem to like it 
either...

Thanks!
sage


On Tue, 22 Jul 2014, Sage Weil wrote:

> Hi Haomai,
> 
> Do you mind looking at wip-memstore at 
> 
>   https://github.com/ceph/ceph/pull/2125
> 
> A couple minor fixes and then we can enable it in ceph_test_objectstore.
> 
> Also, I would love any feedback on wip-objectstore
> 
>   https://github.com/ceph/ceph/pull/2124
> 
> That one is RFC at this point.  I'm trying to simplify the ObjectStore 
> interface as much as possible.
> 
> Thanks!
> sage
> 
>   


Re: Adding a delay when restarting all OSDs on a host

2014-07-22 Thread Andrey Korolyov
On Tue, Jul 22, 2014 at 6:28 PM, Wido den Hollander  wrote:
> On 07/22/2014 03:48 PM, Andrey Korolyov wrote:
>>
>> On Tue, Jul 22, 2014 at 5:19 PM, Wido den Hollander  wrote:
>>>
>>> Hi,
>>>
>>> Currently on Ubuntu with Upstart when you invoke a restart like this:
>>>
>>> $ sudo restart ceph-osd-all
>>>
>>> It will restart all OSDs at once, which can increase the load on the
>>> system quite a bit.
>>>
>>> It's better to restart all OSDs by restarting them one by one:
>>>
>>> $ sudo ceph restart ceph-osd id=X
>>>
>>> But you then have to figure out all the IDs by doing a find in
>>> /var/lib/ceph/osd and that's more manual work.
>>>
>>> I'm thinking of patching the init scripts which allows something like
>>> this:
>>>
>>> $ sudo restart ceph-osd-all delay=180
>>>
>>> It then waits 180 seconds between each OSD restart, making the process even
>>> smoother.
>>>
>>> I know there are currently sysvinit, upstart and systemd scripts, so it
>>> has
>>> to be implemented on various places, but how does the general idea sound?
>>>
>>> --
>>> Wido den Hollander
>>> Ceph consultant and trainer
>>> 42on B.V.
>>>
>>> Phone: +31 (0)20 700 9902
>>> Skype: contact42on
>>> --
>>
>>
>>
>> Hi,
>>
>> this behaviour obviously has a negative side: increased overall
>> peering time and a larger integral value of out-of-SLA delays. I'd vote
>> for warming up the necessary files, most likely the collections, just before
>> the restart. If there is not enough room to hold all of them at once, we
>> can probably combine both methods to achieve a lower impact on
>> restart, although adding a simple delay sounds much more straightforward than
>> putting the file cache into RAM.
>>
>
> In the case I'm talking about there are 23 OSDs running on a single machine
> and restarting all the OSDs causes a lot of peering and reading PG logs.
>
> A warm-up mechanism might work, but that would be a lot of work.
>
> When upgrading your cluster you simply want to do this:
>
> $ dsh -g ceph-osd "sudo restart ceph-osd-all delay=180"
>
> That might take hours to complete, but if it's just an upgrade that doesn't
> matter. You want as little impact on service as possible.
>

I'd suggest measuring the impact with vmtouch[0]; it decreased OSD
startup time greatly in my tests, but I was stuck with the same resource
exhaustion as before once the OSD marked itself up (primarily an IOPS
ceiling).


0. http://hoytech.com/vmtouch/
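
A rough sketch of that warm-up idea (assuming default FileStore paths; vmtouch -t
loads the named files into the page cache):

  # Before restarting OSD $id, pre-load its metadata -- e.g. the leveldb omap
  # directory holding the PG logs -- into the page cache, then restart it.
  sudo vmtouch -t /var/lib/ceph/osd/ceph-$id/current/omap
  sudo restart ceph-osd id=$id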


wip-memstore and wip-objectstore

2014-07-22 Thread Sage Weil
Hi Haomai,

Do you mind looking at wip-memstore at 

https://github.com/ceph/ceph/pull/2125

A couple minor fixes and then we can enable it in ceph_test_objectstore.

Also, I would love any feedback on wip-objectstore

https://github.com/ceph/ceph/pull/2124

That one is RFC at this point.  I'm trying to simplify the ObjectStore 
interface as much as possible.

Thanks!
sage




Re: Adding a delay when restarting all OSDs on a host

2014-07-22 Thread Wido den Hollander

On 07/22/2014 03:48 PM, Andrey Korolyov wrote:

On Tue, Jul 22, 2014 at 5:19 PM, Wido den Hollander  wrote:

Hi,

Currently on Ubuntu with Upstart when you invoke a restart like this:

$ sudo restart ceph-osd-all

It will restart all OSDs at once, which can increase the load on the system
quite a bit.

It's better to restart all OSDs by restarting them one by one:

$ sudo ceph restart ceph-osd id=X

But you then have to figure out all the IDs by doing a find in
/var/lib/ceph/osd and that's more manual work.

I'm thinking of patching the init scripts which allows something like this:

$ sudo restart ceph-osd-all delay=180

It then waits 180 seconds between each OSD restart, making the process even
smoother.

I know there are currently sysvinit, upstart and systemd scripts, so it has
to be implemented on various places, but how does the general idea sound?

--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--



Hi,

this behaviour obviously has a negative side: increased overall
peering time and a larger integral value of out-of-SLA delays. I'd vote
for warming up the necessary files, most likely the collections, just before
the restart. If there is not enough room to hold all of them at once, we
can probably combine both methods to achieve a lower impact on
restart, although adding a simple delay sounds much more straightforward than
putting the file cache into RAM.



In the case I'm talking about there are 23 OSDs running on a single 
machine and restarting all the OSDs causes a lot of peering and reading 
PG logs.


A warm-up mechanism might work, but that would be a lot of work.

When upgrading your cluster you simply want to do this:

$ dsh -g ceph-osd "sudo restart ceph-osd-all delay=180"

That might take hours to complete, but if it's just an upgrade that
doesn't matter. You want as little impact on service as possible.


--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: Adding a delay when restarting all OSDs on a host

2014-07-22 Thread Andrey Korolyov
On Tue, Jul 22, 2014 at 5:19 PM, Wido den Hollander  wrote:
> Hi,
>
> Currently on Ubuntu with Upstart when you invoke a restart like this:
>
> $ sudo restart ceph-osd-all
>
> It will restart all OSDs at once, which can increase the load on the system
> quite a bit.
>
> It's better to restart all OSDs by restarting them one by one:
>
> $ sudo ceph restart ceph-osd id=X
>
> But you then have to figure out all the IDs by doing a find in
> /var/lib/ceph/osd and that's more manual work.
>
> I'm thinking of patching the init scripts which allows something like this:
>
> $ sudo restart ceph-osd-all delay=180
>
> It then waits 180 seconds between each OSD restart, making the process even
> smoother.
>
> I know there are currently sysvinit, upstart and systemd scripts, so it has
> to be implemented on various places, but how does the general idea sound?
>
> --
> Wido den Hollander
> Ceph consultant and trainer
> 42on B.V.
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> --


Hi,

this behaviour obviously has a negative side: increased overall
peering time and a larger integral value of out-of-SLA delays. I'd vote
for warming up the necessary files, most likely the collections, just before
the restart. If there is not enough room to hold all of them at once, we
can probably combine both methods to achieve a lower impact on
restart, although adding a simple delay sounds much more straightforward than
putting the file cache into RAM.


Adding a delay when restarting all OSDs on a host

2014-07-22 Thread Wido den Hollander

Hi,

Currently on Ubuntu with Upstart when you invoke a restart like this:

$ sudo restart ceph-osd-all

It will restart all OSDs at once, which can increase the load on the
system quite a bit.


It's better to restart all OSDs by restarting them one by one:

$ sudo ceph restart ceph-osd id=X

But you then have to figure out all the IDs by doing a find in 
/var/lib/ceph/osd and that's more manual work.


I'm thinking of patching the init scripts which allows something like this:

$ sudo restart ceph-osd-all delay=180

It then waits 180 seconds between each OSD restart, making the process
even smoother.


I know there are currently sysvinit, upstart and systemd scripts, so it 
has to be implemented on various places, but how does the general idea 
sound?


--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on