Re: Is BlueFS an alternative of BlueStore?

2016-01-07 Thread Sage Weil
On Thu, 7 Jan 2016, Javen Wu wrote:
> Hi Sage,
> 
> Sorry to bother you. I am not sure if it is appropriate to send email to you
> directly, but I cannot find any useful information to address my confusion
> from Internet. Hope you can help me.
> 
> Occasionally, I heard that you are going to start BlueFS to eliminate the
> redundancy between the XFS journal and the RocksDB WAL. I am a little confused.
> Is BlueFS only meant to host RocksDB for BlueStore, or is it an
> alternative to BlueStore?
> 
> I am a newcomer to Ceph, so I am not sure my understanding of BlueStore is
> correct. BlueStore in my mind is as below.
> 
>                BlueStore
>     =================================
>        RocksDB
>     +-----------+      +-----------+
>     |   onode   |      |           |
>     |    WAL    |      |           |
>     |   omap    |      |           |
>     +-----------+      |   bdev    |
>     |           |      |           |
>     |    XFS    |      |           |
>     |           |      |           |
>     +-----------+      +-----------+

This is the layout before BlueFS enters the picture.

> I am curious whether BlueFS is able to host RocksDB, since it is effectively
> already a "filesystem" which has to maintain blockmap-style metadata on its
> own WITHOUT the help of RocksDB.

Right.  BlueFS is a really simple "file system" that is *just* complicated 
enough to implement the rocksdb::Env interface, which is what rocksdb 
needs to store its log and sst files.  The after picture looks like

 +------------+
 | bluestore  |
 +----------+ |
 | rocksdb  | |
 +----------+ |
 |  bluefs  | |
 +----------+-+
 |block device|
 +------------+

> The reason we care about the intention and the design target of BlueFS is that
> I had a discussion with my partner Peng.Hse about an idea to introduce a new
> ObjectStore using the ZFS library. I know Ceph supports ZFS as a FileStore
> backend already, but we had a different, immature idea: use libzpool to
> implement a new ObjectStore for Ceph entirely in userspace, without the SPL
> and ZOL kernel modules, so that we can align Ceph transactions with ZFS
> transactions and avoid the double write for the Ceph journal.
> The ZFS core part libzpool (DMU, metaslab etc.) offers a dnode object store
> and it is kernel/userspace independent. Another benefit of the idea is that
> we can extend our metadata without bothering any DBStore.
> 
> Frankly, we are not sure if our idea is realistic so far, but when I heard of
> BlueFS, I felt we needed to understand the BlueFS design goal.

I think it makes a lot of sense, but there are a few challenges.  One 
reason we use rocksdb (or a similar kv store) is that we need in-order 
enumeration of objects in order to do collection listing (needed for 
backfill, scrub, and omap).  You'll need something similar on top of zfs.  

I suspect the simplest path would be to also implement the rocksdb::Env 
interface on top of the zfs libraries.  See BlueRocksEnv.{cc,h} to see the 
interface that has to be implemented...
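
As a rough illustration (not Ceph or ZFS code; the Zfs/DMU helpers in the
comments are hypothetical and the signatures follow a recent rocksdb
release), such an Env could start from rocksdb::EnvWrapper and only override
the file primitives, the same way BlueRocksEnv does for BlueFS:

  #include <rocksdb/env.h>
  #include <memory>
  #include <string>

  // Sketch only: delegate everything to a base Env and override the handful
  // of file primitives rocksdb needs for its log and sst files.
  class ZFSRocksEnv : public rocksdb::EnvWrapper {
   public:
    explicit ZFSRocksEnv(rocksdb::Env* base) : rocksdb::EnvWrapper(base) {}

    rocksdb::Status NewWritableFile(const std::string& fname,
                                    std::unique_ptr<rocksdb::WritableFile>* result,
                                    const rocksdb::EnvOptions& opts) override {
      // Hypothetical: allocate a DMU object via libzpool and return a
      // WritableFile that appends inside a ZFS transaction.
      return rocksdb::Status::NotSupported("sketch only");
    }

    rocksdb::Status FileExists(const std::string& fname) override {
      // Hypothetical: look the name up in a ZAP directory object.
      return rocksdb::Status::NotFound();
    }
  };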

sage


Re: Long peering - throttle at FileStore::queue_transactions

2016-01-06 Thread Sage Weil
On Tue, 5 Jan 2016, Guang Yang wrote:
> On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil  wrote:
> > On Mon, 4 Jan 2016, Guang Yang wrote:
> >> Hi Cephers,
> >> Happy New Year! I have a question regarding the long PG peering.
> >>
> >> Over the last several days I have been looking into the *long peering*
> >> problem when we start an OSD / OSD host. What I observed was that the
> >> two peering worker threads were throttled (stuck) when trying to
> >> queue new transactions (writing the pg log), so the peering process was
> >> dramatically slowed down.
> >>
> >> The first question that came to me was: what were the transactions in the
> >> queue? The major ones, as I saw, included:
> >>
> >> - The osd_map and incremental osd_map. This happens if the OSD had
> >> been down for a while (in a large cluster), or when the cluster got
> >> upgraded, which left the osd_map epoch the down OSD had far behind
> >> the latest osd_map epoch. During OSD boot, it would need to
> >> persist all those osd_maps and generate lots of filestore transactions
> >> (linear with the epoch gap).
> >> > As the PG was not involved in most of those epochs, could we only take
> >> > and persist those osd_maps which matter to the PGs on the OSD?
> >
> > This part should happen before the OSD sends the MOSDBoot message, before
> > anyone knows it exists.  There is a tunable threshold that controls how
> > recent the map has to be before the OSD tries to boot.  If you're
> > seeing this in the real world, we probably just need to adjust that value
> > way down to something small(er).
> It would queue the transactions and then send out the MOSDBoot, thus
> there is still a chance that it could have contention with the peering
> OPs (especially on large clusters where there is a lot of activity
> generating many osdmap epochs). Any chance we can change
> *queue_transactions* to *apply_transactions*, so that we block there
> waiting for the osdmap to be persisted? At least we may be able to
> do that during OSD booting? The concern is, if the OSD is active, the
> apply_transaction would take longer while holding the osd_lock.
> I couldn't find such a tunable, could you elaborate? Thanks!

Yeah, that sounds like a good idea (and clearly safe).  Probably a simpler 
fix is to just call store->flush() or similar before sending the boot 
message?
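
In sketch form (illustrative only -- the helper names below are stand-ins,
not the real OSD code paths), the ordering change amounts to:

  #include <functional>

  // Stand-ins for the pieces involved; the point is only the ordering.
  struct StoreLike {
    std::function<void()> queue_pending_osdmap_transactions;
    std::function<void()> flush;   // blocks until queued txns are applied
  };

  // Persist and flush the osdmap backlog *before* announcing the OSD to the
  // monitors, so post-boot peering doesn't queue behind the map backlog.
  void boot_sequence_sketch(StoreLike& store,
                            const std::function<void()>& send_mosdboot) {
    store.queue_pending_osdmap_transactions();
    store.flush();
    send_mosdboot();
  }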

sage



Re: Long peering - throttle at FileStore::queue_transactions

2016-01-04 Thread Sage Weil
On Mon, 4 Jan 2016, Guang Yang wrote:
> Hi Cephers,
> Happy New Year! I have a question regarding the long PG peering.
> 
> Over the last several days I have been looking into the *long peering*
> problem when we start an OSD / OSD host. What I observed was that the
> two peering worker threads were throttled (stuck) when trying to
> queue new transactions (writing the pg log), so the peering process was
> dramatically slowed down.
> 
> The first question that came to me was: what were the transactions in the
> queue? The major ones, as I saw, included:
> 
> - The osd_map and incremental osd_map. This happens if the OSD had
> been down for a while (in a large cluster), or when the cluster got
> upgraded, which left the osd_map epoch the down OSD had far behind
> the latest osd_map epoch. During OSD boot, it would need to
> persist all those osd_maps and generate lots of filestore transactions
> (linear with the epoch gap).
> > As the PG was not involved in most of those epochs, could we only take and 
> > persist those osd_maps which matter to the PGs on the OSD?

This part should happen before the OSD sends the MOSDBoot message, before 
anyone knows it exists.  There is a tunable threshold that controls how 
recent the map has to be before the OSD tries to boot.  If you're 
seeing this in the real world, we probably just need to adjust that value 
way down to something small(er).

sage



Re: Fwd: how io works when backfill

2015-12-29 Thread Sage Weil
On Tue, 29 Dec 2015, Dong Wu wrote:
> if we add in osd.7 and 7 becomes the primary: pg1.0 [1, 2, 3]  --> pg1.0
> [7, 2, 3], is it similar to the example above?
> We still install a pg_temp entry mapping the PG back to [1, 2, 3], then
> backfill happens to 7, normal io writes go to [1, 2, 3], and io to the
> portion of the PG that has already been backfilled will also be sent
> to osd.7?

Yes (although I forget how it picks the ordering of the osds in the temp 
mapping).  See PG::choose_acting() for the details.

> how about these examples about removing an osd:
> - pg1.0 [1, 2, 3]
> - osd.3 goes down and is removed
> - mapping changes to [1, 2, 5], but osd.5 has no data, so we install a
> pg_temp entry mapping the PG back to [1, 2], then backfill happens to 5,
> - normal io writes go to [1, 2]; if io hits an object which has been
> backfilled to osd.5, the io will also be sent to osd.5
> - when backfill completes, remove the pg_temp and the mapping changes back
> to [1, 2, 5]

Yes

> another example:
> - pg1.0 [1, 2, 3]
> - osd.3 goes down and is removed
> - mapping changes to [5, 1, 2], but osd.5 has no data for the pg, so we
> install a pg_temp entry mapping the PG back to [1, 2], in which osd.1
> temporarily becomes the primary, then backfill happens to 5,
> - normal io writes go to [1, 2]; if io hits an object which has been
> backfilled to osd.5, the io will also be sent to osd.5
> - when backfill completes, remove the pg_temp and the mapping changes back
> to [5, 1, 2]
> 
> is my analysis right?

Yep!

sage

> 
> 2015-12-29 1:30 GMT+08:00 Sage Weil :
> > On Mon, 28 Dec 2015, Zhiqiang Wang wrote:
> >> 2015-12-27 20:48 GMT+08:00 Dong Wu :
> >> > Hi,
> >> > When add osd or remove osd, ceph will backfill to rebalance data.
> >> > eg:
> >> > - pg1.0[1, 2, 3]
> >> > - add an osd(eg. osd.7)
> >> > - ceph start backfill, then pg1.0 osd set changes to [1, 2, 7]
> >> > - if [a, b, c, d, e] are objects needing to backfill to osd.7 and now
> >> > object a is backfilling
> >> > - when a write io hits object a, then the io needs to wait for its
> >> > complete, then goes on.
> >> > - but if io hits object b which has not been backfilled, io reaches
> >> > osd.1, then osd.1 send the io to osd.2  and osd.7, but osd.7 does not
> >> > have object b, so osd.7 needs to wait for object b to backfilled, then
> >> > write. Is it right? Or osd.1 only send the io to osd.2, not both?
> >>
> >> I think in this case, when the write of object b reaches osd.1, it
> >> holds the client write, raises the priority of the recovery of object
> >> b, and kick off the recovery of it. When the recovery of object b is
> >> done, it requeue the client write, and then everything goes like
> >> usual.
> >
> > It's more complicated than that.  In a normal (log-based) recovery
> > situation, it is something like the above: if the acting set is [1,2,3]
> > but 3 is missing the latest copy of A, a write to A will block on the
> > primary while the primary initiates recovery of A immediately.  Once that
> > completes the IO will continue.
> >
> > For backfill, it's different.  In your example, you start with [1,2,3]
> > then add in osd.7.  The OSD will see that 7 has no data for the PG and
> > install a pg_temp entry mapping the PG back to [1,2,3] temporarily.  Then
> > things will proceed normally while backfill happens to 7.  Backfill won't
> > interfere with normal IO at all, except that IO to the portion of the PG
> > that has already been backfilled will also be sent to the backfill target
> > (7) so that it stays up to date.  Once it completes, the pg_temp entry is
> > removed and the mapping changes back to [1,2,7].  Then osd.3 is allowed to
> > remove its copy of the PG.
> >
> > sage
> >
> 
> 


Re: Fwd: how io works when backfill

2015-12-28 Thread Sage Weil
On Mon, 28 Dec 2015, Zhiqiang Wang wrote:
> 2015-12-27 20:48 GMT+08:00 Dong Wu :
> > Hi,
> > When we add or remove an osd, ceph will backfill to rebalance data.
> > eg:
> > - pg1.0 [1, 2, 3]
> > - add an osd (eg. osd.7)
> > - ceph starts backfill, then the pg1.0 osd set changes to [1, 2, 7]
> > - if [a, b, c, d, e] are objects needing to be backfilled to osd.7 and
> > object a is currently backfilling
> > - when a write io hits object a, the io needs to wait for that to
> > complete, then goes on.
> > - but if io hits object b which has not been backfilled, the io reaches
> > osd.1, then osd.1 sends the io to osd.2 and osd.7, but osd.7 does not
> > have object b, so osd.7 needs to wait for object b to be backfilled before
> > writing. Is that right? Or does osd.1 only send the io to osd.2, not both?
> 
> I think in this case, when the write of object b reaches osd.1, it
> holds the client write, raises the priority of the recovery of object
> b, and kicks off its recovery. When the recovery of object b is
> done, it requeues the client write, and then everything goes on as
> usual.

It's more complicated than that.  In a normal (log-based) recovery 
situation, it is something like the above: if the acting set is [1,2,3] 
but 3 is missing the latest copy of A, a write to A will block on the 
primary while the primary initiates recovery of A immediately.  Once that 
completes the IO will continue.

For backfill, it's different.  In your example, you start with [1,2,3] 
then add in osd.7.  The OSD will see that 7 has no data for the PG and 
install a pg_temp entry mapping the PG back to [1,2,3] temporarily.  Then 
things will proceed normally while backfill happens to 7.  Backfill won't 
interfere with normal IO at all, except that IO to the portion of the PG 
that has already been backfilled will also be sent to the backfill target 
(7) so that it stays up to date.  Once it completes, the pg_temp entry is 
removed and the mapping changes back to [1,2,7].  Then osd.3 is allowed to 
remove its copy of the PG.
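
The rule in that last part fits in a few lines (an illustrative sketch with
stand-ins for Ceph's hobject_t and backfill bookkeeping, not the real PG
code):

  #include <cstdint>

  struct HObject {            // stand-in for Ceph's hobject_t
    uint64_t sort_key;        // position in the PG's backfill ordering
  };

  struct BackfillTarget {
    int osd;                  // e.g. osd.7 in the example above
    HObject last_backfill;    // everything at or before this is already copied
  };

  // A replicated write is additionally sent to the backfill target only when
  // the object lies in the already-backfilled portion of the PG; otherwise
  // the target will pick it up later as the backfill scan reaches it.
  bool send_write_to_backfill_target(const HObject& oid,
                                     const BackfillTarget& t) {
    return oid.sort_key <= t.last_backfill.sort_key;
  }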

sage



Re: How to configure if there are tow network cards in Client

2015-12-28 Thread Sage Weil
On Fri, 25 Dec 2015, ?? wrote:
> Hi all,
> When we read the code, we haven't found a way for the client to
> bind a specific IP. In Ceph's configuration, we could only find the parameter
> "public network", but it seems to act on the OSD rather than the client.
> There is a scenario where the client has two network cards, named NIC1 and
> NIC2. NIC1 is responsible for communicating with the cluster (monitor and
> RADOS) and NIC2 carries other services besides Ceph's client. So we need the
> client to be able to bind a specific IP in order to differentiate the IP
> communicating with the cluster from the IP serving other applications. We want
> to know: is there any configuration in Ceph to achieve this? If there is, how
> could we configure the IP? If not, could we add this function to Ceph? Thank
> you so much.

Right.  There isn't a configurable to do this now--we've always just let 
the kernel network layer sort it out. Is this just a matter of calling 
bind on the socket before connecting? I've never done this before..
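
For reference, the mechanism itself is simple at the socket level; something
like the following (illustrative only, not the messenger code) is what such a
configurable would have to wire through:

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <unistd.h>

  // Bind the outgoing socket to a specific local address before connecting,
  // so the session leaves through the chosen NIC/IP.
  int connect_from(const char* local_ip, const char* mon_ip, int mon_port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    sockaddr_in local{};
    local.sin_family = AF_INET;
    local.sin_port = 0;                       // any source port
    inet_pton(AF_INET, local_ip, &local.sin_addr);
    if (bind(fd, (sockaddr*)&local, sizeof(local)) < 0) { close(fd); return -1; }

    sockaddr_in mon{};
    mon.sin_family = AF_INET;
    mon.sin_port = htons(mon_port);
    inet_pton(AF_INET, mon_ip, &mon.sin_addr);
    if (connect(fd, (sockaddr*)&mon, sizeof(mon)) < 0) { close(fd); return -1; }
    return fd;
  }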

sage


Re: [ceph-users] why not add (offset,len) to pglog

2015-12-25 Thread Sage Weil
On Fri, 25 Dec 2015, Ning Yao wrote:
> Hi, Dong Wu,
> 
> 1. As I am currently working on other things, this proposal has been
> abandoned for a long time.
> 2. This is a complicated task, as we need to consider a lot (not just
> writeOp, but also truncate and delete) and also need to consider the
> different effects on different backends (Replicated, EC).
> 3. I don't think it is a good time to redo this patch now, since
> BlueStore and KStore are in progress, and I'm afraid of introducing
> side-effects.  We may prepare and propose the whole design at the next CDS.
> 4. Currently, we already have some tricks to deal with recovery (like
> throttling the max recovery ops, setting the priority for recovery and so
> on). So this kind of patch may not solve the critical problem but just
> make things better, and I am not quite sure that this will really
> bring a big improvement. Based on my previous test, it works
> excellently on slow disks (say hdd), and also for short-time
> maintenance; otherwise, it will trigger the backfill process.  So let's wait
> for Sage's opinion @sage
> 
> If you are interested in this, we can cooperate to do it.

I think it's a great idea.  We didn't do it before only because it is 
complicated.  The good news is that if we can't conclusively infer exactly 
which parts of the object need to be recovered from the log entry we can 
always just fall back to recovering the whole thing.  Also, the place 
where this is currently most visible is RBD small writes:

 - osd goes down
 - client sends a 4k overwrite and modifies an object
 - osd comes back up
 - client sends another 4k overwrite
 - client io blocks while osd recovers 4mb

So even if we initially ignore truncate and omap and EC and clones and 
anything else complicated I suspect we'll get a nice benefit.

I haven't thought about this too much, but my guess is that the hard part 
is making the primary's missing set representation include a partial delta 
(say, an interval_set<> indicating which ranges of the file have changed) 
in a way that gracefully degrades to recovering the whole object if we're 
not sure.
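
As a toy sketch of that shape (illustrative only; interval_set<> and
pg_missing_t are the real types, these are stand-ins):

  #include <algorithm>
  #include <cstdint>
  #include <map>
  #include <optional>

  // toy interval set: offset -> length, no coalescing of overlaps
  using Intervals = std::map<uint64_t, uint64_t>;

  struct MissingItem {
    uint64_t need;                    // version we need to recover to
    std::optional<Intervals> dirty;   // nullopt => recover the whole object

    // Record a small overwrite noted in a log entry.
    void note_write(uint64_t off, uint64_t len) {
      if (dirty)
        (*dirty)[off] = std::max((*dirty)[off], len);
    }

    // Anything we can't reason about (truncate, clone, EC, omap, ...)
    // degrades gracefully to whole-object recovery.
    void degrade_to_full() { dirty.reset(); }
  };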

In any case, we should definitely have the design conversation!

sage

> 
> Regards
> Ning Yao
> 
> 
> 2015-12-25 14:23 GMT+08:00 Dong Wu :
> > Thanks, from this pull request I learned that this issue has not been
> > completed; is there any new progress on it?
> >
> > 2015-12-25 12:30 GMT+08:00 Xinze Chi (??) :
> >> Yeah, this is a good idea for recovery, but not for backfill.
> >> @YaoNing has a pull request about this from this year:
> >> https://github.com/ceph/ceph/pull/3837
> >>
> >> 2015-12-25 11:16 GMT+08:00 Dong Wu :
> >>> Hi,
> >>> I have a question about the pglog. The pglog contains (op, object, version) etc.
> >>> When peering, we use the pglog to construct the missing list, then recover the
> >>> whole object in the missing list even if the data that differs among replicas is
> >>> less than a whole object (e.g., 4MB).
> >>> Why not add (offset, len) to the pglog? If so, the missing list could contain
> >>> (object, offset, len), and then we could reduce the recovered data.
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Xinze Chi
> 
> 


Re: Fwd: Client still connect failed leader after that mon down

2015-12-21 Thread Sage Weil
On Mon, 21 Dec 2015, Zhi Zhang wrote:
> Regards,
> Zhi Zhang (David)
> Contact: zhang.david2...@gmail.com
>   zhangz.da...@outlook.com
> 
> 
> 
> -- Forwarded message --
> From: Jaze Lee 
> Date: Mon, Dec 21, 2015 at 4:08 PM
> Subject: Re: Client still connect failed leader after that mon down
> To: Zhi Zhang 
> 
> 
> Hello,
> I am terribly sorry.
> I think we may not need to restructure monclient.{h,cc}; we find
> that the parameter mon_client_hunt_interval is very useful.
> When we set mon_client_hunt_interval = 0.5, the time to run a ceph
> command is very small even if it first connects to the down leader mon.
> 
> The first time I asked the question was because we found the parameter
> on the official site
> http://docs.ceph.com/docs/master/rados/configuration/mon-config-ref/.
> It is written there as
> 
> mon client hung interval

Yep, that's a typo. Do you mind submitting a patch to fix it?

Thanks!
sage


> 
> Description: The client will try a new monitor every N seconds until it
> establishes a connection.
> Type: Double
> Default: 3.0
> 
> And we set it; it did not work.
> 
> I think maybe it is a slip of the pen?
> The right configuration parameter should be "mon client hunt interval".
> 
> Can someone please help fix this on the official site?
> 
> Thanks a lot.
> 
> 
> 
> 2015-12-21 14:00 GMT+08:00 Jaze Lee :
> > right now we use simple msg, and the ceph version is 0.80...
> >
> > 2015-12-21 10:55 GMT+08:00 Zhi Zhang :
> >> Which msg type and ceph version are you using?
> >>
> >> When we used 0.94.1 with async msg, we encountered a similar issue.
> >> The client was trying to connect to a down monitor when it was just started
> >> and this connection would hang there. This is because the previous async
> >> msg used a blocking connection mode.
> >>
> >> After we backported the non-blocking mode of async msg from a higher ceph
> >> version, we haven't encountered such an issue.
> >>
> >>
> >> Regards,
> >> Zhi Zhang (David)
> >> Contact: zhang.david2...@gmail.com
> >>   zhangz.da...@outlook.com
> >>
> >>
> >> On Fri, Dec 18, 2015 at 11:41 AM, Jevon Qiao  wrote:
> >>> On 17/12/15 21:27, Sage Weil wrote:
> >>>>
> >>>> On Thu, 17 Dec 2015, Jaze Lee wrote:
> >>>>>
> >>>>> Hello cephers:
> >>>>>  In our test, there are three monitors. We find that running a ceph
> >>>>> command from a client is slow when the leader mon is down. Even after a
> >>>>> long time, a client's first ceph command is still slow.
> >>>>> From strace, we find that the client first tries to connect to the leader,
> >>>>> then after 3s it connects to the second one.
> >>>>> After some searching we find that the quorum has not changed; the leader is
> >>>>> still the down monitor.
> >>>>> Is that normal?  Or is there something I missed?
> >>>>
> >>>> It's normal.  Even when the quorum does change, the client doesn't
> >>>> know that.  It should be contacting a random mon on startup, though, so I
> >>>> would expect the 3s delay 1/3 of the time.
> >>>
> >>> That's because the client randomly picks a mon from the monmap. But what we
> >>> observed is that when a mon is down, no change is made to the monmap
> >>> (neither the epoch nor the members). Is that the culprit for this phenomenon?
> >>>
> >>> Thanks,
> >>> Jevon
> >>>
> >>>> A long-standing low-priority feature request is to have the client 
> >>>> contact
> >>>> 2 mons in parallel so that it can still connect quickly if one is down.
> >>>> It requires some non-trivial work in mon/MonClient.{cc,h} though and I
> >>>> don't think anyone has looked at it seriously.
> >>>>
> >>>> sage
> >>>>
> >>>
> >>>
> >
> >
> >
> > --
> > 
> 
> 
> 
> --
> 
> 
> 


Re: puzzling disapearance of /dev/sdc1

2015-12-17 Thread Sage Weil
On Thu, 17 Dec 2015, Loic Dachary wrote:
> Hi Ilya,
> 
> This is another puzzling behavior (the log of all commands is at
> http://tracker.ceph.com/issues/14094#note-4). In a nutshell, after a
> series of sgdisk -i commands to examine various devices including
> /dev/sdc1, the /dev/sdc1 file disappears (and I think it will show up
> again, although I don't have definitive proof of this).
> 
> It looks like a side effect of a previous partprobe command, the only
> command I can think of that removes / re-adds devices. I thought calling
> udevadm settle after running partprobe would be enough to ensure
> partprobe completed (and since it takes as much as 2min30s to return, I
> would be shocked if it did not ;-).
> 
> Any idea? I am desperately trying to find a consistent behavior, something
> reliable that we could use to say: "wait for the partition table to be
> up to date in the kernel and for all udev events generated by the partition
> table update to complete".

I wonder if the underlying issue is that we shouldn't be calling udevadm 
settle from something running from udev.  Instead, if a udev-triggered 
run of ceph-disk does something that changes the partitions, it 
should just exit and let udevadm run ceph-disk again on the new 
devices...?

sage


Re: Client still connect failed leader after that mon down

2015-12-17 Thread Sage Weil
On Thu, 17 Dec 2015, Jaze Lee wrote:
> Hello cephers:
> In our test, there are three monitors. We find that running a ceph
> command from a client is slow when the leader mon is down. Even after a
> long time, a client's first ceph command is still slow.
> From strace, we find that the client first tries to connect to the leader,
> then after 3s it connects to the second one.
> After some searching we find that the quorum has not changed; the leader is
> still the down monitor.
> Is that normal?  Or is there something I missed?

It's normal.  Even when the quorum does change, the client doesn't 
know that.  It should be contacting a random mon on startup, though, so I 
would expect the 3s delay 1/3 of the time.

A long-standing low-priority feature request is to have the client contact 
2 mons in parallel so that it can still connect quickly if one is down.  
It requires some non-trivial work in mon/MonClient.{cc,h} though and I 
don't think anyone has looked at it seriously.

sage



Re: Improving Data-At-Rest encryption in Ceph

2015-12-16 Thread Sage Weil
On Wed, 16 Dec 2015, Adam Kupczyk wrote:
> On Tue, Dec 15, 2015 at 3:23 PM, Lars Marowsky-Bree  wrote:
> > On 2015-12-14T14:17:08, Radoslaw Zarzynski  wrote:
> >
> > Hi all,
> >
> > great to see this revived.
> >
> > However, I have come to see some concerns with handling the encryption
> > within Ceph itself.
> >
> > The key part to any such approach is formulating the threat scenario.
> > For the use cases we have seen, the data-at-rest encryption matters so
> > they can confidently throw away disks without leaking data. It's not
> > meant as a defense against an online attacker. There usually is no
> > problem with "a few" disks being privileged, or one or two nodes that
> > need an admin intervention for booting (to enter some master encryption
> > key somehow, somewhere).
> >
> > However, that requires *all* data on the OSDs to be encrypted.
> >
> > Crucially, that includes not just the file system meta data (so not just
> > the data), but also the root and especially the swap partition. Those
> > potentially include swapped out data, coredumps, logs, etc.
> >
> > (As an optional feature, it'd be cool if an OSD could be moved to a
> > different chassis and continue operating there, to speed up recovery.
> > Another optional feature would be to eventually be able, for those
> > customers that trust them ;-), supply the key to the on-disk encryption
> > (OPAL et al).)
> >
> > The proposal that Joshua posted a while ago essentially remained based
> > on dm-crypt, but put in simple hooks to retrieve the keys from some
> > "secured" server via sftp/ftps instead of loading them from the root fs.
> > Similar to deo, that ties the key to being on the network and knowing
> > the OSD UUID.
> >
> > This would then also be somewhat easily extensible to utilize the same
> > key management server via initrd/dracut.
> >
> > Yes, this means that each OSD disk is separately encrypted, but given
> > modern CPUs, this is less of a problem. It does have the benefit of
> > being completely transparent to Ceph, and actually covering the whole
> > node.
> Agreed; if encryption were infinitely fast, dm-crypt would be the best solution.
> Below is a short analysis of the encryption burden for dm-crypt and
> OSD-level encryption when using replicated pools.
> 
> Summary:
> OSD encryption requires 2.6 times fewer crypto operations than dm-crypt.

Yeah, I believe that, but

> Crypto ops are the bottleneck.

is this really true?  I don't think we've tried to measure performance 
with dm-crypt, but I also have never heard anyone complain about the 
additional CPU utilization or performance impact.  Have you observed this?

sage


Re: cmake

2015-12-16 Thread Sage Weil
On Wed, 16 Dec 2015, Matt Benjamin wrote:
> I'm going to push for cmake work already in progress to be moved to the 
> next milestone ASAP.
> 
> With respect to "make check" blockers, which contains the issue of where 
> cmake puts built objects.  Ali, Casey, and I discussed this today at 
> some length.  We think the current "hackery" to make cmake make check 
> work "the same way" auto* did is long-term undesirable due to it 
> mutating files in the src dir.  I have not assumed that it would be an 
> improvement to put all objects built in a tree of submakes into a single 
> dir, as automake does.  I do think it is essential that at least 
> eventually, it makes it simple to operate on any object that is built, 
> and simple to extend processes like make check.

All of the binaries eventually go into /usr[/local]/bin anyway.  Can we 
do the same here?  (I don't care where intermediate .lo or .o objects 
go...)

> Ali and Casey agree, but contend that the current make check work is 
> "almost finished"--specifically, that it could be finished and a PR sent 
> -this week-.  Rewriting it will take additional time.  They propose 
> starting with finishing and documenting the current setup, then doing a 
> larger cleanup.
> 
> What do others think?

I'd like to aim for what will work best long-term, and that undoubtedly 
also includes some source tree reorganization/cleanup.  But.. it seems 
like John's suggestion would do the trick here?

sage


> 
> Matt
> 
> > >
> > > It seems like the main problem is that automake puts all build targets in
> > > src/ and cmake spreads them all over build/*.  This means that you can't
> > > just add ./ to anything that would normally be in your path (or,
> > > PATH=.:$PATH, and then run, say, ../qa/workunits/cephtool/test.sh).
> > > There's a bunch of kludges in vstart.sh to make it work that I think
> > > mostly point to this issue (and the .libs things).  Is there simply an
> > > option we can give cmake to make it put built binaries directly in build/?
> > >
> > > Stepping back a bit, it seems like the goals should be
> > >
> > > 1. Be able to completely replace autotools.  I don't fancy maintaining
> > > both in parallel.
> > >
> > 
> > Is cmake a viable option in all environments we expect ceph (or any
> > part of it) to be compiled on? (e.g. aix, solaris, freebsd, different
> > linux arm distros, etc.)
> 
> One cannot expect cmake to be pre-installed on those platforms, but it will 
> work on every one you mentioned, some others, not to mention Windows.
> 
> > 
> > > 2. Be able to run vstart etc from the build dir.
> > 
> > There's an awful hack currently in vstart.sh and stop.sh that checks
> > for CMakeCache.txt in the current working directory to verify whether we
> > built using cmake or autotools. Can we make this go away?
> > We can do something like having the build system create a
> > 'ceph-setenv.sh' script that would set the env (or open a shell) with
> > the appropriate paths.
> 
> 
> 
> > 
> > >
> > > 3. Be able to run ./ceph[-anything] from the build dir, or put the build
> > > dir in the path.  (I suppose we could rely on a make install step, but
> > > that seems like more hassle... hopefully it's not necessary?)
> > >
> > > 4. make check has to work
> > >
> > > 5. Use make-dist.sh to generate a release tarball (not make dist)
> > >
> > > 6. gitbuilders use make-dist.sh and cmake to build packages
> > >
> > > 7. release process uses make-dist.sh and cmake to build a release
> > >
> > > I'm probably missing something?
> > >
> > > Should we set a target of doing the 10.0.2 or .3 with cmake?
> > >
> > > sage
> > 
> 
> -- 
> -- 
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
> 
> http://www.redhat.com/en/technologies/storage
> 
> tel.  734-707-0660
> fax.  734-769-8938
> cel.  734-216-5309
> 
> 


Re: cmake

2015-12-16 Thread Sage Weil
On Wed, 16 Dec 2015, John Spray wrote:
> On Wed, Dec 16, 2015 at 5:33 PM, Sage Weil  wrote:
> > The work to transition to cmake has stalled somewhat.  I've tried to use
> > it a few times but keep running into issues that make it unusable for me.
> > Not having make check is a big one, but I think the hackery required to
> > get that going points to the underlying problem(s).
> >
> > It seems like the main problem is that automake puts all build targets in
> > src/ and cmake spreads them all over build/*.  This means that you can't
> > just add ./ to anything that would normally be in your path (or,
> > PATH=.:$PATH, and then run, say, ../qa/workunits/cephtool/test.sh).
> > There's a bunch of kludges in vstart.sh to make it work that I think
> > mostly point to this issue (and the .libs things).  Is there simply an
> > option we can give cmake to make it put built binaries directly in build/?
> >
> > Stepping back a bit, it seems like the goals should be
> >
> > 1. Be able to completely replace autotools.  I don't fancy maintaining
> > both in parallel.
> 
> Yes!
> 
> > 2. Be able to run vstart etc from the build dir.
> 
> I'm currently doing this (i.e. being in the build dir and running
> ../src/vstart.sh), along with the vstart_runner.py for cephfs tests.
> I did indeed have to make sure that vstart_runner was aware of the
> differing binary paths though.
> 
> Though I'm obviously using just MDS+OSD, so I might be overstating the
> extent to which it currently works.
> 
> > 3. Be able to run ./ceph[-anything] from the build dir, or put the build
> > dir in the path.  (I suppose we could rely on a make install step, but
> > that seems like more hassle... hopefully it's not necessary?)
> 
> Shall we just put all our libs and binaries in one place?  This works for me:
> set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
> set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
> set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)
> 
> (to get a bin/ and a lib/ with absolutely everything in)
> 
> That way folks can either get used to typing bin/foo instead of ./foo,
> or add bin/ to their path.

I like the sound of this... ^^  it sort of mirrors what a 
make install would do.

sage


> 
> > 4. make check has to work
> >
> > 5. Use make-dist.sh to generate a release tarball (not make dist)
> >
> > 6. gitbuilders use make-dist.sh and cmake to build packages
> >
> > 7. release process uses make-dist.sh and cmake to build a release
> >
> > I'm probably missing something?
> >
> > Should we set a target of doing the 10.0.2 or .3 with cmake?
> >
> > sage
> 
> 


cmake

2015-12-16 Thread Sage Weil
The work to transition to cmake has stalled somewhat.  I've tried to use 
it a few times but keep running into issues that make it unusable for me.  
Not having make check is a big one, but I think the hackery required to 
get that going points to the underlying problem(s).

It seems like the main problem is that automake puts all build targets in 
src/ and cmake spreads them all over build/*.  This means that you can't 
just add ./ to anything that would normally be in your path (or, 
PATH=.:$PATH, and then run, say, ../qa/workunits/cephtool/test.sh).  
There's a bunch of kludges in vstart.sh to make it work that I think 
mostly point to this issue (and the .libs things).  Is there simply an 
option we can give cmake to make it put built binaries directly in build/?

Stepping back a bit, it seems like the goals should be

1. Be able to completely replace autotools.  I don't fancy maintaining 
both in parallel.

2. Be able to run vstart etc from the build dir.

3. Be able to run ./ceph[-anything] from the build dir, or put the build 
dir in the path.  (I suppose we could rely on a make install step, but 
that seems like more hassle... hopefully it's not necessary?)

4. make check has to work

5. Use make-dist.sh to generate a release tarball (not make dist)

6. gitbuilders use make-dist.sh and cmake to build packages

7. release process uses make-dist.sh and cmake to build a release

I'm probably missing something?

Should we set a target of doing the 10.0.2 or .3 with cmake?

sage


Re: Improving Data-At-Rest encryption in Ceph

2015-12-15 Thread Sage Weil
I agree with Lars's concerns: the main problems with the current dm-crypt 
approach are that there isn't any key management integration yet and the 
root volume and swap aren't encrypted. Those are easy to solve (and I'm 
hoping we'll be able to address them in time for Jewel).

On the other hand, implementing encryption within RADOS will be complex, 
and I don't see what the benefits are over whole-disk encryption.  Can 
someone summarize what per-pool encryption keys and the ability to rotate 
keys give us?  If the threat is an attacker who is on the storage network 
and has compromised an OSD, the game is pretty much up...

At a high level, I think almost anything beyond at-rest encryption (that 
is aimed at throwing out disks or physically walking a server out of the 
data center) turns into a key management and threat mitigation design 
nightmare (with few, if any, compelling solutions) until you give up and 
have clients encrypt their data and don't trust the cluster with the keys 
at all...

sage


On Tue, 15 Dec 2015, Lars Marowsky-Bree wrote:
> On 2015-12-14T14:17:08, Radoslaw Zarzynski  wrote:
> 
> Hi all,
> 
> great to see this revived.
> 
> However, I have come to see some concerns with handling the encryption
> within Ceph itself.
> 
> The key part to any such approach is formulating the threat scenario.
> For the use cases we have seen, the data-at-rest encryption matters so
> they can confidently throw away disks without leaking data. It's not
> meant as a defense against an online attacker. There usually is no
> problem with "a few" disks being privileged, or one or two nodes that
> need an admin intervention for booting (to enter some master encryption
> key somehow, somewhere).
> 
> However, that requires *all* data on the OSDs to be encrypted.
> 
> Crucially, that includes not just the file system meta data (so not just
> the data), but also the root and especially the swap partition. Those
> potentially include swapped out data, coredumps, logs, etc.
> 
> (As an optional feature, it'd be cool if an OSD could be moved to a
> different chassis and continue operating there, to speed up recovery.
> Another optional feature would be to eventually be able, for those
> customers that trust them ;-), supply the key to the on-disk encryption
> (OPAL et al).)
> 
> The proposal that Joshua posted a while ago essentially remained based
> on dm-crypt, but put in simple hooks to retrieve the keys from some
> "secured" server via sftp/ftps instead of loading them from the root fs.
> Similar to deo, that ties the key to being on the network and knowing
> the OSD UUID.
> 
> This would then also be somewhat easily extensible to utilize the same
> key management server via initrd/dracut.
> 
> Yes, this means that each OSD disk is separately encrypted, but given
> modern CPUs, this is less of a problem. It does have the benefit of
> being completely transparent to Ceph, and actually covering the whole
> node.
> 
> Of course, one of the key issues is always the key server.
> Putting/retrieving/deleting keys is reasonably simple, but the question
> of how to ensure HA for it is a bit tricky. But doable; people have been
> building HA ftp/http servers for a while ;-) Also, a single key server
> setup could theoretically serve multiple Ceph clusters.
> 
> It's not yet perfect, but I think the approach is superior to being
> implemented in Ceph natively. If there's any encryption that should be
> implemented in Ceph, I believe it'd be the on-the-wire encryption to
> protect against eavesdroppers.
> 
> Other scenarios would require client-side encryption.
> 
> > Current data at rest encryption is achieved through dm-crypt placed
> > under OSD's filestore. This solution is a generic one and cannot
> > leverage Ceph-specific characteristics. The best example is that
> > encryption is done multiple times - one time for each replica. Another
> > issue is lack of granularity - either OSD encrypts nothing, or OSD
> > encrypts everything (with dm-crypt on).
> 
> True. But for the threat scenario, a holistic approach to encryption
> seems actually required.
> 
> > Cryptographic keys are stored on the filesystem of the storage node that hosts
> > the OSDs. Changing them requires redeploying the OSDs.
> 
> This is solvable by storing the key on an external key server.
> 
> Changing the key is only necessary if the key has been exposed. And with
> dm-crypt, that's still possible - it's not the actual encryption key
> that's stored, but the secret that is needed to unlock it, and that can
> be re-encrypted quite fast. (In theory; it's not implemented yet for
> the Ceph OSDs.)
> 
> 
> > Data incoming from Ceph clients would be encrypted by primary OSD. It
> > would replicate ciphertext to non-primary members of an acting set.
> 
> This still exposes data in coredumps or on swap on the primary OSD, and
> metadata on the secondaries.
> 
> 
> Regards,
> Lars
> 
> -- 
> Architect Storage/HA
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane 

Re: The max single write IOPS on single RBD

2015-12-11 Thread Sage Weil
On Fri, 11 Dec 2015, Zhi Zhang wrote:
> Hi Guys,
> 
> We have a small 4-node cluster. Here is the hardware configuration.
> 
> 11 x 300GB SSD, 24 cores, 32GB memory per one node.
> All the nodes are connected via one 1Gb/s network.
> 
> So we have one Monitor and 44 OSDs for testing kernel RBD IOPS using
> fio. Here are the major fio options.
> 
> -direct=1
> -rw=randwrite
> -ioengine=psync
> -size=1000M
> -bs=4k
> -numjobs=1
> 
> The max IOPS we can achieve for single write (numjobs=1) is close to
> 1000. This means each IO from RBD takes 1.x ms.
> 
> From the osd logs, we can also observe that most osd_ops take 1.x ms,
> including op processing, journal writing, replication, etc., before
> sending the commit back to the client.
> 
> The network RTT is around 0.04 ms;
> Most osd_ops on primary OSD take around 0.5~0.7 ms, journal write takes 0.3 
> ms;
> Most osd_repops including writing journal on peer OSD take around 0.5 ms.
> 
> We even tried modifying the journal to write to the page cache only, but didn't
> get a very significant improvement. Does this mean this is the best result
> we can get for a single write on a single RBD?

What version is this?  There have been a few recent changes that will 
reduce the wall clock time spent preparing/processing a request.  There is 
still a fair bit of work to do here, though--the theoretical lower bound 
is the SSD write time + 2x RTT (client <-> primary osd <-> replica osd <-> 
replica ssd).
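
Plugging in the numbers above as a rough illustration (serial psync IO,
queue depth 1):

  lower bound per write ~= SSD/journal write + 2 x RTT
                        ~= 0.3 ms + 2 x 0.04 ms = 0.38 ms  -> ~2600 IOPS

so the observed ~1 ms (~1000 IOPS) suggests most of the remaining time is
still spent in the OSD request path rather than in the network or the flash.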

sage



Re: [ceph-users] Client io blocked when removing snapshot

2015-12-10 Thread Sage Weil
On Thu, 10 Dec 2015, Jan Schermer wrote:
> Removing snapshot means looking for every *potential* object the snapshot can 
> have, and this takes a very long time (6TB snapshot will consist of 1.5M 
> objects (in one replica) assuming the default 4MB object size). The same 
> applies to large thin volumes (don't try creating and then dropping a 1 EiB 
> volume, even if you only have 1GB of physical space :)).
> Doing this is simply expensive and might saturate your OSDs. If you don't 
> have enough RAM to cache the structure then all the "is there a file 
> /var/lib/ceph/" will go to disk and that can hurt a lot.
> I don't think there's any priority to this (is there?), so it competes with 
> everything else.
> 
> I'm not sure how snapshots are exactly coded in Ceph, but in a COW filesystem 
> you simply don't dereference blocks of the parent of the  snapshot when doing 
> writes to it and that's cheap, but Ceph stores "blocks" in files with 
> computable names and has no pointers to them that could be modified,  so by 
> creating a snapshot you hurt the performance a lot (you need to create a copy 
> of the 4MB object into the snapshot(s) when you dirty a byte in there). 
> Though I remember reading that the logic is actually reversed and it is the 
> snapshot that gets the original blocks(??)...
> Anyway, if you are removing a snapshot at the same time as writing to the parent,
> there could potentially be a problem in what gets done first. Is Ceph smart
> enough to not care about snapshots that are getting deleted? I have no idea,
> but I think it must be, because we use snapshots a lot and haven't had
> any issues with it.

It's not quite so bad... the OSD maintains a map (in leveldb) of the 
objects that are referenced by a snapshot, so the amount of work is 
proportional to the number of objects that were cloned for that snapshot.
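
To put rough numbers on that (using the figures from the message above): a
6TB image is ~1.5M 4MB objects, but if only, say, 10k of those objects were
actually overwritten after the snapshot was taken, trimming that snapshot has
~10k clones to clean up, not 1.5M.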

There is certainly room for improvement in terms of the impact on client 
IO, though.  :)

sage


Re: new OSD re-using old OSD id fails to boot

2015-12-09 Thread Sage Weil
On Wed, 9 Dec 2015, David Zafman wrote:
> On 12/9/15 2:39 AM, Wei-Chung Cheng wrote:
> > Hi Loic,
> > 
> > I try to reproduce this problem on my CentOS7.
> > I can not do the same issue.
> > This is my version:
> > ceph version 10.0.0-928-g8eb0ed1 (8eb0ed1dcda9ee6180a06ee6a4415b112090c534)
> > Would you describe more detail?
> > 
> > 
> > Hi David, Sage,
> > 
> > Most of the time, when we find the osd failure, the OSD is already in
> > the `out` state.
> > We cannot avoid the redundant data movement unless we set the
> > osd noout on failure.
> > Is that right? (Meaning: if the OSD goes into the `out` state, it will cause
> > some redundant data movement.)
> Yes, one case would be that during the 5 minute down window of an OSD disk
> failure, the noout flag can be set if a spare disk is available.  Another
> scenario would be a bad SMART status or noticing EIO errors from a disk
> prompting a replacement.  So if a spare disk is already installed or you have
> hot swappable drives, it would be nice to replace the drive and let recovery
> copy back all the data that should be there.  Using noout would be critical to
> this effort.
> 
> I don't understand why Sage suggests below that a down+out phase would be
> required during the replacement.

Hmm, I wasn't thinking about a hot spare scenario.  We've always assumed 
that there is no point to hot spares--you may as well have them 
participating in the cluster, doing useful work, and let the rebalancing 
after a failure be distributed across all disks (and not hammer the replacement).

sage


> > 
> > Could we try the traditional spare behavior? (Set some disks aside as spares
> > and auto-replace the broken device?)
> > 
> > That could replace the failed osd before it goes into the `out` state.
> > Or we could always set the osd noout?
> > 
> > In fact, I think David and Loic are describing different problems.
> > (These two problems are equally important :p)
> > 
> > If you have any problems, feel free to let me know.
> > 
> > thanks!!
> > vicente
> > 
> > 
> > 2015-12-09 10:50 GMT+08:00 Sage Weil :
> > > On Tue, 8 Dec 2015, David Zafman wrote:
> > > > Remember I really think we want a disk replacement feature that would
> > > > retain
> > > > the OSD id so that it avoids unnecessary data movement.  See tracker
> > > > http://tracker.ceph.com/issues/13732
> > > Yeah, I totally agree.  We just need to form an opinion on how... probably
> > > starting with the user experience.  Ideally we'd go from up + in to down +
> > > in to down + out, then pull the drive and replace, and then initialize a
> Here 
> > > new OSD with the same id... and journal partition.  Something like
> > > 
> > >ceph-disk recreate id=N uuid=U 
> > > 
> > > I.e., it could use the uuid (which the cluster has in the OSDMap) to find
> > > (and re-use) the journal device.
> > > 
> > > For a journal failure it'd probably be different.. but maybe not?
> > > 
> > > Any other ideas?
> > > 
> > > sage
> 
> 
> 


Re: new OSD re-using old OSD id fails to boot

2015-12-09 Thread Sage Weil
On Wed, 9 Dec 2015, Wei-Chung Cheng wrote:
> Hi Loic,
> 
> I try to reproduce this problem on my CentOS7.
> I can not do the same issue.
> This is my version:
> ceph version 10.0.0-928-g8eb0ed1 (8eb0ed1dcda9ee6180a06ee6a4415b112090c534)
> Would you describe more detail?
> 
> 
> Hi David, Sage,
> 
> Most of the time, when we find the osd failure, the OSD is already in
> the `out` state.
> We cannot avoid the redundant data movement unless we set the
> osd noout on failure.
> Is that right? (Meaning: if the OSD goes into the `out` state, it will cause
> some redundant data movement.)
> 
> Could we try the traditional spare behavior? (Set some disks aside as spares
> and auto-replace the broken device?)
> 
> That could replace the failed osd before it goes into the `out` state.
> Or we could always set the osd noout?

I don't think there is a problem with 'out' if the osd id is reused and 
the crush position remains the same.  And I expect usually the OSD will be 
replaced by a disk with a similar size.  If the replacement is smaller (or 
0--removed entirely) then you get double-movement, but if it's the same or 
larger I think it's fine.

The sequence would be something like

 up + in
 down + in 
 5-10 minutes go by
 down + out(marked out by monitor)
  new replicas uniformly distributed across cluster
 days go by
 disk removed
 new disk inserted
 ceph-disk recreate ... recreates osd dir w/ the same id, new uuid
 on startup, osd adjusts crush weight (maybe.. usually by a smallish amount) 
 up + in
  replicas migrate back to new device

sage



Re: problem about pgmeta object?

2015-12-09 Thread Sage Weil
On Wed, 9 Dec 2015, Ning Yao wrote:
> The functions and transactions corresponding to the pgmeta object are listed
> below:
> touch()  and  remove()  for pgmeta object creation and deletion
> _omap_setkeys()  and  _omap_rmkeys()  to update k/v data in omap for
> pgmeta (pg_info, epoch, rollback_info, pg_log and so on)
> omap_get_values()  to retrieve data from omap, always called during the osd
> daemon start process

Note that we should make sure collection_list continues to show it, so we 
can't skip creating the file...
 
> So I think performance is hurt in the writeOp routine, as it heavily calls
> _omap_setkeys()  and  _omap_rmkeys(). The ideal way to deal with it, of
> course, is to eventually treat pgmeta as a logical object (not a real object
> on disk). But that requires altering FileStore a lot, since we cannot directly
> remove the t.touch()  and  t.remove() transactions and upgrade the store to
> remove the pgmeta object based on info_struct_v. And as we discussed before,
> it is a FileStore issue and we need to keep compatibility with other backends.
> Therefore, in order not to alter FileStore too much and to keep all its
> compatibility, my strategy is just to skip the pgmeta object existence
> check on the main routine of writeOps, like this:
> https://github.com/ceph/ceph/pull/6870

Yep, this sounds like the right hack to me.
 
> This brings the benefit of reducing the average _omap_setkeys() execution time
> from 123.784us to 108.444us (about a 15% improvement), and reducing
> overall cpu time by 0.5% ~ 1% globally.

That seems worthwhile!

sage

Re: Querying since when a PG is inactive

2015-12-09 Thread Sage Weil
Hi Wido!

On Wed, 9 Dec 2015, Wido den Hollander wrote:
> Hi,
> 
> I'm working on a patch in PGMonitor.cc that sets the state to HEALTH_ERR
> if >= X PGs are stuck non-active.
> 
> This works for me now, but I would like to add a requirement that a PG has to
> have been inactive for more than Y seconds.
> 
> The PGMap contains "last_active" and "last_clean", but these timestamps
> are never updated. So I can't query for last_active <= (now() - 300), for
> example.
> 
> On an idle test cluster I have a PG, for example:
> 
> "last_active": "2015-12-09 02:32:31.540712",
> 
> It's currently 08:53:56 here, so I can't check against last_active.
> 
> What would a good way be to see for how long a PG has been inactive?

It sounds like maybe the current code is subtly broken:

https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2566

The last_active/clean etc should be fresh within 
osd_pg_stat_report_interval_max seconds...
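
For reference, the check being described might look roughly like this (an
illustrative sketch only -- the real patch would live in PGMonitor.cc and use
the actual PGMap/pg_stat_t fields):

  #include <ctime>
  #include <map>
  #include <string>
  #include <utility>

  // pgs: pg id -> (is_active, last_active timestamp)
  int count_stuck_inactive(const std::map<std::string,
                                          std::pair<bool, time_t>>& pgs,
                           time_t now, time_t grace) {
    int stuck = 0;
    for (const auto& p : pgs) {
      bool active = p.second.first;
      time_t last_active = p.second.second;
      if (!active && last_active < now - grace)   // inactive for > grace secs
        ++stuck;
    }
    return stuck;
  }

  // caller: if (count_stuck_inactive(pgs, time(nullptr), 300) >= X)
  //             -> raise HEALTH_ERR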

sage


Re: Filestore without journal

2015-12-09 Thread Sage Weil
On Wed, 9 Dec 2015, changtao381 wrote:
> Hi Cephers,
> 
> Why does FileStore use a journal?  From my understanding, it is used to
> prevent partial writes.
> 
> In my view, FileStore doesn't need a journal for the EC backend and the RGW
> object storage scenarios.
> 
> For the EC backend, right now there are only write-full, append-write, and
> truncate operations, which already have a rollback mechanism.
> 
> For the RGW application, there are only append writes.
> 
> So, in my opinion, we should develop a kind of FileStore with no journal,
> which writes data using aio and persists data directly to disk.
> The performance would also be very good for large io.
> 
> Am I right? Thanks!

In theory, yes.  In reality, no--this isn't how the FileStore was 
implemented and it would take significant work to make it behave this way.  

Note that we are focusing our efforts on new implementations of 
the ObjectStore interface rather than investing time in rearchitecting 
FileStore (which, although slow, is known to be stable).

sage


Re: new OSD re-using old OSD id fails to boot

2015-12-08 Thread Sage Weil
On Tue, 8 Dec 2015, David Zafman wrote:
> Remember I really think we want a disk replacement feature that would retain
> the OSD id so that it avoids unnecessary data movement.  See tracker
> http://tracker.ceph.com/issues/13732

Yeah, I totally agree.  We just need to form an opinion on how... probably 
starting with the user experience.  Ideally we'd go from up + in to down + 
in to down + out, then pull the drive and replace, and then initialize a 
new OSD with the same id... and journal partition.  Something like

  ceph-disk recreate id=N uuid=U 

I.e., it could use the uuid (which the cluster has in the OSDMap) to find 
(and re-use) the journal device.

For a journal failure it'd probably be different.. but maybe not?

Any other ideas?

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: problem about pgmeta object?

2015-12-08 Thread Sage Weil
On Tue, 8 Dec 2015, Ning Yao wrote:
> Umm, it seems that MemStore requires an in-memory meta object to keep the
> attributes. So there is no direct way to remove the pg_meta object's
> backend storage. Any suggestions?
> I think we can just skip the pg_meta operation in the FileStore API based
> on the value of hoid.pgmeta(), not as a generic strategy for all
> backends.

Yeah--a FileStore hack makes more sense, as other backends probably 
won't have the same inefficiency that FileStore does.

Can you summarize the strategy, though?  We want to avoid doing anything 
too weird to FileStore at this point unless there is a big performance win.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD public / cluster network isolation using VRF:s

2015-12-07 Thread Sage Weil
On Mon, 7 Dec 2015, Martin Millnert wrote:
> > Note that on a largish cluster the public/client traffic is all 
> > north-south, while the backend traffic is also mostly north-south to the 
> > top-of-rack and then east-west.  I.e., within the rack, almost everything 
> > is north-south, and client and replication traffic don't look that 
> > different.
> 
> This problem domain is one of the larger challenges. I worry about
> network timeouts for critical cluster traffic in one of the clusters due
> to hosts having 2x1GbE. I.e. in our case I want to
> prioritize/guarantee/reserve a minimum amount of bandwidth for cluster
> health traffic primarily, and secondarily cluster replication. Client
> write replication should then be least prioritized.

One word of caution here: the health traffic should really take the 
same path and class of service as the inter-osd traffic, or else it 
will not identify failures.  E.g., if the health traffic is prioritized 
and lower-priority traffic is starved/dropped, we won't notice.
 
> To support this I need our network equipment to perform the CoS job, and
> in order to do that at some level in the stack I need to be able to
> classify traffic. And furthermore, I'd like to do this with as little
> added state as possible.

I seem to recall a conversation a year or so ago about tagging 
stream/sockets so that the network layer could do this.  I don't think 
we got anywhere, though...
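For what it's worth, a minimal sketch of what per-socket tagging could look
like with plain POSIX/Linux calls (the fd, the DSCP value, and the helper
name are all made up for illustration; nothing like this is wired into the
messenger today):

  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <netinet/ip.h>
  #include <cerrno>
  #include <cstring>
  #include <cstdio>

  // Tag an already-connected socket with a DSCP value so switches can put
  // cluster-critical traffic into its own class of service.
  static int tag_socket(int fd, int dscp, int prio) {
    int tos = dscp << 2;                 // DSCP sits in the top 6 bits of TOS
    if (setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) < 0)
      return -errno;
  #ifdef SO_PRIORITY
    // Linux-only: also set the local qdisc priority band
    if (setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio)) < 0)
      return -errno;
  #endif
    return 0;
  }

  int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);   // stand-in for a messenger fd
    int r = tag_socket(fd, 46 /* EF */, 6);
    if (r < 0)
      fprintf(stderr, "tag_socket: %s\n", strerror(-r));
    return 0;
  }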

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Why FailedAssertion is not my favorite exception

2015-12-04 Thread Sage Weil
On Fri, 4 Dec 2015, Adam C. Emerson wrote:
> Noble Creators of the Squid Cybernetic Swimming in a Distributed Data Sea,
> 
> There is a spectre haunting src/common/assert.cc: The spectre of throw
> FailedAssertion.
> 
> This seemingly inconsequential yet villainous statement destroys the
> stack frame in which a failing assert statement is evaluated-- a stack
> frame of great interest to those hoping to divine the cause of such
> failures-- at the moment of their detection.
> 
> This consequence follows from the hope that some caller might be able to
> catch and recover from the failure. That is an unworthy goal, for any
> failure sufficiently dire to rate an 'assert' is a failure from which
> there can be no recovery. As I survey the code, I see FailedAssertion
> is only caught as part of unit tests and in a few small programs where
> it leads to an immediate exit.
> 
> Therefore! If there is no objection, I would like to submit a patch that
> will replace 'throw FailedAssertion' with abort(). In support of this
> goal, the patch will also remove attempts to catch FailedAssertion from
> driver programs like librados-config and change tests expecting a throw
> of FailedAssertion to use the EXPECT_DEATH or ASSERT_DEATH macros instead.
> 
> These changes, taken together, should be non-disruptive and make
> debugging easier.

Sounds good.  Feel free to replace the code that deliberately induces a 
segv with abort(), too... I wasn't aware of it when writing the original 
code.
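For reference, the test-side change is mechanical.  A minimal googletest
sketch (assert_foo() is a made-up stand-in for code that trips an internal
assert; link against gtest_main as usual):

  #include "gtest/gtest.h"
  #include <cstdlib>

  // Hypothetical function that trips an internal assert on bad input.
  static void assert_foo(int v) {
    if (v < 0)
      std::abort();                       // was: throw FailedAssertion(...)
  }

  // before: EXPECT_THROW(assert_foo(-1), FailedAssertion);
  TEST(FailedAssertionDeathTest, AbortsOnBadInput) {
    ASSERT_DEATH(assert_foo(-1), "");     // "" matches any (or no) stderr output
    assert_foo(1);                        // the good path still returns normally
  }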

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] Ceph update for -rc4

2015-12-04 Thread Sage Weil
Hi Linus,

Please pull the following fix from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This addresses a refcounting bug that leads to a use-after-free.

Thanks!
sage



Ilya Dryomov (1):
  rbd: don't put snap_context twice in rbd_queue_workfn()

 drivers/block/rbd.c | 1 +
 1 file changed, 1 insertion(+)
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


ack vs commit

2015-12-03 Thread Sage Weil
From the beginning Ceph has had two kinds of acks for rados write/update 
operations: ack (indicating the operation is accepted, serialized, and 
staged in the osd's buffer cache) and commit (indicating the write is 
durable).  The client, if it saw a failure on the OSD before getting the 
commit, would have to resend the write to the new or recovered OSD, 
attaching the version it got with the original ack so that the 
new/recovered OSD could preserve the order.  (This is what the 
replay_window pool property is for.)

The only user for this functionality is CephFS, and only when a file is 
opened for write or read/write by multiple clients.  When that happens, 
the clients do all read/write operations synchronously with the OSDs, and 
effectively things are ordered there, at the object.  The idea is/was to 
avoid penalizing this sort of IO by waiting for commits when the clients 
aren't actually asking for that--at least until they call f[data]sync, at 
which point the client will block and wait for the second commit replies 
to arrive.

In practice, this doesn't actually help: everybody runs the OSDs on XFS, 
and FileStore has to do write-ahead journaling, which means that these 
operations are committed and durable before they are readable, and the 
clients only ever see 'commit' (and not 'ack' followed by 'commit') 
(commit implies ack too).   If you run the OSD on btrfs, you might see 
some acks (journaling and writes to the file system/page cache are done in 
parallel).

With newstore/bluestore, the current implementation doesn't make any 
attempt to make a write readable before it is durable.  This could be 
changed, but.. I'm not sure it's worth it.  For most workloads (RBD, RGW) 
we only care about making writes durable as quickly as possible, and all 
of the OSD optimization efforts are focusing on making this as fast as 
possible.  Is there much marginal benefit to the initial ack?  It's only 
the CephFS clients with the same file open from multiple hosts that might 
see a small benefit.

On the other hand, there is a bunch of code in the OSD and on the 
client side to deal with the dual-ack behavior we could potentially drop.  
Also, we are generating lots of extra messages on the backend network for 
both ack and commit, even though the librados users don't usually care 
(the clients don't get the dual acks unless they request them).

Also, it is arguably Wrong and a Bad Thing that you could have client A 
write to file F, client B reads from file F and sees that data, the OSD 
and client A both fail, and when things recover client B re-reads the same 
portion of the file and sees the file content from before client A's 
change.  The MDS is extremely careful about this on the metadata side: no 
side-effects of one client are visible to any other client until they are 
durable, so that a combination MDS and client failure will never make 
things appear to go back in time.

Any opinions here?  My inclination is to remove the functionality (less 
code, less complexity, more sane semantics), but we'd be closing the door 
on what might have been a half-decent idea (separating serialization from 
durability when multiple clients have the same file open for 
read/write)...
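For anyone following along, this is the distinction the librados C API
exposes through its two completion callbacks.  The sketch below is
illustrative only, with a placeholder pool/object name and error handling
trimmed:

  #include <rados/librados.h>
  #include <cstdio>

  // "complete" fires on ack (accepted/serialized); "safe" fires on commit
  // (durable).  On XFS+FileStore the two normally arrive together, which is
  // the point being made above.
  static void on_ack(rados_completion_t, void *)    { printf("ack\n"); }
  static void on_commit(rados_completion_t, void *) { printf("commit\n"); }

  int main() {
    rados_t cluster;
    rados_ioctx_t io;
    if (rados_create(&cluster, NULL) < 0) return 1;
    rados_conf_read_file(cluster, NULL);           // default ceph.conf search
    if (rados_connect(cluster) < 0) return 1;
    if (rados_ioctx_create(cluster, "rbd", &io) < 0) return 1;  // placeholder pool

    rados_completion_t c;
    rados_aio_create_completion(NULL, on_ack, on_commit, &c);
    const char buf[] = "hello";
    rados_aio_write(io, "testobj", c, buf, sizeof(buf), 0);
    rados_aio_wait_for_safe(c);                    // block until durable
    rados_aio_release(c);

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
  }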

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD public / cluster network isolation using VRF:s

2015-12-03 Thread Sage Weil
On Thu, 3 Dec 2015, w...@42on.com wrote:
> Why all the trouble and complexity? I personally always try to avoid the 
> two networks and run with one. Also in large L3 envs.
> 
> I like the idea that one machine has one IP I have to monitor.
> 
> I would rethink about what a cluster network really adds. Imho it only 
> adds complexity.

FWIW I tend to agree.  There are probably some network deployments where 
it makes sense, but for most people I think it just adds complexity.  
Maybe it makes it easy to utilize dual interfaces, but my guess is you're 
better off bonding them if you can.

Note that on a largish cluster the public/client traffic is all 
north-south, while the backend traffic is also mostly north-south to the 
top-of-rack and then east-west.  I.e., within the rack, almost everything 
is north-south, and client and replication traffic don't look that 
different.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: queue_transaction interface + unique_ptr + performance

2015-12-03 Thread Sage Weil
On Thu, 3 Dec 2015, Casey Bodley wrote:
> > Well, yeah we are, it's just the actual Transaction structure which
> > wouldn't be dynamic -- the buffers and many other fields would still
> > hit the allocator.
> > -Sam
> 
> Sure. I was looking specifically at the tradeoffs between allocating
> and moving the Transaction object itself.
> 
> As it currently stands, the caller of ObjectStore can choose whether to
> allocate its Transactions on the heap, embed them in other objects, or
> put them on the stack for use with apply_transactions(). Switching to an
> interface built around unique_ptr forces all callers to use the heap. I'm
> advocating for an interface that doesn't.

That leaves us with either std::move or.. the raw Transaction* we have 
now.  Right?
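Just to make the alternatives concrete, a toy sketch (Transaction here is a
stand-in for ObjectStore::Transaction, and the function names are made up):

  #include <memory>
  #include <string>
  #include <utility>
  #include <vector>

  struct Transaction {
    std::vector<std::string> ops;   // placeholder payload with its own move ctor
  };

  // A: raw pointer, as today -- lifetime rules live in comments, not the type.
  void queue_transaction_ptr(Transaction *t) { (void)t; }

  // B: unique_ptr -- ownership transfer is explicit, but every caller is
  // forced to allocate the Transaction shell on the heap.
  void queue_transaction_uptr(std::unique_ptr<Transaction> t) { (void)t; }

  // C: move the value -- no allocation for the shell itself; the move ctor
  // just shuffles a handful of ints and movable members.
  void queue_transaction_move(Transaction &&t) { Transaction sunk(std::move(t)); }

  int main() {
    Transaction t;                              // stack or embedded in another object
    t.ops.push_back("touch foo");
    queue_transaction_move(std::move(t));       // C

    auto u = std::unique_ptr<Transaction>(new Transaction);
    u->ops.push_back("touch bar");
    queue_transaction_uptr(std::move(u));       // B
    return 0;
  }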

> > > It's true that the move ctor has to do work. I counted 18 fields, half of
> > > which are integers, and the rest have move ctors themselves. But the cpu
> > > is good at integers. The win here is that you're not hitting the allocator
> > > in the fast path.

To be fair, many of these are also legacy that we can remove... possibly 
even now.  IIRC the only exposure to legacy encoded transactions (that use 
the tbl hackery) is journal items from an upgraded pre-hammer OSD that 
aren't flushed on upgrade.  We should have made the osd flush the journal 
before recording the 0_94_4 ondisk feature.  We could add another one to 
enforce that and rip all that code out now instead of waiting until 
after jewel... that would be satisfying (and I think an ondisk ceph-osd 
feature is enough here, then document that users should upgrade to 
hammer 0.94.6 or infernalis 9.2.1 before moving to jewel).

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: queue_transaction interface + unique_ptr + performance

2015-12-03 Thread Sage Weil
1- I agree we should avoid shared_ptr whenever possible.

2- unique_ptr should not have any more overhead than a raw pointer--the 
compiler is enforcing the single-owner semantics.  See for example

https://msdn.microsoft.com/en-us/library/hh279676.aspx

"It is exactly is efficient as a raw pointer and can be used in STL 
containers."

Unless the implementation is broken somehow?  That seems unlikely...

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fwd: Fwd: [newstore (again)] how disable double write WAL

2015-12-01 Thread Sage Weil
Hi David,

On Tue, 1 Dec 2015, David Casier wrote:
> Hi Sage,
> With a standard disk (4 to 6 TB), and a small flash drive, it's easy
> to create an ext4 FS with metadata on flash
> 
> Example with sdg1 on flash and sdb on hdd :
> 
> size_of() {
>   blockdev --getsize $1
> }
> 
> mkdmsetup() {
>   _ssd=/dev/$1
>   _hdd=/dev/$2
>   _size_of_ssd=$(size_of $_ssd)
>   echo "0 $_size_of_ssd linear $_ssd 0
> $_size_of_ssd $(size_of $_hdd) linear $_hdd 0" | dmsetup create dm-${1}-${2}
> }
> 
> mkdmsetup sdg1 sdb
> 
> mkfs.ext4 -O 
> ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode
> -E packed_meta_blocks=1,lazy_itable_init=0 -G 32768 -I 128 -i
> $((1024*512)) /dev/mapper/dm-sdg1-sdb
> 
> With that, all meta_blocks are on the SSD
> 
> If omap is on the SSD, there is almost no metadata on the HDD
> 
> Consequence: Ceph performance (with a hack on FileStore without journal
> and with direct IO) is almost the same as the raw performance of the HDD.
> 
> With cache-tier, it's very cool !

Cool!  I know XFS lets you do that with the journal, but I'm not sure if 
you can push the fs metadata onto a different device too.. I'm guessing 
not?

> That is why we are working on a hybrid approach HDD / Flash on ARM or Intel
> 
> With newstore, it's much more difficult to control the I/O profile,
> because RocksDB embeds its own intelligence.

This is coincidentally what I've been working on today.  So far I've just 
added the ability to put the rocksdb WAL on a second device, but it's 
super easy to push rocksdb data there as well (and have it spill over onto 
the larger, slower device if it fills up).  Or to put the rocksdb WAL on a 
third device (e.g., expensive NVMe or NVRAM).

See this ticket for the ceph-disk tooling that's needed:

http://tracker.ceph.com/issues/13942

I expect this will be more flexible and perform better than the ext4 
metadata option, but we'll need to test on your hardware to confirm!

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Compiling for FreeBSD

2015-12-01 Thread Sage Weil
On Tue, 1 Dec 2015, Alan Somers wrote:
> On Tue, Dec 1, 2015 at 11:08 AM, Willem Jan Withagen  wrote:
> > On 1-12-2015 18:22, Alan Somers wrote:
> >>
> >> I did some work porting Ceph to FreeBSD, but got distracted and
> >> stopped about two years ago.  You may find this port useful, though it
> >> will probably need to be updated:
> >>
> >> https://people.freebsd.org/~asomers/ports/net/ceph/
> >
> >
> > I'll check that one as well...
> >
> >> Also, there's one major outstanding issue that I know of.  It breaks
> >> interoperability between FreeBSD and Linux Ceph nodes.  I posted a
> >> patch to fix it, but it doesn't look like it's been merged yet.
> >> http://tracker.ceph.com/issues/6636
> >
> >
> > In the issues I find:
> > 
> > Updated by Sage Weil almost 2 years ago
> >
> > Status changed from New to Verified
> > Updated by Sage Weil almost 2 years ago
> >
> > Assignee set to Noah Watkins
> > 
> >
> > Probably left at that point because there was no pressure to actually commit?
> >
> > --WjW
> 
> It looks like Sage reviewed the change, but had some comments that
> were mostly style-related.  Neither Noah nor I actually got around to
> implementing Sage's suggestions.
> 
> https://github.com/ceph/ceph/pull/828

The uuid transition to boost::uuid has happened since then (a few months 
back) and I believe Rohan's AIX and Solaris ports for librados (that just 
merged) included a fix for the sockaddr_storage issue:

https://github.com/ceph/ceph/blob/master/src/msg/msg_types.h#L180

and also

https://github.com/ceph/ceph/blob/master/src/msg/msg_types.h#L160

?

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CodingStyle on existing code

2015-12-01 Thread Sage Weil
On Tue, 1 Dec 2015, Wido den Hollander wrote:
> 
> On 01-12-15 16:00, Gregory Farnum wrote:
> > On Tue, Dec 1, 2015 at 5:47 AM, Loic Dachary  wrote:
> >>
> >>
> >> On 01/12/2015 14:10, Wido den Hollander wrote:
> >>> Hi,
> >>>
> >>> While working on mon/PGMonitor.cc I see that there is a lot of
> >>> inconsistency on the code.
> >>>
> >>> A lot of whitespace problems, incorrect indentation, well, a lot of
> >>> things.
> >>>
> >>> Is this something we want to fix? With some scripts we can probably do
> >>> this easily, but it might cause merge hell with people working on 
> >>> features.
> >>
> >> A sane (but long) way to do that is to cleanup when fixing a bug or adding 
> >> a feature. With (a lot) of patience, it will eventually be better :-)
> > 
> > Yeah, we generally want you to follow the standards in any new code. A
> > mass update of the code style on existing code makes navigating the
> > history a little harder so a lot of people don't like it much, though.
> 
> Understood. But in this case I'm working in PGMonitor.cc. For just 20
> lines of code I probably shouldn't refactor the whole file, should I?

Easiest thing is to fix the code around your change.

I'm also open to a wholesale cleanup since it's a low-traffic file and 
likely won't conflict with other stuff in flight.  But, up to you!

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Compiling for FreeBSD

2015-12-01 Thread Sage Weil
On Tue, 1 Dec 2015, Willem Jan Withagen wrote:
> On 1-12-2015 14:30, Sage Weil wrote:
> > On Tue, 1 Dec 2015, Willem Jan Withagen wrote:
> > > On 30-11-2015 14:21, Sage Weil wrote:
> > > > The problem with all of the porting code in general is that it is doomed
> > > > to break later on if we don't have (at least) ongoing build tests.  In
> > > > order for a FreeBSD or OSX port to continue working we need VMs that run
> > > > either gitbuilder or a jenkins job or similar so that we can tell when
> > > > it
> > > > breaks.
> > > > 
> > > > If someone is willing to run a VM somewhere to do this we can pretty
> > > > easily stick it on the gitbuilder page at
> > > > 
> > > > http://ceph.com/gitbuilder.cgi
> > > 
> > > 
> > > Hi Sage,
> > > 
> > > Could you give some pointers as to where to start running the tests.
> > > I see a lot of "basic" tests to see if the platform is actually
> > > conformant.
> > > 
> > > So before plunging into running ceph-mon and stuff, it would perhaps be
> > > better to actually run (parts of) the basic required tests..
> > 
> > I would start with 'make check' from src/... that's what we'd actually
> > want the gitbuilder to do.
> 
> I was running that at the moment
> Found the suggestion on the developers pages, in the manual section.
> Sort of hidden at the bottom. :)
> 
> Did kill it in between, but now when I run it, it just only generates the
> report.
> So I just went make clean, which is rather too much...
> But could not really figure out the makefiles in test (yet)
> 
> How do I reset the test results?

I don't think there is anything to reset... just re-run make check.  The 
exception is probably just if you hit control-c but it left running 
processes behind (./stop.sh should clean those up).

At least, that's the case on Linux.. maybe the (auto)tools are a bit 
different on *BSD?

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Compiling for FreeBSD

2015-12-01 Thread Sage Weil
On Tue, 1 Dec 2015, Willem Jan Withagen wrote:
> On 30-11-2015 14:21, Sage Weil wrote:
> > The problem with all of the porting code in general is that it is doomed
> > to break later on if we don't have (at least) ongoing build tests.  In
> > order for a FreeBSD or OSX port to continue working we need VMs that run
> > either gitbuilder or a jenkins job or similar so that we can tell when it
> > breaks.
> > 
> > If someone is willing to run a VM somewhere to do this we can pretty
> > easily stick it on the gitbuilder page at
> > 
> > http://ceph.com/gitbuilder.cgi
> 
> 
> Hi Sage,
> 
> Could you give some pointers as to where to start running the tests.
> I see a lot of "basic" tests to see if the platform is actually conformant.
> 
> So before plunging into running ceph-mon and stuff, it would perhaps be
> better to actually run (parts of) the basic required tests..

I would start with 'make check' from src/... that's what we'd actually 
want the gitbuilder to do.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to open clog debug

2015-11-30 Thread Sage Weil
On Mon, 30 Nov 2015, Wukongming wrote:
> Hi, All
> 
> Does anyone know how to enable clog debug?

It's usually something like

monc->clog.debug() << "hi there\n";

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Compiling for FreeBSD

2015-11-30 Thread Sage Weil
The problem with all of the porting code in general is that it is doomed 
to break later on if we don't have (at least) ongoing build tests.  In 
order for a FreeBSD or OSX port to continue working we need VMs that run 
either gitbuilder or a jenkins job or similar so that we can tell when it 
breaks.

If someone is willing to run a VM somewhere to do this we can pretty 
easily stick it on the gitbuilder page at

http://ceph.com/gitbuilder.cgi

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rgw/civetweb privileged port bind

2015-11-26 Thread Sage Weil
On Thu, 26 Nov 2015, Karol Mroz wrote:
> Hello,
> 
> As I understand it, with the release of infernalis, ceph
> daemons are no longer being run as root. Thus, rgw/civetweb
> is unable to bind to privileged ports:
> 
> http://tracker.ceph.com/issues/13600
> 
> We encountered this problem as well in our downstream (hammer
> based) product, where we run rgw/civetweb as "wwwuser". To allow
> privileged port binding, we used file caps (setcap from the spec file).
> Going forward, however, we were thinking of taking one of two
> approaches:
> 
> 1. Start rgw/civetweb as root and utilize an existing civetweb
> config option (run_as_user) to drop permissions _after_
> the port bind and after certificate files have been read.
>
> 2. Utilize systemd socket activation, and allow systemd to bind
> to the necessary port. Once rgw/civetweb is started, civetweb
> can pull the listening socket from systemd.
> 
> Is this something you folks upstream have given some thought to?

I haven't. #2 sounds like it's harder, and I'm not sure it brings a lot of 
benefit. Making #1 work is probably super simple (replace our set user 
option with the civetweb one?)...

What do you suggest?

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: why my cluster become unavailable (min_size of pool)

2015-11-26 Thread Sage Weil
On Thu, 26 Nov 2015, hzwulibin wrote:
> Hi, Sage
> 
> I have a question about the min_size of a pool.
> 
> The default value of min_size is 2, but with this setting, when two OSDs 
> are down (meaning two replicas are lost) at the same time, IO will be blocked. 
> We want to set min_size to 1 in our production environment, as we think 
> it's a normal case for two OSDs (on different hosts, of course) to be down 
> at the same time.
> 
> So is there any potential problem with this setting?

min_size = 1 is okay, but be aware that it will increase the risk of a 
pg history like

 epoch 10: osd.0, osd.1, osd.2
 epoch 11: osd.0   (1 and 2 down)
 epoch 12: - (osd.0 fails hard)
 epoch 13: osd.1 osd.2

i.e., a pg is serviced by a single osd for some period (possibly very 
short) and then fails permanently, and any writes during that period are 
*only* stored on that osd.  It'll require some manual recovery to get past 
it (mark that osd as lost, and accept that you may have lost some recent 
writes to the data).

sage



 

> 
> We use 0.80.10 version.
> 
> Thanks!
> 
> 
> -- 
> hzwulibin
> 2015-11-26
> 
> -----
> From: hzwulibin
> Sent: 2015-11-23 09:00
> To: Sage Weil, Haomai Wang
> Cc: ceph-devel
> Subject: Re: why my cluster become unavailable
> 
> Hi, Sage
> 
> Thanks! Will try it when next testing!
> 
> --     
> hzwulibin
> 2015-11-23
> 
> -
> From: Sage Weil
> Sent: 2015-11-22 01:49
> To: Haomai Wang
> Cc: Libin Wu, ceph-devel
> Subject: Re: why my cluster become unavailable
> 
> On Sun, 22 Nov 2015, Haomai Wang wrote:
> > On Thu, Nov 19, 2015 at 11:26 PM, Libin Wu  wrote:
> > > Hi, cephers
> > >
> > > I have a cluster of 6 OSD servers; every server has 8 OSDs.
> > >
> > > I marked out 4 OSDs on every server, and then my client IO is blocking.
> > >
> > > I rebooted my client and then created a new rbd device, but the new
> > > device also can't write IO.
> > >
> > > Yeah, I understand that some data may be lost as all three replicas of some
> > > objects were lost, but why does the cluster become unavailable?
> > >
> > > There are 80 incomplete pgs and 4 down+incomplete pgs.
> > >
> > > Is there any way I could solve the problem?
> > 
> > Yes, if you don't have a special crushmap to control the data
> > placement policy, pgs will lack the necessary metadata to boot. You need
> > to re-add the outed osds or force-remove the pgs which are incomplete (hope
> > it's just a test).
> 
> Is min_size 2 or 1?  Reducing it to 1 will generally clear some of the 
> incomplete pgs.  Just remember to raise it back to 2 after the cluster 
> recovers.
> 
> sage
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Cache Tiering Investigation and Potential Patch

2015-11-25 Thread Sage Weil
On Wed, 25 Nov 2015, Nick Fisk wrote:
> > > Yes I think that should definitely be an improvement. I can't quite
> > > get my head around how it will perform in instances where you miss 1
> > > hitset but all others are a hit. Like this:
> > >
> > > H H H M H H H H H H H H
> > >
> > > And recency is set to 8 for example. It may be that it doesn't have
> > > much effect on the overall performance. It might be that there is a
> > > strong separation of really hot blocks and hot blocks, but this could
> > > turn out to be a good thing.
> > 
> > Yeah... In the above case recency 3 would be enough (or 9, depending on
> > whether that's chronological or reverse chronological order).  Doing an N 
> > out
> > of M or similar is a bit more flexible and probably something we should add
> > on top.  (Or, we could change recency to be N/M instead of just
> > N.)
> 
> N out of M, is that similar to what I came up with but combined with the 
> N most recent sets?

Yeah

> If you can wait a couple of days I will run the PR 
> in its current state through my test box and see how it looks.

Sounds great, thanks.

> Just a quick question, is there a way to just make+build the changed 
> files/package or select just to build the main ceph.deb. I'm just using 
> " sudo dpkg-buildpackage" at the moment and its really slowing down any 
> "sudo dpkg-buildpackage" at the moment and it's really slowing down any 

You can probably 'make ceph-osd' and manually copy that binary into 
place, assuming distro matches your build and test environments...

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Cache Tiering Investigation and Potential Patch

2015-11-25 Thread Sage Weil
On Wed, 25 Nov 2015, Nick Fisk wrote:
> Hi Sage
> 
> > -Original Message-
> > From: Sage Weil [mailto:s...@newdream.net]
> > Sent: 25 November 2015 17:38
> > To: Nick Fisk 
> > Cc: 'ceph-users' ; ceph-devel@vger.kernel.org;
> > 'Mark Nelson' 
> > Subject: Re: Cache Tiering Investigation and Potential Patch
> > 
> > On Wed, 25 Nov 2015, Nick Fisk wrote:
> > > Presentation from the performance meeting.
> > >
> > > I seem to be unable to post to Ceph-devel, so can someone please
> > > repost there if useful.
> > 
> > Copying ceph-devel.  The problem is just that your email is HTML-formatted.
> > If you send it in plaintext vger won't reject it.
> 
> Right ok, let's see if this gets through. 
> 
> > 
> > > I will try and get a PR sorted, I realise that this change modifies
> > > the way the cache was originally designed but I think it provides a
> > > quick win for the performance increase involved. If there are plans
> > > for a better solution in time for the next release, then I would be
> > > really interested in working to that goal instead.
> > 
> > It's how it was intended/documented to work, so I think this falls in the 
> > 'bug
> > fix' category.  I did a quick PR here:
> > 
> > https://github.com/ceph/ceph/pull/6702
> > 
> > Does that look right?
> 
> Yes I think that should definitely be an improvement. I can't quite get 
> my head around how it will perform in instances where you miss 1 hitset 
> but all others are a hit. Like this:
> 
> H H H M H H H H H H H H
> 
> And recency is set to 8 for example. It may be that it doesn't have much 
> effect on the overall performance. It might be that there is a strong 
> separation of really hot blocks and hot blocks, but this could turn out 
> to be a good thing.

Yeah... In the above case recency 3 would be enough (or 9, depending on 
whether that's chronological or reverse chronological order).  Doing an N 
out of M or similar is a bit more flexible and probably something we 
should add on top.  (Or, we could change recency to be N/M instead of just 
N.)
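To make the difference concrete, here is a toy version of the two checks over
a hit/miss history like the one above, newest first (nothing here is the real
HitSet/promotion code):

  #include <algorithm>
  #include <iostream>
  #include <vector>

  // hits[0] is the most recent hitset; true = the object was in it.
  // "recency N": the object must be in ALL of the last N hitsets.
  static bool promote_recency(const std::vector<bool>& hits, size_t n) {
    if (hits.size() < n) return false;
    return std::all_of(hits.begin(), hits.begin() + n, [](bool h) { return h; });
  }

  // "N out of M": the object must be in at least n of the last m hitsets.
  static bool promote_n_of_m(const std::vector<bool>& hits, size_t n, size_t m) {
    m = std::min(m, hits.size());
    size_t c = std::count(hits.begin(), hits.begin() + m, true);
    return c >= n;
  }

  int main() {
    // H H H M H H H H H H H H from the example above, newest first
    std::vector<bool> hits = {1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1};
    std::cout << promote_recency(hits, 8) << "\n";    // 0: the miss in slot 3 blocks it
    std::cout << promote_n_of_m(hits, 8, 12) << "\n"; // 1: 11 of the last 12 are hits
    return 0;
  }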
 
> Would it be useful for me to run all 3 versions (Old, this and mine) 
> through the same performance test I did before?

If you have time, sure!  At the very least it'd be great to see the new 
version go through the same test.

> Also I saw pull request 6623, is it still relevant to get the list order 
> right?

Oh right, I forgot about that one.  I'll incorporate that fix and then you 
can test that version.

Thanks!
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cache Tiering Investigation and Potential Patch

2015-11-25 Thread Sage Weil
On Wed, 25 Nov 2015, Nick Fisk wrote:
> Presentation from the performance meeting.
> 
> I seem to be unable to post to Ceph-devel, so can someone please repost
> there if useful.

Copying ceph-devel.  The problem is just that your email is 
HTML-formatted. If you send it in plaintext vger won't reject it.

> I will try and get a PR sorted, I realise that this change modifies the way
> the cache was originally designed but I think it provides a quick win for
> the performance increase involved. If there are plans for a better solution
> in time for the next release, then I would be really interested in working
> to that goal instead.

It's how it was intended/documented to work, so I think this falls in the 
'bug fix' category.  I did a quick PR here:

https://github.com/ceph/ceph/pull/6702

Does that look right?

Thanks!
sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: cluster busy, cause heartbeat exceptional, cluster becomes more busy

2015-11-25 Thread Sage Weil
On Wed, 25 Nov 2015, Chenxiaowei wrote:
>  We met another serious problem, as follows:
> 
> During backfill, the rbd client sends ops to the cluster and slow requests
> come up, and so
> 
> when the osd heartbeat comes in, the check
> cct->get_heartbeat_map()->is_healthy() returns false.
> 
> So other osds will not receive heartbeats and will report failure info to
> the monitor; the monitor marks the osd down, leading
> 
> to more osd peering and an even busier cluster. So here comes the question:
> 
> why is the osd heartbeat check logic combined with the heartbeatmap (which
> checks other threadpools and so on)?
> 
> I am really confused about this logic. Looking forward to your reply.

The idea is simply that if the OSD is not healthy (e.g., stuck op thread) 
it should not respond to heartbeats and tell other OSDs that it is 
healthy.  It should get marked down.  After it recovers 
(wait_for_healthy), then it can rejoin the cluster.  (Or, more likely, it 
the thread is completely stuck and it will suicide.)

I think the issue is that the backfill + client load was enough to make 
is_healthy() fail.. that really shouldn't be happening.  As long as the 
threads are making progress they won't fail their internal heartbeat 
checks--that only happens if they get completely stuck.  I suspect 
something else broke?
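For the record, the internal check is conceptually just the toy below (not
the actual HeartbeatMap code): a thread that is merely busy keeps touching
its timestamp and stays healthy; only a thread that stops making progress
past its grace period makes is_healthy() fail.

  #include <chrono>
  #include <map>
  #include <mutex>
  #include <string>

  using Clock = std::chrono::steady_clock;

  class ToyHeartbeatMap {
    std::mutex lock;
    std::map<std::string, Clock::time_point> last_touch;
  public:
    void touch(const std::string& thread) {          // called as work progresses
      std::lock_guard<std::mutex> g(lock);
      last_touch[thread] = Clock::now();
    }
    bool is_healthy(std::chrono::seconds grace) {
      std::lock_guard<std::mutex> g(lock);
      auto now = Clock::now();
      for (auto& p : last_touch)
        if (now - p.second > grace)
          return false;    // a stuck thread -> stop acking peer heartbeats
      return true;
    }
  };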

sage

Re: Fwd: [newstore (again)] how disable double write WAL

2015-11-24 Thread Sage Weil
On Tue, 24 Nov 2015, Sébastien VALSEMEY wrote:
> Hello Vish,
> 
> Please apologize for the delay in my answer.
> Following the conversation you had with my colleague David, here are 
> some more details about our work :
> 
> We are working on Filestore / Newstore optimizations by studying how we 
> could set ourselves free from using the journal.
> 
> It is very important to work with SSD, but it is also mandatory to 
> combine it with regular magnetic platter disks. This is why we are 
> combining metadata storage on flash with data storage on disk.

This is pretty common, and something we will support natively with 
newstore.
 
> Our main goal is to have control over performance, which is quite 
> difficult with NewStore, and needs fundamental hacks with FileStore.

Can you clarify what you mean by "quite difficult with NewStore"?

FWIW, the latest bleeding edge code is currently at 
github.com/liewegas/wip-bluestore.

sage


> Is Samsung working on ARM boards with embedded flash and a SATA port, in 
> order to allow us to work on a hybrid approach? What is your line of 
> work with Ceph?
> 
> How can we work together ?
> 
> Regards,
> Sébastien
> 
> > Begin forwarded message:
> > 
> > From: David Casier 
> > Date: 12 October 2015 20:52:26 UTC+2
> > To: Sage Weil , Ceph Development 
> > 
> > Cc: Sébastien VALSEMEY , 
> > benoit.lor...@aevoo.fr, Denis Saget , "luc.petetin" 
> > 
> > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > 
> > Ok,
> > Great.
> > 
> > With these  settings :
> > //
> > newstore_max_dir_size = 4096
> > newstore_sync_io = true
> > newstore_sync_transaction = true
> > newstore_sync_submit_transaction = true
> > newstore_sync_wal_apply = true
> > newstore_overlay_max = 0
> > //
> > 
> > And direct IO in the benchmark tool (fio)
> > 
> > I see that the HDD is 100% loaded and there is no transfer from /db to 
> > /fragments after stopping the benchmark: great!
> > 
> > But when I launch a bench with random blocks of 256k, I see random blocks 
> > between 32k and 256k on the HDD. Any idea?
> > 
> > Throughput to the HDD is about 8 MBps when it could be higher with larger 
> > blocks (~30 MBps),
> > and 70 MBps without fsync (hard drive cache disabled).
> > 
> > Other questions :
> > newstore_sync_io -> true = fsync immediately, false = fsync later (Thread 
> > fsync_wq) ?
> > newstore_sync_transaction -> true = sync in DB ?
> > newstore_sync_submit_transaction -> if false then kv_queue (only if 
> > newstore_sync_transaction=false) ?
> > newstore_sync_wal_apply = true -> if false then WAL later (thread wal_wq) ?
> > 
> > Is it true ?
> > 
> > Way for cache with battery (sync DB and no sync data) ?
> > 
> > Thanks for everything !
> > 
> > On 10/12/2015 03:01 PM, Sage Weil wrote:
> >> On Mon, 12 Oct 2015, David Casier wrote:
> >>> Hello everybody,
> >>> fragment is stored in rocksdb before being written to "/fragments" ?
> >>> I separated "/db" and "/fragments" but during the bench, everything is written
> >>> writing
> >>> to "/db"
> >>> I changed options "newstore_sync_*" without success.
> >>> 
> >>> Is there any way to write all metadata in "/db" and all data in 
> >>> "/fragments" ?
> >> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> >> But if you are overwriting an existing object, doing write-ahead logging
> >> is usually unavoidable because we need to make the update atomic (and the
> >> underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> >> mitigates this somewhat for larger writes by limiting fragment size, but
> >> for small IOs this is pretty much always going to be the case.  For small
> >> IOs, though, putting things in db/ is generally better since we can
> >> combine many small ios into a single (rocksdb) journal/wal write.  And
> >> often leave them there (via the 'overlay' behavior).
> >> 
> >> sage
> >> 
> > 
> > 
> > -- 
> > 
> > 
> > Cordialement,
> > 
> > *David CASIER
> > DCConsulting SARL
> > 
> > 
> > 4 Trait d'Union
> > 77127 LIEUSAINT
> > 
> > **Ligne directe: _01 75 98 53 85_
> > Email: _david.casier@aevoo.fr_
> > * 

Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Sage Weil
On Mon, 23 Nov 2015, Robert LeBlanc wrote:
> Thanks for the log dump command, I'll keep that in the back pocket, it
> would have been helpful in a few situations.
> 
> I'm trying to microbenchmark the new Weighted Round Robin queue I've
> been working on and just trying to dump the info to the logs so that I
> can see it at runtime. So this is in a branch that isn't published
> yet.
> 
> In file included from osd/OSD.cc:37:0:
> osd/OSD.h: In member function 'virtual void
> OSD::ShardedOpWQ::_process(uint32_t, ceph::heartbeat_handle_d*)':
> osd/OSD.h:1072:7: error: invalid use of non-static data member 'OSD::whoami'
>    int whoami;
>        ^
> osd/OSD.cc:8270:388: error: from this location
>dout(15) << "Wrr (" << dendl;

#undef dout_prefix
#define dout_prefix *_dout << "something: "

(Whatever dout_prefix currently is for this code includes whoami... you're 
probably in a class other than OSD but still in OSD.cc.  Move it to a 
different .cc file, or put it above the current class OSD stuff.)

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Sage Weil
On Mon, 23 Nov 2015, Robert LeBlanc wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> Is there a way through the admin socket or inject args that can tell
> the OSD process to dump the in memory logs without crashing? Do you

Yep, 'ceph daemon osd.NN log dump'.

> have an idea of the overhead? From the code it looks like it is always
> evaluated, just depends on if it is stored in memory or dumped to
> disk. I'm trying to figure out an issue with dout() right now in the
> code I'm working on (invalid use of static member) and I'm trying to
> understand how it works.

What's the error and problematic line?

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Sage Weil
On Mon, 23 Nov 2015, Robert LeBlanc wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> We set the debugging to 0/0, but are you talking about lines like:
> 
>-12> 2015-11-20 20:59:47.138746 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.133 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
>-11> 2015-11-20 20:59:47.138749 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.136 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
>-10> 2015-11-20 20:59:47.138751 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.139 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
> -9> 2015-11-20 20:59:47.138758 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.147 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
> -8> 2015-11-20 20:59:47.138761 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.159 since back 2015-11-20
> 20:58:51.427880 front 2015-11-20 20:58:51.427880 (cutoff 2015-11-20
> 20:59:27.138720)
> -7> 2015-11-20 20:59:47.138789 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.170 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
> -6> 2015-11-20 20:59:47.138794 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.175 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
> 
> There are 10,000 of those lines in the OSD log which shows all the
> logs up to the crash. Unless setting the value to 0/0 is eliminating
> what you are looking for. I've been wondering if setting it to 0/1 or
> 0/5 or even 0/20 has any runtime performance penalty? It seems like
> more detailed info on crashes would be helpful, but we don't want to
> write too much to the SATADOMs.

There is a performance impact but no disk IO (logs are accumulated in 
memory and only flushed out on a crash).
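Mechanically it is just a bounded in-memory ring of recently formatted
entries, something like the toy below (not the actual src/log code), which is
why a higher memory level costs CPU and RAM but does no writes until a crash
or an explicit dump:

  #include <deque>
  #include <fstream>
  #include <string>

  class ToyMemLog {
    std::deque<std::string> recent;
    size_t max_recent;
  public:
    explicit ToyMemLog(size_t max = 10000) : max_recent(max) {}
    void submit(const std::string& line) {
      recent.push_back(line);
      if (recent.size() > max_recent)
        recent.pop_front();                 // stay bounded; no disk IO here
    }
    void dump(const std::string& path) {    // crash handler or 'log dump' calls this
      std::ofstream f(path, std::ios::app);
      for (const auto& l : recent)
        f << l << '\n';
    }
  };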

sage



> 
> We do have the NICs bonded all across our environment.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Mon, Nov 23, 2015 at 11:14 AM, Gregory Farnum  wrote:
> > On Mon, Nov 23, 2015 at 12:03 PM, Robert LeBlanc  wrote:
> >> -BEGIN PGP SIGNED MESSAGE-
> >> Hash: SHA256
> >>
> >> This is one of our production clusters which is dual 40 Gb Ethernet
> >> using VLANs for cluster and public networks. I don't think this is
> >> unusual, not like my dev cluster which runs Infiniband and IPoIB. The
> >> client nodes are connected at 10 GB Ethernet.
> >>
> >> I wonder if you are talking about the system logs, not the Ceph OSD
> >> logs. I'm attaching a snippet that includes the hour before and after.
> >
> > Nope, I meant the OSD logs. Whenever they crash, it should dump out
> > the last 10,000 in-memory log entries -- the one you sent along didn't
> > have a crash included at all. The exact system which timed out will
> > certainly be in those log entries (it's output at level 1, so unless
> > you manually turned everything to 0, it'll show up on a crash.)
> >
> > Anyway, I wouldn't expect that cluster config to have any issues with
> > a client dying since it's TCP over ethernet, but I have seen some
> > weird behaviors out of bonded NICs when one of them dies, so maybe.
> > -Greg
> >
> >> - 
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWU2LkCRDmVDuy+mK58QAA2EUP/22eOBNzAYDV5lGI4J9Z
> wnSZE39UycEfo8e6v8cfikLdAUT7fbY8HBq+VPylLo7OtxA+sGwgjrcz3hzu
> azRi9QuCeWNm+squPQpgISzXWnpDtSjlsA+7iQb+HJGW7/kcR+opixzMX/W5
> AE0Z/hrRwImw3r7Ze3Avl/j+l7iamUznfZAnaBdeWyle7Nge/D8kV+QJSeHe
> /zXDoWW8wPNiRwU/puJrH/GEzyYVZFZ4F9aPUKf9rXsp0chK5k55yysI8ABL
> CfBLtZ1yXPbD20knMdEyuQrDXWMGQplQ+7Z2qFAKsbp+qMFGNqeIbtA6xmbM
> +8RIXT5hTLmgH6lVLYFbk6wgiSphxTVFrkR4Bm6NzFHnloxZ3KuU1pqOZf2k
> iJZ8eDPfUxuforHO2L8TWMDWAsrqTm5A2u0GFtvm7uPWvxWo6sv08sq5IICD
> C75mnCRUIDGl/bQLxt06qvq7WwAtezwnNcwCth3kDFFS85WTgZGEtPgpFizt
> IpBQI4ustiT6lNmYQr6V2cj4HT1G8YBT1ykKwSYmsbRnT2PWGQc7IJ11DxgC
> E7i0c6UYcOMpWT18t+RTOzvv8AZGpna2X/xTJSPL2H10zIkiuXAwO/gZQ5oa
> mgN/3fdhcki8q7uWbZaBCNtv814sZIoTzQy7C7kApQdxFu+kbe5LHRhHZJbZ
> CExf
> =cjG0
> -END PGP SIGNATURE-
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.

v10.0.0 released

2015-11-23 Thread Sage Weil
This is the first development release for the Jewel cycle.  We are off to 
a good start, with lots of performance improvements flowing into the tree.  
We are targeting sometime in Q1 2016 for the final Jewel.

Notable Changes
---

* build: cmake tweaks (`pr#6254 <http://github.com/ceph/ceph/pull/6254>`_, John 
Spray)
* build: more CMake package check fixes (`pr#6108 
<http://github.com/ceph/ceph/pull/6108>`_, Daniel Gryniewicz)
* ceph-disk: get Nonetype when ceph-disk list with --format plain on single 
device. (`pr#6410 <http://github.com/ceph/ceph/pull/6410>`_, Vicente Cheng)
* ceph: fix tell behavior (`pr#6329 <http://github.com/ceph/ceph/pull/6329>`_, 
David Zafman)
* ceph-fuse: While starting ceph-fuse, start the log thread first (`issue#13443 
<http://tracker.ceph.com/issues/13443>`_, `pr#6224 
<http://github.com/ceph/ceph/pull/6224>`_, Wenjun Huang)
* client: don't mark_down on command reply (`pr#6204 
<http://github.com/ceph/ceph/pull/6204>`_, John Spray)
* client: drop prefix from ints (`pr#6275 
<http://github.com/ceph/ceph/pull/6275>`_, John Coyle)
* client: sys/file.h includes for flock operations (`pr#6282 
<http://github.com/ceph/ceph/pull/6282>`_, John Coyle)
* cls_rbd: change object_map_update to return 0 on success, add logging 
(`pr#6467 <http://github.com/ceph/ceph/pull/6467>`_, Douglas Fuller)
* cmake: Use uname instead of arch. (`pr#6358 
<http://github.com/ceph/ceph/pull/6358>`_, John Coyle)
* common: assert: __STRING macro is not defined by musl libc. (`pr#6210 
<http://github.com/ceph/ceph/pull/6210>`_, John Coyle)
* common: fix OpTracker age histogram calculation (`pr#5065 
<http://github.com/ceph/ceph/pull/5065>`_, Zhiqiang Wang)
* common/MemoryModel: Added explicit feature check for mallinfo(). (`pr#6252 
<http://github.com/ceph/ceph/pull/6252>`_, John Coyle)
* common/obj_bencher.cc: fix verification crashing when there's no objects 
(`pr#5853 <http://github.com/ceph/ceph/pull/5853>`_, Piotr Dałek)
* common: optimize debug logging (`pr#6307 
<http://github.com/ceph/ceph/pull/6307>`_, Adam Kupczyk)
* common: Thread: move copy constructor and assignment op (`pr#5133 
<http://github.com/ceph/ceph/pull/5133>`_, Michal Jarzabek)
* common: WorkQueue: new PointerWQ base class for ContextWQ (`issue#13636 
<http://tracker.ceph.com/issues/13636>`_, `pr#6525 
<http://github.com/ceph/ceph/pull/6525>`_, Jason Dillaman)
* compat: use prefixed typeof extension (`pr#6216 
<http://github.com/ceph/ceph/pull/6216>`_, John Coyle)
* crush: validate bucket id before indexing buckets array (`issue#13477 
<http://tracker.ceph.com/issues/13477>`_, `pr#6246 
<http://github.com/ceph/ceph/pull/6246>`_, Sage Weil)
* doc: download GPG key from download.ceph.com (`issue#13603 
<http://tracker.ceph.com/issues/13603>`_, `pr#6384 
<http://github.com/ceph/ceph/pull/6384>`_, Ken Dreyer)
* doc: fix outdated content in cache tier (`pr#6272 
<http://github.com/ceph/ceph/pull/6272>`_, Yuan Zhou)
* doc/release-notes: v9.1.0 (`pr#6281 
<http://github.com/ceph/ceph/pull/6281>`_, Loic Dachary)
* doc/releases-notes: fix build error (`pr#6483 
<http://github.com/ceph/ceph/pull/6483>`_, Kefu Chai)
* doc: remove toctree items under Create CephFS (`pr#6241 
<http://github.com/ceph/ceph/pull/6241>`_, Jevon Qiao)
* doc: rename the "Create a Ceph User" section and add verbage about… 
(`issue#13502 <http://tracker.ceph.com/issues/13502>`_, `pr#6297 
<http://github.com/ceph/ceph/pull/6297>`_, ritz303)
* docs: Fix styling of newly added mirror docs (`pr#6127 
<http://github.com/ceph/ceph/pull/6127>`_, Wido den Hollander)
* doc, tests: update all http://ceph.com/ to download.ceph.com (`pr#6435 
<http://github.com/ceph/ceph/pull/6435>`_, Alfredo Deza)
* doc: update doc for with new pool settings (`pr#5951 
<http://github.com/ceph/ceph/pull/5951>`_, Guang Yang)
* doc: update radosgw-admin example (`pr#6256 
<http://github.com/ceph/ceph/pull/6256>`_, YankunLi)
* doc: update the OS recommendations for newer Ceph releases (`pr#6355 
<http://github.com/ceph/ceph/pull/6355>`_, ritz303)
* drop envz.h includes (`pr#6285 <http://github.com/ceph/ceph/pull/6285>`_, 
John Coyle)
* libcephfs: Improve portability by replacing loff_t type usage with off_t 
(`pr#6301 <http://github.com/ceph/ceph/pull/6301>`_, John Coyle)
* libcephfs: only check file offset on glibc platforms (`pr#6288 
<http://github.com/ceph/ceph/pull/6288>`_, John Coyle)
* librados: fix examples/librados/Makefile error. (`pr#6320 
<http://github.com/ceph/ceph/pull/6320>`_, You Ji)
* librados: init crush_location from config file. (`issue#13473 
<http://tracker.ceph.com/issues/13473>`_, `pr#6243 
<http://github.com/ceph/ceph/pull/6243>`_, Wei Luo)
* librados: wrongly passed in argument for stat command 

Re: Crc32 Challenge

2015-11-23 Thread Sage Weil
On Mon, 23 Nov 2015, Gregory Farnum wrote:
> On Tue, Nov 17, 2015 at 10:51 AM, chris holcombe
>  wrote:
> > Hello Ceph Devs,
> >
> > I'm almost certain at this point that I have discovered a major bug in
> > ceph's crc32c mechanism.  http://tracker.ceph.com/issues/13713 I'm totally
> > open to be proven wrong and that's what this email is about.  Can someone
> > out there write a piece of code using an outside library that produces the
> > same crc32c checksums that Ceph does?  If they can I'll close my bug and
> > stand corrected :).  I've tried 3 python libraries and 1 rust library so far
> > and my conclusions are 1) they are all in agreement and 2) they all produce
> > different checksums than ceph's checksums
> > https://github.com/ceph/ceph/blob/83e10f7e2df0a71bd59e6ef2aa06b52b186fddaa/src/test/common/test_crc32c.cc#L21
> >
> > Start small and see if you can verify the "foo bar baz" checksum and then
> > try some of the others.
> >
> > For a known good checksum to test your program against use this:
> > http://www.pdl.cmu.edu/mailinglists/ips/mail/msg04970.html  In there Mark
> > Bakke talks about a 32 byte array of all 00h should produce a checksum of
> > 8A9136AA.  Printing that with python in decimal: 2324772522
> >
> > The implications of this are unfortunately tricky.  If I'm right and we fix
> > ceph's algorithm then it won't be able to talk to any previous version of
> > ceph past the beginning protocol handshake. There would have to be a
> > mechanism introduced so that any x and older version would speak the
> > previous crc and anything y and newer would speak the new version.  Another
> > option is we could break ceph's crc code out into a library and make that
> > available to everyone and call it ceph-crc32c.
> 
> I haven't checked the source for exactly where we use CRC32s, but I
> think the basic messenger protocol isn't checksummed ? we ought to be
> able to use the feature bits exchanged in the protocol handshake to
> decide which version of the crc to use?
> At least if it's worth changing; I've no idea about that.

The difference turned out to be a reasonably common convention of doing an 
xor ~0 with the final result.  We don't do that, and I don't think we 
should, since it would (1) be a really painful change and (2) make 
chaining together the crc values of multiple buffers more error-prone.  
So we're off the hook!
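For anyone who wants to reproduce the comparison, the whole difference is
that final inversion.  Here is a bit-by-bit sketch (real implementations use
tables or the SSE4.2 instruction, but the convention is the same): with the
usual init ~0 and final xor ~0 it reproduces the 8A9136AA value for 32 zero
bytes from the iSCSI thread, and without the final xor the raw register
chains across buffers trivially.

  #include <cstdint>
  #include <cstdio>
  #include <cstring>

  // CRC-32C (Castagnoli), reflected form, reversed polynomial 0x82F63B78.
  // Returns the raw crc register with no final inversion, so partial
  // buffers chain by feeding the output back in as the next seed.
  static uint32_t crc32c_update(uint32_t crc, const uint8_t *p, size_t len) {
    while (len--) {
      crc ^= *p++;
      for (int k = 0; k < 8; ++k)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
  }

  int main() {
    uint8_t zeros[32];
    memset(zeros, 0, sizeof(zeros));

    // conventional form: init ~0, final xor ~0 -> 8A9136AA for 32 zero bytes
    uint32_t conventional =
      crc32c_update(0xFFFFFFFFu, zeros, sizeof(zeros)) ^ 0xFFFFFFFFu;

    // chaining with the raw register: split the buffer, feed the crc through
    uint32_t a = crc32c_update(0xFFFFFFFFu, zeros, 16);
    uint32_t b = crc32c_update(a, zeros + 16, 16);

    printf("%08X %08X\n", conventional, b ^ 0xFFFFFFFFu);   // both 8A9136AA
    return 0;
  }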

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: why my cluster become unavailable

2015-11-21 Thread Sage Weil
On Sun, 22 Nov 2015, Haomai Wang wrote:
> On Thu, Nov 19, 2015 at 11:26 PM, Libin Wu  wrote:
> > Hi, cephers
> >
> > I have a cluster of 6 OSD servers; every server has 8 OSDs.
> >
> > I marked out 4 OSDs on every server, and then my client IO is blocking.
> >
> > I rebooted my client and then created a new rbd device, but the new
> > device also can't write IO.
> >
> > Yeah, I understand that some data may be lost as all three replicas of some
> > objects were lost, but why does the cluster become unavailable?
> >
> > There are 80 incomplete pgs and 4 down+incomplete pgs.
> >
> > Is there any way I could solve the problem?
> 
> Yes, if you don't have a special crushmap to control the data
> placement policy, pgs will lack the necessary metadata to boot. You need
> to re-add the outed osds or force-remove the pgs which are incomplete (hope
> it's just a test).

Is min_size 2 or 1?  Reducing it to 1 will generally clear some of the 
incomplete pgs.  Just remember to raise it back to 2 after the cluster 
recovers.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD replacement feature

2015-11-20 Thread Sage Weil
On Fri, 20 Nov 2015, Wei-Chung Cheng wrote:
> Hi Loic and cephers,
> 
> Sure, I have time to help with (and comment on) this disk-replacement feature.
> This is a useful feature for handling disk failures :p
> 
> A simple procedure is described at http://tracker.ceph.com/issues/13732 :
> 1. set the noout flag - if the broken osd is a primary osd, can we handle that well?
> 2. stop the osd daemon and wait until the osd is actually down (or
> maybe use the deactivate option with ceph-disk)
> 
> These two steps above seem OK.
> About handling the crush map: should we remove the broken osd?
> If we do that, why do we set the noout flag? It will still trigger re-balancing
> after we remove the osd from the crushmap.

Right--I think you generally want to do either one or the other:

1) mark osd out, leave failed disk in place.  or, replace with new disk 
that re-uses the same osd id.

or,

2) remove osd from crush map.  replace with new disk (which gets new osd 
id).

I think re-using the osd id is awkward currently, so doing 1 and replacing 
the disk ends up moving data twice.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Aggregate failure report in ceph -s

2015-11-20 Thread Sage Weil
On Fri, 20 Nov 2015, Chen, Xiaoxi wrote:
> 
> Hi Sage,
> 
>    As we are looking at the failure detection part of Ceph (basically
> around the OSD flipping issue), we got a suggestion from a customer to
> show an aggregated failure report in "ceph -s". The idea is:
> 
>   When an OSD finds it cannot hear heartbeats from some of its peers, it
> will try to aggregate the failure domain, say "I cannot reach all my peers
> in Rack C, something is wrong?", and this kind of log will be shown in
> ceph -s.   So if we see ceph -s and notice a lot of complaints saying "cannot
> reach Rack C", we can easily diagnose that Rack C has some network issue.
> 
>  
> 
>   Does that make sense?

Yeah, sounds reasonable to me!  It's a bit more awkward to do this at the 
mon level since rack C may talk to the mon, but doing it at the OSD makes 
sense.  There will be a lot of heuristics involved, though.  I expect the 
messages might include

- cannot reach _% of peers outside of my $crushlevel $foo [on front|back]
- cannot reach _% of hosts in $crushlevel $foo [on front|back]

?
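
As a strawman (nothing like the real OSD internals), the aggregation could
look something like this: group unreachable heartbeat peers by crush bucket
and complain once most of a bucket is unreachable:

from collections import defaultdict

def aggregate_failures(peers, threshold=0.8):
    # peers: list of (osd_id, crush_bucket, reachable) observed via heartbeats
    counts = defaultdict(lambda: [0, 0])       # bucket -> [unreachable, total]
    for _osd, bucket, reachable in peers:
        counts[bucket][1] += 1
        if not reachable:
            counts[bucket][0] += 1
    msgs = []
    for bucket, (down, total) in counts.items():
        frac = float(down) / total
        if frac >= threshold:
            msgs.append('cannot reach %d%% of peers in %s'
                        % (round(frac * 100), bucket))
    return msgs

print(aggregate_failures([(1, 'rack-c', False), (2, 'rack-c', False),
                          (3, 'rack-d', True)]))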

Also note that it would be easiest to log these in the cluster log (ceph 
-w, not ceph -s).. I'm guessing that's what you mean?

Thanks!
sage

v0.80.11 Firefly released

2015-11-19 Thread Sage Weil
This is a bugfix release for Firefly.  The Firefly 0.80.x series is nearing
its planned end of life in January 2016, and this may also be the last release.

We recommend that all Firefly users upgrade.

For more detailed information, see the complete changelog at

  http://docs.ceph.com/docs/master/_downloads/v0.80.11.txt

Notable Changes
---

* build/ops: /etc/init.d/radosgw restart does not work correctly (#11140, 
Dmitry Yatsushkevich)
* build/ops: Fix -Wno-format and -Werror=format-security options clash  
(#13417, Boris Ranto)
* build/ops: ceph-common needs python-argparse on older distros, but doesn't 
require it (#12034, Nathan Cutler)
* build/ops: ceph.spec.in running fdupes unnecessarily (#12301, Nathan Cutler)
* build/ops: ceph.spec.in: 50-rbd.rules conditional is wrong (#12166, Nathan 
Cutler)
* build/ops: ceph.spec.in: useless %py_requires breaks SLE11-SP3 build (#12351, 
Nathan Cutler)
* build/ops: fedora21 has junit, not junit4  (#10728, Ken Dreyer, Loic Dachary)
* build/ops: upstart: configuration is too generous on restarts (#11798, Sage 
Weil)
* common: Client admin socket leaks file descriptors (#11535, Jon Bernard)
* common: FileStore calls syncfs(2) even it is not supported (#12512, Danny 
Al-Gaaf, Kefu Chai, Jianpeng Ma)
* common: HeartBeat: include types (#13088, Sage Weil)
* common: Malformed JSON command output when non-ASCII strings are present  
(#7387, Kefu Chai, Tim Serong)
* common: Memory leak in Mutex.cc, pthread_mutexattr_init without 
pthread_mutexattr_destroy (#11762, Ketor Meng)
* common: Thread:pthread_attr_destroy(thread_attr) when done with it (#12570, 
Piotr Dałek, Zheng Qiankun)
* common: ThreadPool add/remove work queue methods not thread safe (#12662, 
Jason Dillaman)
* common: buffer: critical bufferlist::zero bug (#12252, Haomai Wang)
* common: log: take mutex while opening fd (#12465, Samuel Just)
* common: recursive lock of md_config_t (0) (#12614, Josh Durgin)
* crush: take crashes due to invalid arg (#11602, Sage Weil)
* doc: backport v0.80.10 release notes to firefly (#11090, Loic Dachary, Sage 
Weil)
* doc: update docs to point to download.ceph.com (#13162, Alfredo Deza)
* fs: MDSMonitor: handle MDSBeacon messages properly (#11590, Kefu Chai)
* fs: client nonce collision due to unshared pid namespaces (#13032, Josh 
Durgin, Sage Weil)
* librbd: Objectcacher setting max object counts too low (#7385, Jason Dillaman)
* librbd: aio calls may block (#11056, Haomai Wang, Sage Weil, Jason Dillaman)
* librbd: internal.cc: 1967: FAILED assert(watchers.size() == 1) (#12176, Jason 
Dillaman)
* mon: Clock skew causes missing summary and confuses Calamari (#11877, 
Thorsten Behrens)
* mon: EC pools are not allowed as cache pools, disallow in the mon (#11650, 
Samuel Just)
* mon: Make it more difficult to delete pools in firefly (#11800, Sage Weil)
* mon: MonitorDBStore: get_next_key() only if prefix matches (#11786, Joao 
Eduardo Luis)
* mon: PaxosService: call post_refresh() instead of post_paxos_update() 
(#11470, Joao Eduardo Luis)
* mon: add a cache layer over MonitorDBStore (#12638, Kefu Chai)
* mon: adding exsting pool as tier with --force-nonempty clobbers removed_snaps 
(#11493, Sage Weil, Samuel Just)
* mon: ceph fails to compile with boost 1.58 (#11576, Kefu Chai)
* mon: does not check for IO errors on every transaction (#13089, Sage Weil)
* mon: get pools health'info have error (#12402, renhwztetecs)
* mon: increase globalid default for firefly (#13255, Sage Weil)
* mon: pgmonitor: wrong "at/near target max" reporting (#12401, huangjun)
* mon: register_new_pgs() should check ruleno instead of its index (#12210, 
Xinze Chi)
* mon: scrub error (osdmap encoding mismatch?) upgrading from 0.80 to ~0.80.2 
(#8815, #8674, #9064, Sage Weil, Zhiqiang Wang, Samuel Just)
* mon: the output is wrong when runing ceph osd reweight (#12251, Joao Eduardo 
Luis)
* objecter: can get stuck in redirect loop if osdmap epoch == 
last_force_op_resend (#11026, Jianpeng Ma, Sage Weil)
* objecter: pg listing can deadlock when throttling is in use (#9008, Guang 
Yang)
* objecter: resend linger ops on split (#9806, Josh Durgin, Samuel Just)
* osd: Cleanup boost optionals for boost 1.56 (#9983, William A. Kennington III)
* osd: LibRadosTwoPools[EC]PP.PromoteSnap failure (#10052, Sage Weil)
* osd: Mutex Assert from PipeConnection::try_get_pipe (#12437, David Zafman)
* osd: PG stuck with remapped (#9614, Guang Yang)
* osd: PG::handle_advance_map: on_pool_change after handling the map change 
(#12809, Samuel Just)
* osd: PGLog: split divergent priors as well (#11069, Samuel Just)
* osd: PGLog::proc_replica_log: correctly handle case where entries between 
olog.head and log.tail were split out (#11358, Samuel Just)
* osd: WBThrottle::clear_object: signal on cond when we reduce throttle values 
(#12223, Samuel Just)
* osd: cache full mode still skips young objects (#10006, Xinze Chi, Zhiqiang 
Wang)
* osd: crash creating/deleting pools (#12429, John Spray)
* osd: explicitly specify OSD f

Re: problem about pgmeta object?

2015-11-18 Thread Sage Weil
On Wed, 18 Nov 2015, Ning Yao wrote:
> Hi, Sage
> 
> The pgmeta object is a meta-object (like __head___2) without
> significant information. It is created in PG::_init() when
> handling pg_create and split_coll, and it always exists during the pg's
> life cycle until the pg is removed in RemoveWQ. The real content related
> to the pgmeta data is stored in omap. Could we just treat the pgmeta object as a
> logical object and not present it physically, so that we can avoid
> recursively searching the object path by checking if(hoid.pgmeta())?
> such as:
> int FileStore::_omap_setkeys(coll_t cid, const ghobject_t &hoid,
> const map<string, bufferlist> &aset,
> const SequencerPosition &spos) {
>   dout(15) << __func__ << " " << cid << "/" << hoid << dendl;
>   Index index;
>   int r;
>   if(hoid.pgmeta())
> goto out;
> ***
> ***
> ***
> out:
>   r = object_map->set_keys(hoid, aset, &spos);
>   dout(20) << __func__ << " " << cid << "/" << hoid << " = " << r << dendl;
>   return r;
> }

This seems like a reasonable hack. We never store any byte data in it.  
And if/when that changes, we can change this at the same time.

sage


Re: [CEPH] OSD daemons running with a large number of threads

2015-11-17 Thread Sage Weil
On Tue, 17 Nov 2015, ghislain.cheval...@orange.com wrote:
> Hi,
> 
> Context:
> Firefly 0.80.9
> Ubuntu 14.04.1
> Almost a production platform  in an openstack environment
> 176 OSD (SAS and SSD), 2 crushmap-oriented storage classes , 8 servers in 2 
> rooms, 3 monitors on openstack controllers
> Usage: Rados Gateway for object service and RBD as back-end for Cinder and 
> Glance
> 
> Issue:
> We are currently running performances tests on this cluster before turning it 
> to production.
> We created cinder volumes (attached to Ceph Back End) on virtual machines and 
> we use FIO to stress the cluster.
> A very large number of threads are created per OSD daemon (about 1000).

This is normal.  The init scripts set the max open files ulimit to a high 
value (usually 4194304 to avoid any possibility of hitting it) but you may 
need to set /proc/sys/kernel/pid_max to something big if your cluster is 
large.
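
A quick Linux-only sanity check (nothing ceph-specific) is to compare the
ceph-osd thread count on a host against pid_max:

import os

def pid_max():
    with open('/proc/sys/kernel/pid_max') as f:
        return int(f.read())

def ceph_osd_threads():
    total = 0
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            with open('/proc/%s/comm' % pid) as f:
                if f.read().strip() != 'ceph-osd':
                    continue
            with open('/proc/%s/status' % pid) as f:
                for line in f:
                    if line.startswith('Threads:'):
                        total += int(line.split()[1])
                        break
        except IOError:            # process went away while we were looking
            continue
    return total

print('ceph-osd threads: %d, pid_max: %d' % (ceph_osd_threads(), pid_max()))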

sage


Re: Newly added monitor infinitely sync store

2015-11-16 Thread Sage Weil
On Mon, 16 Nov 2015, Guang Yang wrote:
> I spoke to a leveldb expert, it looks like this is a known pattern on
> LSM tree data structure - the tail latency for range scan could be far
> longer than avg/median since it might need to mmap several sst files
> to get the record.
> 
> Hi Sage,
> Do you see any harm in increasing the default value for this setting
> (e.g. to 20 minutes)? Or should I add this advice for monitor
> trouble-shooting?

The timeout is just for a round trip for the sync process, right?  I think 
increasing it a bit (2x or 3x?) is okay, but 20 minutes to do a single 
chunk is a lot.

The underlying problem in your cases is that your store is huge (by ~2 
orders of magnitude), so I'm not sure we should tune against that :)

sage


 > 
> Thanks,
> Guang
> 
> On Fri, Nov 13, 2015 at 9:07 PM, Guang Yang  wrote:
> > Thanks Sage! I will definitely try those patches.
> >
> > For this one, I finally managed to bring the new monitor in by
> > increasing the mon_sync_timeout from its default 60 to 6 to make
> > sure the syncing does not restart and result in an infinite loop..
> >
> > On Fri, Nov 13, 2015 at 5:04 PM, Sage Weil  wrote:
> >> On Fri, 13 Nov 2015, Guang Yang wrote:
> >>> Thanks Sage!
> >>>
> >>> On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil  wrote:
> >>> > On Fri, 13 Nov 2015, Guang Yang wrote:
> >>> >> I was wrong the previous analysis, it was not the iterator got reset,
> >>> >> the problem I can see now, is that during the syncing, a new round of
> >>> >> election kicked off and thus it needs to probe the newly added
> >>> >> monitor, however, since it hasn't been synced yet, it will restart the
> >>> >> syncing from there.
> >>> >
> >>> > What version of this?  I think this is something we fixed a while back?
> >>> This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578), is there
> >>> a commit I can take a look?
> >>
> >> Hrm, I guess it was way before that.. I'm thinking of
> >> b8af38b6fc161691d637631d9ce8ab84fb3d27c7 which was pre-firefly.  So I'm
> >> not sure exactly why an election would be restarting the sync in your
> >> case..
> >>
> >> You mentioned elsewhere that your mon store was very large, though (more
> >> than 10's of GB), which suggests you might be hitting the
> >> min_last_epoch_clean problem (which prevents osdmap trimming).. see
> >> b41408302b6529a7856a3b0a08c35e5fa284882e.  This was backported to hammer
> >> and firefly but not giant.
> >>
> >> sage
> >>
> 
> 


Re: scrub randomization and load threshold

2015-11-16 Thread Sage Weil
On Mon, 16 Nov 2015, Dan van der Ster wrote:
> On Mon, Nov 16, 2015 at 4:58 PM, Dan van der Ster  wrote:
> > On Mon, Nov 16, 2015 at 4:32 PM, Dan van der Ster  
> > wrote:
> >> On Mon, Nov 16, 2015 at 4:20 PM, Sage Weil  wrote:
> >>> On Mon, 16 Nov 2015, Dan van der Ster wrote:
> >>>> Instead of keeping a 24hr loadavg, how about we allow scrubs whenever
> >>>> the loadavg is decreasing (or below the threshold)? As long as the
> >>>> 1min loadavg is less than the 15min loadavg, we should be ok to allow
> >>>> new scrubs. If you agree I'll add the patch below to my PR.
> >>>
> >>> I like the simplicity of that, but I'm afraid it's going to just trigger a
> >>> feedback loop and oscillations on the host.  I.e., as soon as we see *any*
> >>> decrease, all osds on the host will start to scrub, which will push the
> >>> load up.  Once that round of PGs finish, the load will start to drop
> >>> again, triggering another round.  This'll happen regardless of whether
> >>> we're in the peak hours or not, and the high-level goal (IMO at least) is
> >>> to do scrubbing in non-peak hours.
> >>
> >> We checked our OSDs' 24hr loadavg plots today and found that the
> >> original idea of 0.8 * 24hr loadavg wouldn't leave many chances for
> >> scrubs to run. So maybe if we used 0.9 or 1.0 it would be doable.
> >>
> >> BTW, I realized there was a silly error in that earlier patch, and we
> >> anyway need an upper bound, say # cpus. So until your response came I
> >> was working with this idea:
> >> https://stikked.web.cern.ch/stikked/view/raw/5586a912
> >
> > Sorry for SSO. Here:
> >
> > https://gist.github.com/dvanders/f3b08373af0f5957f589
> 
> Hi again. Here's a first shot at a daily loadavg heuristic:
> https://github.com/ceph/ceph/commit/15474124a183c7e92f457f836f7008a2813aa672
> I had to guess where it would be best to store the daily_loadavg
> member and where to initialize it... please advise.
> 
> I took the conservative approach of triggering scrubs when either:
>1m loadavg < osd_scrub_load_threshold, or
>1m loadavg < 24hr loadavg && 1m loadavg < 15m loadavg
> 
> The whole PR would become this:
> https://github.com/ceph/ceph/compare/master...cernceph:wip-deepscrub-daily

Looks reasonable to me!

I'm still a bit worried that the 1m < 15m thing will mean that on the 
completion of every scrub we have to wait ~1m before the next scrub 
starts.  Maybe that's okay, though... I'd say let's try this and adjust 
that later if it seems problematic (conservative == better).
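
For reference, a python paraphrase of the check being proposed (daily_loadavg
is the new member the PR adds; 0.5 is, I believe, the current
osd_scrub_load_threshold default):

import os

def scrub_load_below_threshold(daily_loadavg, scrub_load_threshold=0.5):
    load_1m, _load_5m, load_15m = os.getloadavg()
    # always allow a scrub when the load is under the configured threshold
    if load_1m < scrub_load_threshold:
        return True
    # otherwise only allow it when the 1m load is below both the daily
    # average and the 15m average, i.e. the load is low-ish and falling
    return load_1m < daily_loadavg and load_1m < load_15m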

sage


Re: scrub randomization and load threshold

2015-11-16 Thread Sage Weil
On Mon, 16 Nov 2015, Dan van der Ster wrote:
> Instead of keeping a 24hr loadavg, how about we allow scrubs whenever
> the loadavg is decreasing (or below the threshold)? As long as the
> 1min loadavg is less than the 15min loadavg, we should be ok to allow
> new scrubs. If you agree I'll add the patch below to my PR.

I like the simplicity of that, but I'm afraid it's going to just trigger a 
feedback loop and oscillations on the host.  I.e., as soon as we see *any* 
decrease, all osds on the host will start to scrub, which will push the 
load up.  Once that round of PGs finish, the load will start to drop 
again, triggering another round.  This'll happen regardless of whether 
we're in the peak hours or not, and the high-level goal (IMO at least) is 
to do scrubbing in non-peak hours.

sage

> -- dan
> 
> 
> diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
> index 0562eed..464162d 100644
> --- a/src/osd/OSD.cc
> +++ b/src/osd/OSD.cc
> @@ -6065,20 +6065,24 @@ bool OSD::scrub_time_permit(utime_t now)
> 
>  bool OSD::scrub_load_below_threshold()
>  {
> -  double loadavgs[1];
> -  if (getloadavg(loadavgs, 1) != 1) {
> +  double loadavgs[3];
> +  if (getloadavg(loadavgs, 3) != 3) {
>  dout(10) << __func__ << " couldn't read loadavgs\n" << dendl;
>  return false;
>}
> 
>if (loadavgs[0] >= cct->_conf->osd_scrub_load_threshold) {
> -dout(20) << __func__ << " loadavg " << loadavgs[0]
> -<< " >= max " << cct->_conf->osd_scrub_load_threshold
> -<< " = no, load too high" << dendl;
> -return false;
> +if (loadavgs[0] >= loadavgs[2]) {
> +  dout(20) << __func__ << " loadavg " << loadavgs[0]
> +  << " >= max " << cct->_conf->osd_scrub_load_threshold
> +   << " and >= 15m avg " << loadavgs[2]
> +  << " = no, load too high" << dendl;
> +  return false;
> +}
>} else {
>  dout(20) << __func__ << " loadavg " << loadavgs[0]
>  << " < max " << cct->_conf->osd_scrub_load_threshold
> +<< " or < 15 min avg " << loadavgs[2]
>  << " = yes" << dendl;
>  return true;
>}
> 
> 


Re: a problem about FileStore::_destroy_collection

2015-11-16 Thread Sage Weil
On Mon, 16 Nov 2015, yangruifeng.09...@h3c.com wrote:
> an ENOTEMPTY error may happen when removing a pg in previous 
> versions, but the error is hidden in new versions?

When did this change?

sage

> _destroy_collection may return 0 when get_index or prep_delete returns < 0;
> 
> is this intended?
> 
> int FileStore::_destroy_collection(coll_t c) 
> {
>   int r = 0; //global r
>   char fn[PATH_MAX];
>   get_cdir(c, fn, sizeof(fn));
>   dout(15) << "_destroy_collection " << fn << dendl;
>   {
> Index from;
> int r = get_index(c, &from);//local r
> if (r < 0)
>   goto out;
> assert(NULL != from.index);
> RWLock::WLocker l((from.index)->access_lock);
> 
> r = from->prep_delete();
> if (r < 0)
>   goto out;
>   }
>   r = ::rmdir(fn);
>   if (r < 0) {
> r = -errno;
> goto out;
>   }
> 
>  out:
>   // destroy parallel temp collection, too
>   ...
> 
>  out_final:
>   dout(10) << "_destroy_collection " << fn << " = " << r << dendl;
>   return r;
> }


Re: Newly added monitor infinitely sync store

2015-11-13 Thread Sage Weil
On Fri, 13 Nov 2015, Guang Yang wrote:
> Thanks Sage!
> 
> On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil  wrote:
> > On Fri, 13 Nov 2015, Guang Yang wrote:
> >> I was wrong the previous analysis, it was not the iterator got reset,
> >> the problem I can see now, is that during the syncing, a new round of
> >> election kicked off and thus it needs to probe the newly added
> >> monitor, however, since it hasn't been synced yet, it will restart the
> >> syncing from there.
> >
> > What version of this?  I think this is something we fixed a while back?
> This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578), is there
> a commit I can take a look?

Hrm, I guess it was way before that.. I'm thinking of 
b8af38b6fc161691d637631d9ce8ab84fb3d27c7 which was pre-firefly.  So I'm 
not sure exactly why an election would be restarting the sync in your 
case..

You mentioned elsewhere that your mon store was very large, though (more 
than 10's of GB), which suggests you might be hitting the 
min_last_epoch_clean problem (which prevents osdmap trimming).. see 
b41408302b6529a7856a3b0a08c35e5fa284882e.  This was backported to hammer 
and firefly but not giant.

sage



Re: Newly added monitor infinitely sync store

2015-11-13 Thread Sage Weil
On Fri, 13 Nov 2015, Guang Yang wrote:
> I was wrong the previous analysis, it was not the iterator got reset,
> the problem I can see now, is that during the syncing, a new round of
> election kicked off and thus it needs to probe the newly added
> monitor, however, since it hasn't been synced yet, it will restart the
> syncing from there.

What version of this?  I think this is something we fixed a while back?

> Hi Sage and Joao,
> Is there a way to freeze the election by some tunable to let the sync finish?

We can't not do elections when something is asking for one (e.g., mon 
is down).

sage



> 
> Thanks,
> Guang
> 
> On Fri, Nov 13, 2015 at 9:00 AM, Guang Yang  wrote:
> > Hi Joao,
> > We have a problem when trying to add new monitors to the cluster on an
> > unhealthy cluster, which I would like ask for your suggestion.
> >
> > After adding the new monitor, it  started syncing the store and went
> > into an infinite loop:
> >
> > 2015-11-12 21:02:23.499510 7f1e8030e700 10
> > mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
> > cookie 4513071120 lc 14697737 bl 929616 bytes last_key
> > osdmap,full_22530) v2
> > 2015-11-12 21:02:23.712944 7f1e8030e700 10
> > mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
> > cookie 4513071120 lc 14697737 bl 799897 bytes last_key
> > osdmap,full_3259) v2
> >
> >
> > We talked early in the morning on IRC, and at the time I thought it
> > was because the osdmap epoch was increasing, which lead to this
> > infinite loop.
> >
> > I then set those nobackfill/norecovery flags and the osdmap epoch
> > freezed, however, the problem is still there.
> >
> > While the osdmap epoch is 22531, the switch always happened at
> > osdmap.full_22530 (as showed by the above log).
> >
> > Looking at the code on both sides, it looks like this check
> > (https://github.com/ceph/ceph/blob/master/src/mon/Monitor.cc#L1389)
> > is always true, and I can confirm from the log that (sp.last_commited <
> > paxos->get_version()) was false, so the chances are that the
> > sp.synchronizer always has a next chunk?
> >
> > Does this look familiar to you? Or any other trouble shoot I can try?
> > Thanks very much.
> >
> > Thanks,
> > Guang


[GIT PULL] Ceph changes for -rc1

2015-11-13 Thread Sage Weil
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There are several patches from Ilya fixing RBD allocation lifecycle 
issues, a series adding a nocephx_sign_messages option (and associated bug 
fixes/cleanups), several patches from Zheng improving the (directory) 
fsync behavior, a big improvement in IO for direct-io requests when 
striping is enabled from Caifeng, and several other small fixes and 
cleanups.

Thanks!
sage


Arnd Bergmann (1):
  ceph: fix message length computation

Geliang Tang (1):
  ceph: fix a comment typo

Ilya Dryomov (10):
  rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails
  rbd: don't free rbd_dev outside of the release callback
  rbd: set device_type::release instead of device::release
  rbd: remove duplicate calls to rbd_dev_mapping_clear()
  libceph: introduce ceph_x_authorizer_cleanup()
  libceph: msg signing callouts don't need con argument
  libceph: drop authorizer check from cephx msg signing routines
  libceph: stop duplicating client fields in messenger
  libceph: add nocephx_sign_messages option
  libceph: clear msg->con in ceph_msg_release() only

Ioana Ciornei (1):
  libceph: evaluate osd_req_op_data() arguments only once

Julia Lawall (1):
  rbd: drop null test before destroy functions

Shraddha Barke (2):
  libceph: remove con argument in handle_reply()
  libceph: use local variable cursor instead of &msg->cursor

Yan, Zheng (3):
  ceph: don't invalidate page cache when inode is no longer used
  ceph: add request to i_unsafe_dirops when getting unsafe reply
  ceph: make fsync() wait unsafe requests that created/modified inode

Zhu, Caifeng (1):
  ceph: combine as many iovec as possile into one OSD request

 drivers/block/rbd.c| 109 -
 fs/ceph/cache.c|   2 +-
 fs/ceph/caps.c |  76 ++--
 fs/ceph/file.c |  87 
 fs/ceph/inode.c|   1 +
 fs/ceph/mds_client.c   |  57 +++--
 fs/ceph/mds_client.h   |   3 ++
 fs/ceph/super.h|   1 +
 include/linux/ceph/libceph.h   |   4 +-
 include/linux/ceph/messenger.h |  16 ++
 net/ceph/auth_x.c  |  36 +-
 net/ceph/ceph_common.c |  18 +--
 net/ceph/crypto.h  |   4 +-
 net/ceph/messenger.c   |  88 ++---
 net/ceph/osd_client.c  |  34 ++---
 15 files changed, 314 insertions(+), 222 deletions(-)


Re: Notes from a discussion a design to allow EC overwrites

2015-11-13 Thread Sage Weil
On Thu, 12 Nov 2015, Samuel Just wrote:
> I was present for a discussion about allowing EC overwrites and thought it
> would be good to summarize it for the list:
> 
> Commit Protocol:
> 1) client sends write to primary
> 2) primary reads in partial stripes needed for partial stripe
> overwrites from replicas
> 3) primary sends prepares to participating replicas and queues its own
> prepare locally
> 4) once all prepares are complete, primary sends a commit to the client
> 5) primary sends applies to all participating replicas
> 
> When we get the prepare, we write out a temp object with the data to be
> written.  On apply, we use an objectstore primitive to atomically move those
> extents into the actual object.  The log entry contains the name/id for the
> temp object so it can be applied on apply or removed on rollback.

Currently we assume that temp objects are/can be cleared out on restart.  
This will need to change.  And we'll need to be careful that they get 
cleaned out when peering completes (and the rollforward/rollback decision 
is made).

If the stripes are small, then the objectstore primitive may not actually 
be that efficient.  I'd suggest also hinting that the temp object will be 
swapped later, so that the backend can, if it's small, store it in a cheap 
temporary location in the expectation that it will get rewritten later.  
(In particular, the newstore allocation chunk is currently targeting 
512kb, and this will only be efficient with narrow stripes, so it'll just 
get double-written.  We'll want to keep the temp value in the kv store 
[log, hopefully] and not bother to allocate disk and rewrite it.)

> Each log entry contains a list of the shard ids modified.  During peering, we
> use the same protocol for choosing the authoritative log for the existing EC
> pool, except that we first take the longest candidate log and use it to extend
> shorter logs until they hit an entry they should have witnessed, but didn't.
> 
> Implicit in the above scheme is the fact that if an object is written, but a
> particular shard isn't changed, the osd with that shard will have a copy of 
> the
> object with the correct data, but an out of date object_info (notably, the
> version will be old).  To avoid this messing up the missing set during the log
> scan on OSD start, we'll skip log entries we wouldn't have participated in (we
> may not even choose to store them, see below).  This does generally pose a
> challenge for maintaining prior_version.  It seems like it shouldn't be much 
> of
> a problem since rollbacks can only happen on prepared log entries which 
> haven't
> been applied, so log merging can never result in a divergent entry causing a
> missing object.  I think we can get by without it then?
> 
> We can go further with the above and actually never persist a log entry on a
> shard which it did not participate in.  As long as peering works correctly, 
> the
> union of the logs we got must have all entries.  The primary will need to keep
> a complete copy in memory to manage dup op detection and recovery, however.

That sounds more complex to me.  Maybe instead we could lazily persist the 
entries (on the next pg write) so that it is always a contiguous sequence?

> 2) above can also be done much more efficiently.  Some codes will allow the
> parity chunks to be reconstructed more efficiently, so some thought will
> have to go into restructuring the plugin interface to allow more efficient

Hopefully the stripe size is chosen such that most writes will end up 
being full stripe writes (we should figure out if the EC performance 
degrades significantly in that case?).

An alternative would be to do something like

1) client sends write to primary
2) primary sends prepare to the first M+1 shards, who write it in a 
temporary object/location
3) primary acks write once they ack
4) asynchronously, primary recalculates the affected stripes and 
sends an overwrite.

 - step 4 doesn't need to be 2-phase, since we have the original data 
persisted already on >M shards
 - the client-observed latency is bounded by only M+1 OSDs (not 
acting.size())
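
A toy, in-memory sketch of that flow (not real Ceph code; the "erasure code"
below is just a stand-in split, so only the shape of steps 2-4 is visible):

class Shard(object):
    def __init__(self):
        self.temp = {}      # oid -> raw pending write (cheap temp location)
        self.chunks = {}    # oid -> encoded stripe chunk

def fake_encode(data, n):
    # stand-in for the real erasure code: just split the buffer into n pieces
    step = (len(data) + n - 1) // n
    return [data[i * step:(i + 1) * step] for i in range(n)]

def client_write(shards, oid, data, m):
    # 2) prepare: persist the raw write on the first M+1 shards only
    for s in shards[:m + 1]:
        s.temp[oid] = data
    # 3) the primary can ack the client here: the data is already durable
    #    on more than M shards, even though no stripe has been rewritten
    acked = True
    # 4) later (synchronously in this toy), recalculate the affected stripes
    #    and overwrite them on every shard, dropping the temp copies
    for s, chunk in zip(shards, fake_encode(data, len(shards))):
        s.chunks[oid] = chunk
        s.temp.pop(oid, None)
    return acked

shards = [Shard() for _ in range(6)]        # e.g. k=4, m=2
client_write(shards, 'obj1', b'x' * 4096, m=2)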

I suspect you discussed this option, though, and have other concerns 
around its complexity?

sage


Re: data-at-rest compression

2015-11-13 Thread Sage Weil
On Fri, 13 Nov 2015, Alyona Kiselyova wrote:
> Hi,
> I was working on a pluggable compression interface in this work
> (https://github.com/ceph/ceph/pull/6361). In Igor's pull request it was
> suggested to reuse the common plugin infrastructure from the unmerged
> wip-plugin branch. Now I'm working on adapting it, and as far as I can see,
> I only need these two commits from it
> (https://github.com/ceph/ceph/commit/294bef3d12ec04d9febf1f850184be7653a4322c and
> https://github.com/ceph/ceph/commit/18ad8df1094db52c839dc6b2dc689fc882230acb).
> Sage, is it possible to make a standalone pull request with them, or should I
> just cherry-pick them into my branch?

Let's do a separate PR.

Thanks!
sage


> Thanks for answer.
> ---
> Best regards,
> Alyona Kiseleva
> 
> 
> On Tue, Nov 10, 2015 at 6:46 PM, Igor Fedotov  wrote:
> > Hi All,
> >
> > a while ago we had some conversations here about adding compression support
> > for EC pools.
> > Here is corresponding pull request implementing this feature:
> >
> > https://github.com/ceph/ceph/pull/6524/commits
> >
> > Appropriate blueprint is at:
> > http://tracker.ceph.com/projects/ceph/wiki/Rados_-_at-rest_compression
> >
> > All comments and reviews are highly appreciated.
> >
> > Thanks,
> > Igor.
> >


Re: scrub randomization and load threshold

2015-11-12 Thread Sage Weil
On Thu, 12 Nov 2015, Dan van der Ster wrote:
> On Thu, Nov 12, 2015 at 2:29 PM, Sage Weil  wrote:
> > On Thu, 12 Nov 2015, Dan van der Ster wrote:
> >> Hi,
> >>
> >> Firstly, we just had a look at the new
> >> osd_scrub_interval_randomize_ratio option and found that it doesn't
> >> really solve the deep scrubbing problem. Given the default options,
> >>
> >> osd_scrub_min_interval = 60*60*24
> >> osd_scrub_max_interval = 7*60*60*24
> >> osd_scrub_interval_randomize_ratio = 0.5
> >> osd_deep_scrub_interval = 60*60*24*7
> >>
> >> we understand that the new option changes the min interval to the
> >> range 1-1.5 days. However, this doesn't do anything for the thundering
> >> herd of deep scrubs which will happen every 7 days. We've found a
> >> configuration that should randomize deep scrubbing across two weeks,
> >> e.g.:
> >>
> >> osd_scrub_min_interval = 60*60*24*7
> >> osd_scrub_max_interval = 100*60*60*24 // effectively disabling this option
> >> osd_scrub_load_threshold = 10 // effectively disabling this option
> >> osd_scrub_interval_randomize_ratio = 2.0
> >> osd_deep_scrub_interval = 60*60*24*7
> >>
> >> but that (a) doesn't allow shallow scrubs to run daily and (b) is so
> >> far off the defaults that its basically an abuse of the intended
> >> behaviour.
> >>
> >> So we'd like to simplify how deep scrubbing can be randomized. Our PR
> >> (http://github.com/ceph/ceph/pull/6550) adds a new option
> >> osd_deep_scrub_randomize_ratio which  controls a coin flip to randomly
> >> turn scrubs into deep scrubs. The default is tuned so roughly 1 in 7
> >> scrubs will be run deeply.
> >
> > The coin flip seems reasonable to me.  But wouldn't it also/instead make
> > sense to apply the randomize ratio to the deep_scrub_interval?  By just
> > adding in the random factor here:
> >
> > https://github.com/ceph/ceph/pull/6550/files#diff-dfb9ddca0a3ee32b266623e8fa489626R3247
> >
> > That is what I would have expected to happen, and if the coin flip is also
> > there then you have two knobs controlling the same thing, which'll cause
> > confusion...
> >
> 
> That was our first idea. But that has a couple downsides:
> 
>   1.  If we use the random range for the deep scrub intervals, e.g.
> deep every 1-1.5 weeks, we still get quite bursty scrubbing until it
> randomizes over a period of many weeks/months. And I fear it might
> even lead to lower frequency harmonics of many concurrent deep scrubs.
> Using a coin flip guarantees uniformity starting immediately from time
> zero.
>
>   2. In our PR osd_deep_scrub_interval is still used as an upper limit
> on how long a PG can go without being deeply scrubbed. This way
> there's no confusion such as PGs going undeep-scrubbed longer than
> expected. (In general, I think this random range is unintuitive and
> difficult to tune (e.g. see my 2 week deep scrubbing config above).

Fair enough..
 
> For me, the most intuitive configuration (maintaining randomness) would be:
> 
>   a. drop the osd_scrub_interval_randomize_ratio because there is no
> shallow scrub thundering herd problem (AFAIK), and it just complicates
> the configuration. (But this is in a stable release now so I don't
> know if you want to back it out).

I'm inclined to leave it, even if it complicates config: just because we 
haven't noticed the shallow scrub thundering herd doesn't mean it doesn't 
exist, and I fully expect that it is there.  Also, if the shallow scrubs 
are lumpy and we're promoting some of them to deep scrubs, then the deep 
scrubs will be lumpy too.

>   b. perform a (usually shallow) scrub every
> osd_scrub_interval_(min/max) depending on a self-tuning load
> threshold.

Yep, although as you note we have some work to do to get there.  :)

>   c. do a coin flip each (b) to occasionally turn it into deep scrub.

Works for me.
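
A minimal sketch of (c), with the caveat that the 1-in-7 default and the
upper-bound behavior are taken from the PR description rather than anything
merged:

import random

def scrub_is_deep(last_deep_stamp, now,
                  deep_scrub_interval=7 * 24 * 3600,
                  deep_scrub_randomize_ratio=1.0 / 7):
    # coin flip: roughly one in seven shallow scrubs gets promoted to deep
    if random.random() < deep_scrub_randomize_ratio:
        return True
    # osd_deep_scrub_interval still acts as an upper bound on how long a
    # PG may go without being deeply scrubbed
    return now - last_deep_stamp > deep_scrub_interval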

>   optionally: d. remove osd_deep_scrub_randomize_ratio and replace it
> with  osd_scrub_interval_min/osd_deep_scrub_interval.

There is no osd_deep_scrub_randomize_ratio.  Do you mean replace 
osd_deep_scrub_interval with osd_deep_scrub_{min,max}_interval?

> >> Secondly, we'd also like to discuss the osd_scrub_load_threshold
> >> option, where we see two problems:
> >>- the default is so low that it disables all the shallow scrub
> >> randomization on all but completely idle clusters.
> >>- finding the correct osd_scrub_load_threshold for a cluster is
> >> surely unclear/difficult and probably a moving target for most prod clusters.

RE: [CEPH][Crush][Tunables] issue when updating tunables

2015-11-12 Thread Sage Weil
On Thu, 12 Nov 2015, ghislain.cheval...@orange.com wrote:
> Hi Sage,
> Thanks for the reply
> 
> You said 
> " You actually want straw_calc_version 1.  This is just confusing output from 
> the 'firefly' tunable detection... the straw_calc_version does not have any 
> client dependencies."
> 
> My objective is to have the most relevant tunables for a firefly platform.
> 
> I didn't understand if :
> - it's better to have straw_calc_version set to 1 but  tunables_optimal will 
> be automatically set to 0.
> In other words are the following tunables OK ?
> { "choose_local_tries": 0,
> >   "choose_local_fallback_tries": 0,
> >   "choose_total_tries": 50,
> >   "chooseleaf_descend_once": 1,
> >   "chooseleaf_vary_r": 1,
> >   "straw_calc_version": 0,
> >   "profile": "firefly",
> >   "optimal_tunables": 1,
> >   "legacy_tunables": 0,
> >   "require_feature_tunables": 1,
> >   "require_feature_tunables2": 1,
> >   "require_feature_tunables3": 1,
> >   "has_v2_rules": 0,
> >   "has_v3_rules": 0} 

It is best to also manually set straw_calc_version = 1.  It won't say 
firefly, but it will still be compatible with firefly clients (that 
tunable only affects the mon behavior when updating the crush map).

> - there's an issue with the tunables detection and update?

Yes and no.  We want 'ceph osd crush tunables firefly' to set options 
supported by all firefly deployments, and the initial firefly releases did 
not have the straw_calc_version = 1 support.  In some cases switching it 
on can trigger some data movement the next time the crush map is adjusted, 
so we leave it off to be conservative.  And we want the profile to match 
exactly what setting the profile sets.  But it's confusing since it isn't 
1:1 with what clients support.  And if it is a fresh cluster you are 
better off with straw_calc_version = 1.  (Same goes for old clusters, if 
you can tolerate a bit of initial rebalancing.)

sage



> 
> Best regards 
> 
> -Message d'origine-
> De : Sage Weil [mailto:s...@newdream.net] 
> Envoyé : mardi 10 novembre 2015 11:23
> À : CHEVALIER Ghislain IMT/OLPS
> Cc : ceph-devel@vger.kernel.org
> Objet : Re: [CEPH][Crush][Tunables] issue when updating tunables
> 
> On Tue, 10 Nov 2015, ghislain.cheval...@orange.com wrote:
> > Hi all,
> > 
> > Context:
> > Firefly 0.80.9
> > Ubuntu 14.04.1
> > Almost a production platform  in an openstack environment
> > 176 OSD (SAS and SSD), 2 crushmap-oriented storage classes , 8 servers 
> > in 2 rooms, 3 monitors on openstack controllers
> > Usage: Rados Gateway for object service and RBD as back-end for Cinder 
> > and Glance
> > 
> > The Ceph cluster was installed by Mirantis procedures 
> > (puppet/fuel/ceph-deploy):
> > 
> > I noticed that tunables were curiously set.
> > ceph  osd crush show-tunables ==>
> > { "choose_local_tries": 0,
> >   "choose_local_fallback_tries": 0,
> >   "choose_total_tries": 50,
> >   "chooseleaf_descend_once": 1,
> >   "chooseleaf_vary_r": 1,
> >   "straw_calc_version": 1,
> >   "profile": "unknown",
> >   "optimal_tunables": 0,
> >   "legacy_tunables": 0,
> >   "require_feature_tunables": 1,
> >   "require_feature_tunables2": 1,
> >   "require_feature_tunables3": 1,
> >   "has_v2_rules": 0,
> >   "has_v3_rules": 0}
> > 
> > I tried to update them
> > ceph  osd crush tunables optimal ==>
> > adjusted tunables profile to optimal
> > 
> > But when checking
> > ceph  osd crush show-tunables ==>
> > { "choose_local_tries": 0,
> >   "choose_local_fallback_tries": 0,
> >   "choose_total_tries": 50,
> >   "chooseleaf_descend_once": 1,
> >   "chooseleaf_vary_r": 1,
> >   "straw_calc_version": 1,
> >   "profile": "unknown",
> >   "optimal_tunables": 0,
> >   "legacy_tunables": 0,
> >   "require_feature_tunables": 1,
> >   "require_feature_tunables2": 1,
> >   "require_feature_tunables3": 1,
> >   "has_v2_rules": 0,
> >   "has_v3_rules": 0}
> >

Re: scrub randomization and load threshold

2015-11-12 Thread Sage Weil
On Thu, 12 Nov 2015, Dan van der Ster wrote:
> Hi,
> 
> Firstly, we just had a look at the new
> osd_scrub_interval_randomize_ratio option and found that it doesn't
> really solve the deep scrubbing problem. Given the default options,
> 
> osd_scrub_min_interval = 60*60*24
> osd_scrub_max_interval = 7*60*60*24
> osd_scrub_interval_randomize_ratio = 0.5
> osd_deep_scrub_interval = 60*60*24*7
> 
> we understand that the new option changes the min interval to the
> range 1-1.5 days. However, this doesn't do anything for the thundering
> herd of deep scrubs which will happen every 7 days. We've found a
> configuration that should randomize deep scrubbing across two weeks,
> e.g.:
> 
> osd_scrub_min_interval = 60*60*24*7
> osd_scrub_max_interval = 100*60*60*24 // effectively disabling this option
> osd_scrub_load_threshold = 10 // effectively disabling this option
> osd_scrub_interval_randomize_ratio = 2.0
> osd_deep_scrub_interval = 60*60*24*7
> 
> but that (a) doesn't allow shallow scrubs to run daily and (b) is so
> far off the defaults that its basically an abuse of the intended
> behaviour.
> 
> So we'd like to simplify how deep scrubbing can be randomized. Our PR
> (http://github.com/ceph/ceph/pull/6550) adds a new option
> osd_deep_scrub_randomize_ratio which  controls a coin flip to randomly
> turn scrubs into deep scrubs. The default is tuned so roughly 1 in 7
> scrubs will be run deeply.

The coin flip seems reasonable to me.  But wouldn't it also/instead make 
sense to apply the randomize ratio to the deep_scrub_interval?  By just 
adding in the random factor here:

https://github.com/ceph/ceph/pull/6550/files#diff-dfb9ddca0a3ee32b266623e8fa489626R3247

That is what I would have expected to happen, and if the coin flip is also 
there then you have two knobs controlling the same thing, which'll cause 
confusion...

> Secondly, we'd also like to discuss the osd_scrub_load_threshold
> option, where we see two problems:
>- the default is so low that it disables all the shallow scrub
> randomization on all but completely idle clusters.
>- finding the correct osd_scrub_load_threshold for a cluster is
> surely unclear/difficult and probably a moving target for most prod
> clusters.
> 
> Given those observations, IMHO the smart Ceph admin should set
> osd_scrub_load_threshold = 10 or higher, to effectively disable that
> functionality. In the spirit of having good defaults, I therefore
> propose that we increase the default osd_scrub_load_threshold (to at
> least 5.0) and consider removing the load threshold logic completely.

This sounds reasonable to me.  It would be great if we could use a 24-hour 
average as the baseline or something so that it was self-tuning (e.g., set 
threshold to .8 of the daily average), but that's a bit trickier.  Generally I'm 
all for self-tuning, though... too many knobs...

sage


merge commits reminder

2015-11-11 Thread Sage Weil
Just a reminder: we'd like to generate the release changelog from the 
merge commits.  Whenever merging a pull request, please remember to:

 - edit the first line to be what will appear in the changelog.  
Prefix it with the subsystem and give it a short, meaningful description.  
 - if the bug isn't mentioned in any of commits (Fixes: #1234), add it to 
the merge commit.

Thanks!
sage


Re: disabling buffer::raw crc cache

2015-11-11 Thread Sage Weil
On Wed, 11 Nov 2015, Ning Yao wrote:
> 2015-11-11 21:13 GMT+08:00 Sage Weil :
> > On Wed, 11 Nov 2015, Ning Yao wrote:
> >> >>>the code logic would touch crc cache is bufferlist::crc32c and 
> >> >>>invalidate_crc.
> >> >>Also for pg_log::_write_log(), but seems it is always miss and use at
> >> >>once, no need to cache crc actually?
> >> > Oh, no, it will be hit in FileJournal writing
> >> Still miss as buffer::ptr length diff with ::encode(crc, bl), right?
> >> So the previous ebl.crc32c(0) calculation would be also no need to
> >> cache.
> >
> > How about just skipping the cache logic if the raw length is less than
> > some threshold?  Say, 16KB or something?  That would cover the _write_log
> > case (small buffer) and more generally avoid the fixed overhead of caching
> > when recalculating is cheap.
> >
> > This was originally added with large writes in mind to avoid the crc
> > recalculation during journaling on armv7l.  It is presumably also helping
> > now that we have the opportunistic whole-object checksums for full or
> > sequential writes.
> 
> Reasonable, and we may also reconsider whether the
> map<pair<off, len>, crc> cache is really needed. The common
> case seems to always use the same offset and length. Is it better for us
> to just cache the last crc result (a single item would be efficient enough)?

Yeah, or maybe a fixed size array with a #define size of maybe 2 or 3.
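
Something like this, sketched in python with zlib.crc32 standing in for
crc32c and the slot count hard-coded to 3 (the real thing would live in
buffer::raw):

import zlib

class TinyCrcCache(object):
    SLOTS = 3                           # the "#define size of maybe 2 or 3"

    def __init__(self):
        self.entries = []               # [((off, len), crc)], newest last

    def lookup(self, off, length):
        for key, crc in self.entries:
            if key == (off, length):
                return crc
        return None

    def insert(self, off, length, crc):
        self.entries.append(((off, length), crc))
        if len(self.entries) > self.SLOTS:
            self.entries.pop(0)         # evict the oldest entry

def crc32_cached(cache, buf, off, length):
    crc = cache.lookup(off, length)
    if crc is None:
        crc = zlib.crc32(buf[off:off + length]) & 0xffffffff
        cache.insert(off, length, crc)
    return crc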

sage


Re: new scrub and repair discussion

2015-11-11 Thread Sage Weil
On Wed, 11 Nov 2015, kefu chai wrote:
> currently, scrub and repair are pretty primitive. there are several
> improvements which need to be made:
> 
> - the user should be able to initiate a scrub of a PG or an object
> - int scrub(pg_t, AioCompletion*)
> - int scrub(const string& pool, const string& nspace, const
> string& locator, const string& oid, AioCompletion*)
> - we need a way to query the result of the most recent scrub on a pg.
> - int get_inconsistent_pools(set* pools);
> - int get_inconsistent_pgs(uint64_t pool, paged* pgs);
> - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
> paged*)

What is paged<>?

> - the user should be able to query the content of the replica/shard
> objects in the event of an inconsistency.
> - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
> ObjectReadOperation *op, bool allow_inconsistent)

This is exposing a bunch of internal types (pg_t, pg_shard_t, epoch_t) up 
through librados.  We might want to consider making them strings or just 
unsigned or similar?  I'm mostly worried about making it hard for us to 
change the types later...

> - the user should be able to perform following fixes using a new
> aio_operate_scrub(
>   const std::string& oid,
>   shard_id_t shard,
>   AioCompletion *c,
>   ObjectWriteOperation *op)
> - specify which replica to use for repairing a content inconsistency
> - delete an object if it can't exist
> - write_full
> - omap_set
> - setattrs

For omap_set and setattrs do we want a _full-type equivalent, or would we 
support partial changes?  Partial updates won't necessarily resolve an 
inconsistency, but I think (?) in the ec case the full xattr set is in 
the log event?

> - the user should be able to repair snapset and object_info_t
> - ObjectWriteOperation::repair_snapset(...)
> - set/remove any property/attributes, for example,
> - to reset snapset.clone_overlap
> - to set snapset.clone_size
> - to reset the digests in object_info_t,
> - repair will create a new version so that possibly corrupted copies
> on down OSDs will get fixed naturally.
> 
> so librados will offer enough information and facilities, with which a
> smart librados client/script will be able to fix the inconsistencies
> found in the scrub.
> 
> as an example, suppose we run into a data inconsistency where the 3
> replicas fail to agree with each other after performing a deep
> scrub. probably we'd like to have an election to pick the auth copy.
> following pseudo code explains how we will implement this using the
> new rados APIs for scrub and repair.
> 
>  # something is not necessarily better than nothing
>  rados.aio_scrub(pg, completion)
>  completion.wait_for_complete()
>  for pool in rados.get_inconsistent_pools():
>   for pg in rados.get_inconsistent_pgs(pool):
># rados.get_inconsistent_pgs() throws if "epoch" expires
> 
>for oid, inconsistent in rados.get_inconsistent_pgs(pg,
> epoch).items():
> if inconsistent.is_data_digest_mismatch():
>  votes = defaultdict(int)
>  for osd, shard_info in inconsistent.shards:
>   votes[shard_info.object_info.data_digest] += 1
>  digest, _ = max(votes.items(), key=operator.itemgetter(1))
>  auth_copy = None
>  for osd, shard_info in inconsistent.shards.items():
>   if shard_info.object_info.data_digest == digest:
>auth_copy = osd
>break
>  repair_op = librados.ObjectWriteOperation()
>  repair_op.repair_pick(auth_copy,
> inconsistent.ver, epoch)
>  rados.aio_operate_scrub(oid, repair_op)
> 
> this plan was also discussed in the infernalis CDS. see
> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.

We should definitely make sure these are surfaced in the python bindings 
from the start.  :)

Sounds good to me!
sage



Re: disabling buffer::raw crc cache

2015-11-11 Thread Sage Weil
On Wed, 11 Nov 2015, Ning Yao wrote:
> >>>the code logic would touch crc cache is bufferlist::crc32c and 
> >>>invalidate_crc.
> >>Also for pg_log::_write_log(), but seems it is always miss and use at
> >>once, no need to cache crc actually?
> > Oh, no, it will be hit in FileJournal writing
> Still miss as buffer::ptr length diff with ::encode(crc, bl), right?
> So the previous ebl.crc32c(0) calculation would be also no need to
> cache.

How about just skipping the cache logic if the raw length is less than 
some threshold?  Say, 16KB or something?  That would cover the _write_log 
case (small buffer) and more generally avoid the fixed overhead of caching 
when recalculating is cheap.

This was originally added with large writes in mind to avoid the crc 
recalculation during journaling on armv7l.  It is presumably also helping 
now that we have the opportunistic whole-object checksums for full or 
sequential writes.

sage


Re: [CEPH][Crush][Tunables] issue when updating tunables

2015-11-10 Thread Sage Weil
On Tue, 10 Nov 2015, ghislain.cheval...@orange.com wrote:
> Hi all,
> 
> Context:
> Firefly 0.80.9
> Ubuntu 14.04.1
> Almost a production platform  in an openstack environment
> 176 OSD (SAS and SSD), 2 crushmap-oriented storage classes , 8 servers in 2 
> rooms, 3 monitors on openstack controllers
> Usage: Rados Gateway for object service and RBD as back-end for Cinder and 
> Glance
> 
> The Ceph cluster was installed by Mirantis procedures 
> (puppet/fuel/ceph-deploy):
> 
> I noticed that tunables were curiously set.
> ceph  osd crush show-tunables ==>
> { "choose_local_tries": 0,
>   "choose_local_fallback_tries": 0,
>   "choose_total_tries": 50,
>   "chooseleaf_descend_once": 1,
>   "chooseleaf_vary_r": 1,
>   "straw_calc_version": 1,
>   "profile": "unknown",
>   "optimal_tunables": 0,
>   "legacy_tunables": 0,
>   "require_feature_tunables": 1,
>   "require_feature_tunables2": 1,
>   "require_feature_tunables3": 1,
>   "has_v2_rules": 0,
>   "has_v3_rules": 0}
> 
> I tried to update them
> ceph  osd crush tunables optimal ==>
> adjusted tunables profile to optimal
> 
> But when checking
> ceph  osd crush show-tunables ==>
> { "choose_local_tries": 0,
>   "choose_local_fallback_tries": 0,
>   "choose_total_tries": 50,
>   "chooseleaf_descend_once": 1,
>   "chooseleaf_vary_r": 1,
>   "straw_calc_version": 1,
>   "profile": "unknown",
>   "optimal_tunables": 0,
>   "legacy_tunables": 0,
>   "require_feature_tunables": 1,
>   "require_feature_tunables2": 1,
>   "require_feature_tunables3": 1,
>   "has_v2_rules": 0,
>   "has_v3_rules": 0}
> 
> Nothing has changed.
> 
> I finally did
> ceph osd crush set-tunable straw_calc_version 0

You actually want straw_calc_version 1.  This is just confusing output 
from the 'firefly' tunable detection... the straw_calc_version does not 
have any client dependencies.

sage


> 
> and
> ceph  osd crush show-tunables ==>
> { "choose_local_tries": 0,
>   "choose_local_fallback_tries": 0,
>   "choose_total_tries": 50,
>   "chooseleaf_descend_once": 1,
>   "chooseleaf_vary_r": 1,
>   "straw_calc_version": 0,
>   "profile": "firefly",
>   "optimal_tunables": 1,
>   "legacy_tunables": 0,
>   "require_feature_tunables": 1,
>   "require_feature_tunables2": 1,
>   "require_feature_tunables3": 1,
>   "has_v2_rules": 0,
>   "has_v3_rules": 0}
> 
> It's OK
> 
> My question:
> Does the "ceph osd crush tunables " command change all the requested 
> parameters in order to set the tunables to the right profile?
> 
> Brgds
> 
> 


There is no next; only jewel

2015-11-09 Thread Sage Weil
Hey everyone,

Just a reminder that now that infernalis is out and we're back to focusing 
on jewel, we should send all bug fixes to the 'jewel' branch (which 
functions the same way the old 'next' branch did).  That is,

 bug fixes -> jewel
 new features -> master

Every dev release (hopefully we'll get back on a 2 week schedule) we'll 
slurp master into jewel for the next sprint.  And during each sprint we'll 
test/stabilize the jewel branch.

Expect feature freeze to be February-ish.

Thanks!
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Cannot start osd due to permission of journal raw device

2015-11-09 Thread Sage Weil
On Mon, 9 Nov 2015, Chen, Xiaoxi wrote:
> Hmm, I didn't use ceph-disk but partitioned & formatted the disk myself and 
> called ceph-osd --mkfs directly; that should be the reason why the udev rules 
> don't take effect?

Yeah... the udev rule is based on the GPT partition label.  For example,

https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules#L4-L5

sage



> 
> > -Original Message-
> > From: Sage Weil [mailto:s...@newdream.net]
> > Sent: Monday, November 9, 2015 9:18 PM
> > To: Chen, Xiaoxi
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: Cannot start osd due to permission of journal raw device
> > 
> > On Mon, 9 Nov 2015, Chen, Xiaoxi wrote:
> > > There is no such rules (only 70-persistent-net.rules) in my
> > > /etc/udev/ruled.d/
> > >
> > > Could you point me which part of the code create the rules file? Is
> > > that ceph-disk?
> > 
> > https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules
> > 
> > The package should install it in /lib/udev/rules.d or similar...
> > 
> > sage
> > 
> > > > -Original Message-
> > > > From: Sage Weil [mailto:s...@newdream.net]
> > > > Sent: Friday, November 6, 2015 6:33 PM
> > > > To: Chen, Xiaoxi
> > > > Cc: ceph-devel@vger.kernel.org
> > > > Subject: Re: Cannot start osd due to permission of journal raw
> > > > device
> > > >
> > > > On Fri, 6 Nov 2015, Chen, Xiaoxi wrote:
> > > > > Hi,
> > > > I tried infernalis (version 9.1.0
> > > > (3be81ae6cf17fcf689cd6f187c4615249fea4f61)) but it failed due to the
> > > > permissions of the journal; the OSD was upgraded from hammer (also true
> > > > for newly created OSDs).
> > > >   I am using a raw device as the journal, and the default
> > > > privilege of a raw block device is root:disk. Changing the journal owner
> > > > to ceph:ceph solves the issue. It seems we can either:
> > > >   1. add ceph to the "disk" group and run ceph-osd with --setuser ceph
> > > > --setgroup disk?
> > > >   2. require the user to set the ownership of the journal device to
> > > > ceph:ceph if they
> > > > want to use a raw journal?  Maybe we could do this in ceph-disk.
> > > >
> > > >   Personally I would prefer the second one; what do you think?
> > > >
> > > > The udev rules should be setting the journal device ownership to
> > ceph:ceph.
> > > > IIRC there was a race in ceph-disk that could prevent this from
> > > > happening in some cases but that is now fixed.  Can you try the 
> > > > infernalis
> > branch?
> > > >
> > > > sage
> > >
> > >
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph encoding optimization

2015-11-09 Thread Sage Weil
On Mon, 9 Nov 2015, Gregory Farnum wrote:
> On Wed, Nov 4, 2015 at 7:07 AM, Gregory Farnum  wrote:
> > The problem with this approach is that the encoded versions need to be
> > platform-independent -- they are shared over the wire and written to
> > disks that might get transplanted to different machines. Apart from
> > padding bytes, we also need to worry about endianness of the machine,
> > etc. *And* we often mutate structures across versions in order to add
> > new abilities, relying on the encode-decode process to deal with any
> > changes to the system. How could we deal with that if just dumping the
> > raw memory?
> >
> > Now, maybe we could make these changes on some carefully-selected
> > structs, I'm not sure. But we'd need a way to pick them out, guarantee
> > that we aren't breaking interoperability concerns, etc; and it would
> > need to be something we can maintain as a group going forward. I'm not
> > sure how to satisfy those constraints without burning a little extra
> > CPU. :/
> > -Greg
> 
> So it turns out we've actually had issues with this. Sage merged
> (wrote?) some little-endian-only optimizations to the cephx code that
> broke big-endian systems by doing a direct memcpy. Apparently our
> tests don't find these issues, which makes me even more nervous about
> taking that sort of optimization into the tree. :(

I think the way to make this maintainable will be to

1) Find a clean approach with a simple #if or #ifdef condition for 
little endian and/or architectures that can handle unaligned int pointer 
access.

2) Maintain the parallel optimized implementation next to the generic 
encode/decode in a way that makes it as easy as possible to make changes 
and keep them in sync.

3) Optimize *only* the most recent encoding to minimize complexity.

4) Ensure that there is a set of encode/decode tests that verify they both 
work, triggered by make check (so that a simple make check on a big 
endian box will catch errors).  Ideally this'd be part of the 
test/encoding/readable.sh so that we run it over the entire corpus of old 
encodings..
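
For illustration only, a minimal sketch of points (1)-(3) might look like the
following.  The struct, members, and #if condition are invented for the
example, and the real encoders also carry ENCODE_START/ENCODE_FINISH
versioning, which is omitted here:

  #include "include/buffer.h"
  #include "include/encoding.h"

  struct example_stats_t {
    uint64_t num_bytes;
    uint64_t num_objects;
    uint64_t num_reads;
  } __attribute__((packed));

  #if defined(__x86_64__)   // little-endian, unaligned-access-friendly
  // Optimized path: the in-memory layout already matches the canonical
  // little-endian wire format, so one bulk append suffices.
  inline void encode(const example_stats_t& s, ceph::buffer::list& bl) {
    bl.append(reinterpret_cast<const char*>(&s), sizeof(s));
  }
  #else
  // Generic path: encode member by member so big-endian or alignment-picky
  // architectures still emit exactly the same bytes.
  inline void encode(const example_stats_t& s, ceph::buffer::list& bl) {
    ::encode(s.num_bytes, bl);
    ::encode(s.num_objects, bl);
    ::encode(s.num_reads, bl);
  }
  #endif

Per point (4), a make check test would run the same values through both
paths (or against the stored corpus) and assert the resulting buffers are
byte-for-byte identical.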


sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help on ext4/xattr linux kernel stability issue / ceph xattr use?

2015-11-09 Thread Sage Weil
On Mon, 9 Nov 2015, Laurent GUERBY wrote:
> Hi,
> 
> Part of our ceph cluster is using ext4 and we recently hit major kernel
> instability in the form of kernel lockups every few hours, issues
> opened:
> 
> http://tracker.ceph.com/issues/13662
> https://bugzilla.kernel.org/show_bug.cgi?id=107301
> 
> On kernel.org kernel developpers are asking about ceph usage of xattr,
> in particular wether there are lots of common xattr key/value or wether
> they are all differents.
> 
> I attached a file with various xattr -l outputs:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=107301#c8
> https://bugzilla.kernel.org/attachment.cgi?id=192491
> 
> Looks like the "big" xattr "user.ceph._" is always different, same for
> the intermediate size "user.ceph.hinfo_key".
> 
> "user.cephos.spill_out" and "user.ceph.snapset" seem to have small
> values, and within a small value set.
> 
> Our cluster is used exclusively for virtual machines block devices with
> rbd, on replicated (3) and erasure coded pools (4+1 and 8+2).
> 
> Could someone knowledgeable add some information on ceph use of xattr in
> the kernel.org bugzilla above?

The above is all correct.  The mbcache (didn't know that existed!) is 
definitely not going to be useful here.
 
> Also I think it is necessary to warn ceph users to avoid ext4 at all
> costs until this kernel/ceph issue is sorted out: we went from
> relatively stable production for more than a year to crashes everywhere
> all the time since two weeks ago, probably after hitting some magic
> limit. We migrated our machines to ubuntu trusty, our SSD based
> filesystem to XFS but our HDD are still mostly on ext4 (60 TB
> of data to move so not that easy...).

Was there a ceph upgrade in there somewhere?  The size of the user.ceph._ 
xattr has increased over time, and (somewhat) recently crossed the 255 
byte threshold (on average) which also triggered a performance regression 
on XFS...

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Cannot start osd due to permission of journal raw device

2015-11-09 Thread Sage Weil
On Mon, 9 Nov 2015, Chen, Xiaoxi wrote:
> There are no such rules (only 70-persistent-net.rules) in my /etc/udev/rules.d/
> 
> Could you point me which part of the code create the rules file? Is that 
> ceph-disk?

https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules

The package should install it in /lib/udev/rules.d or similar...

sage

> > -Original Message-----
> > From: Sage Weil [mailto:s...@newdream.net]
> > Sent: Friday, November 6, 2015 6:33 PM
> > To: Chen, Xiaoxi
> > Cc: ceph-devel@vger.kernel.org
> > Subject: Re: Cannot start osd due to permission of journal raw device
> > 
> > On Fri, 6 Nov 2015, Chen, Xiaoxi wrote:
> > > Hi,
> > > I tried  infernalis (version 9.1.0
> > (3be81ae6cf17fcf689cd6f187c4615249fea4f61)) but failed due to permission
> > of journal ,  the OSD  was upgraded from hammer(also true for newly
> > created OSD).
> > >   I am using raw device as journal, this is because the default privilege 
> > > of
> > raw block is root:disk. Changing the journal owner to ceph:ceph solve the
> > issue. Seems we can either:
> > >   1. add ceph to "disk" group and run ceph-osd with --setuser ceph --
> > setgroup disk?
> > >   2. Require user to set the ownership of journal device to ceph:ceph if 
> > > they
> > want to use raw as journal?  Maybe we can do this in ceph-disk.
> > >
> > >Personally I would prefer the second one , what do you think?
> > 
> > The udev rules should be setting the journal device ownership to ceph:ceph.
> > IIRC there was a race in ceph-disk that could prevent this from happening in
> > some cases but that is now fixed.  Can you try the infernalis branch?
> > 
> > sage
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph encoding optimization

2015-11-08 Thread Sage Weil
On Sat, 7 Nov 2015, Haomai Wang wrote:
> Hi sage,
> 
> Could we know about your progress to refactor MSubOP and hobject_t,
> pg_stat_t decode problem?
> 
> We could work on this based on your work if any.

See Piotr's last email on this thread... it has Josh's patch attached.

sage


> 
> 
> On Thu, Nov 5, 2015 at 1:29 AM, Haomai Wang  wrote:
> > On Thu, Nov 5, 2015 at 1:19 AM, piotr.da...@ts.fujitsu.com
> >  wrote:
> >>> -Original Message-
> >>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> >>> ow...@vger.kernel.org] On Behalf Of ???
> >>> Sent: Wednesday, November 04, 2015 4:34 PM
> >>> To: Gregory Farnum
> >>> Cc: ceph-devel@vger.kernel.org
> >>> Subject: Re: ceph encoding optimization
> >>>
> >>> I agree with pg_stat_t (and friends) is a good first start.
> >>> The eversion_t and utime_t are also good choice to start because they are
> >>> used at many places.
> >>
> >> On Ceph Hackathon, Josh Durgin made initial steps in right direction in 
> >> terms of pg_stat_t encoding and decoding optimization, with the 
> >> endianness-awareness thing left out. Even in that state, performance 
> >> improvements offered by this change were huge enough to make it 
> >> worthwhile. I'm attaching the patch, but please note that this is 
> >> prototype and based on mid-August state of code, so you might need to take 
> >> that into account when applying the patch.
> >
> > Cool, it's exactly we want to see.
> >
> >>
> >>
> >> With best regards / Pozdrawiam
> >> Piotr Dałek
> >>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Request for Comments: Weighted Round Robin OP Queue

2015-11-08 Thread Sage Weil
On Fri, 6 Nov 2015, Robert LeBlanc wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> After trying to look through the recovery code, I'm getting the
> feeling that recovery OPs are not scheduled in the OP queue that I've
> been working on. Does that sound right? In the OSD logs I'm only
> seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
> If the recovery is in another separate queue, then there is no
> reliable way to prioritize OPs between them.
> 
> If I'm going off in to the weeds, please help me get back on the trail.

Yeah, the recovery work isn't in the unified queue yet.

sage



> 
> Thanks,
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc  wrote:
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA256
> >
> > On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil  wrote:
> >> On Thu, 5 Nov 2015, Robert LeBlanc wrote:
> >>> -BEGIN PGP SIGNED MESSAGE-
> >>> Hash: SHA256
> >>>
> >>> Thanks Gregory,
> >>>
> >>> People are most likely busy and haven't had time to digest this and I
> >>> may be expecting more excitement from it (I'm excited due to the
> >>> results and probably also that such a large change still works). I'll
> >>> keep working towards a PR, this was mostly proof of concept, now that
> >>> there is some data I'll clean up the code.
> >>
> >> I'm *very* excited about this.  This is something that almost every
> >> operator has problems with so it's very encouraging to see that switching
> >> up the queue has a big impact in your environment.
> >>
> >> I'm just following up on this after a week of travel, so apologies if this
> >> is covered already, but did you compare this implementation to the
> >> original one with the same tunables?  I see somewhere that you had
> >> max_backfills=20 at some point, which is going to be bad regardless of the
> >> queue.
> >>
> >> I also see that you changed the strict priority threshold from LOW to HIGH
> >> in OSD.cc; I'm curious how much of an impact was from this vs the queue
> >> implementation.
> >
> > Yes max_backfills=20 is problematic for both queues and from what I
> > can tell is because the OPs are waiting for PGs to get healthy. In a
> > busy cluster it can take a while due to the recovery ops having low
> > priority. In the current queue, it is possible to be blocked for a
> > long time. The new queue seems to prevent that, but they do still back
> > up. After this, I think I'd like to look into promoting recovery OPs
> > that are blocking client OPs to higher priorities so that client I/O
> > doesn't suffer as much during recovery. I think that will be a very
> > different problem to tackle because I don't think I can do the proper
> > introspection at the queue level. I'll have to do that logic in OSD.cc
> > or PG.cc.
> >
> > The strict priority threshold didn't make much of a difference with
> > the original queue. I initially eliminated it all together in the WRR,
> > but there were times that peering would never complete. I want to get
> > as many OPs in the WRR queue to provide fairness as much as possible.
> > I haven't tweaked the setting much in the WRR queue yet.
> >
> >>
> >>> I was thinking that a config option to choose the scheduler would be a
> >>> good idea. In terms of the project what is the better approach: create
> >>> a new template and each place the template class is instantiated
> >>> select the queue, or perform the queue selection in the same template
> >>> class, or something else I haven't thought of.
> >>
> >> A config option would be nice, but I'd start by just cleaning up the code
> >> and putting it in a new class (WeightedRoundRobinPriorityQueue or
> >> whatever).  If we find that it's behaving better I'm not sure how much
> >> value we get from a tunable.  Note that there is one other user
> >> (msgr/simple/DispatchQueue) that we might also want to switch over at some
> >> point.. especially if this implementation is faster.
> >>
> >> Once it's cleaned up (remove commented out code, new class) put it up as a
> >> PR and we can review and get it through testing.
> >
> > In talking with Samuel in IRC, we think creating an abstract class f

v9.2.0 Infernalis released

2015-11-06 Thread Sage Weil
ceph user will currently get a dynamically assigned UID when the
  user is created.

  If your systems already have a ceph user, upgrading the package will cause
  problems.  We suggest you first remove or rename the existing 'ceph' user
  and 'ceph' group before upgrading.

  When upgrading, administrators have two options:

   1. Add the following line to ``ceph.conf`` on all hosts::

setuser match path = /var/lib/ceph/$type/$cluster-$id

  This will make the Ceph daemons run as root (i.e., not drop
  privileges and switch to user ceph) if the daemon's data
  directory is still owned by root.  Newly deployed daemons will
  be created with data owned by user ceph and will run with
  reduced privileges, but upgraded daemons will continue to run as
  root.

   2. Fix the data ownership during the upgrade.  This is the
  preferred option, but it is more work and can be very time
  consuming.  The process for each host is to:

  1. Upgrade the ceph package.  This creates the ceph user and group.  For
 example::

   ceph-deploy install --stable infernalis HOST

  2. Stop the daemon(s).::

   service ceph stop   # fedora, centos, rhel, debian
   stop ceph-all   # ubuntu

  3. Fix the ownership::

   chown -R ceph:ceph /var/lib/ceph

  4. Restart the daemon(s).::

   start ceph-all# ubuntu
   systemctl start ceph.target   # debian, centos, fedora, rhel

  Alternatively, the same process can be done with a single daemon
  type, for example by stopping only monitors and chowning only
  ``/var/lib/ceph/mon``.

* The on-disk format for the experimental KeyValueStore OSD backend has
  changed.  You will need to remove any OSDs using that backend before you
  upgrade any test clusters that use it.

* When a pool quota is reached, librados operations now block indefinitely,
  the same way they do when the cluster fills up.  (Previously they would return
  -ENOSPC).  By default, a full cluster or pool will now block.  If your
  librados application can handle ENOSPC or EDQUOT errors gracefully, you can
  get error returns instead by using the new librados OPERATION_FULL_TRY flag
  (a usage sketch follows this list).

* The return code for librbd's rbd_aio_read and Image::aio_read API methods no
  longer returns the number of bytes read upon success.  Instead, it returns 0
  upon success and a negative value upon failure.

* 'ceph scrub', 'ceph compact' and 'ceph sync force' are now DEPRECATED.  Users
  should instead use 'ceph mon scrub', 'ceph mon compact' and
  'ceph mon sync force'.

* 'ceph mon_metadata' should now be used as 'ceph mon metadata'. There is no
  need to deprecate this command (same major release since it was first
  introduced).

* The `--dump-json` option of "osdmaptool" is replaced by `--dump json`.

* The commands of "pg ls-by-{pool,primary,osd}" and "pg ls" now take 
"recovering"
  instead of "recovery", to include the recovering pgs in the listed pgs.

Notable Changes since Hammer
----------------------------

* aarch64: add optimized version of crc32c (Yazen Ghannam, Steve Capper)
* auth: cache/reuse crypto lib key objects, optimize msg signature check (Sage 
Weil)
* auth: reinit NSS after fork() (#11128 Yan, Zheng)
* autotools: fix out of tree build (Krzysztof Kosinski)
* autotools: improve make check output (Loic Dachary)
* buffer: add invalidate_crc() (Piotr Dalek)
* buffer: fix zero bug (#12252 Haomai Wang)
* buffer: some cleanup (Michal Jarzabek)
* build: allow tcmalloc-minimal (Thorsten Behrens)
* build: C++11 now supported
* build: cmake: fix nss linking (Danny Al-Gaaf)
* build: cmake: misc fixes (Orit Wasserman, Casey Bodley)
* build: disable LTTNG by default (#11333 Josh Durgin)
* build: do not build ceph-dencoder with tcmalloc (#10691 Boris Ranto)
* build: fix junit detection on Fedora 22 (Ira Cooper)
* build: fix pg ref disabling (William A. Kennington III)
* build: fix ppc build (James Page)
* build: install-deps: misc fixes (Loic Dachary)
* build: install-deps.sh improvements (Loic Dachary)
* build: install-deps: support OpenSUSE (Loic Dachary)
* build: make_dist_tarball.sh (Sage Weil)
* build: many cmake improvements
* build: misc cmake fixes (Matt Benjamin)
* build: misc fixes (Boris Ranto, Ken Dreyer, Owen Synge)
* build: OSX build fixes (Yan, Zheng)
* build: remove rest-bench
* ceph-authtool: fix return code on error (Gerhard Muntingh)
* ceph-detect-init: added Linux Mint (Michal Jarzabek)
* ceph-detect-init: robust init system detection (Owen Synge)
* ceph-disk: ensure 'zap' only operates on a full disk (#11272 Loic Dachary)
* ceph-disk: fix zap sgdisk invocation (Owen Synge, Thorsten Behrens)
* ceph-disk: follow ceph-osd hints when creating journal (#9580 Sage Weil)
* ceph-disk: handle re-using ex

Re: Cannot start osd due to permission of journal raw device

2015-11-06 Thread Sage Weil
On Fri, 6 Nov 2015, Chen, Xiaoxi wrote:
> Hi,
> I tried  infernalis (version 9.1.0 
> (3be81ae6cf17fcf689cd6f187c4615249fea4f61)) but failed due to permission of 
> journal ,  the OSD  was upgraded from hammer(also true for newly created OSD).
>   I am using raw device as journal, this is because the default privilege of 
> raw block is root:disk. Changing the journal owner to ceph:ceph solve the 
> issue. Seems we can either: 
>   1. add ceph to "disk" group and run ceph-osd with --setuser ceph --setgroup 
> disk?
>   2. Require user to set the ownership of journal device to ceph:ceph is they 
> want to use raw as journal?  Maybe we can done this in ceph-disk.
> 
>Personally I would prefer the second one , what do you think?

The udev rules should be setting the jouranl device ownership to 
ceph:ceph.  IIRC there was a race in ceph-disk that could prevent this 
from happening in some cases but that is now fixed.  Can you try the 
infernalis branch?

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Specify omap path for filestore

2015-11-06 Thread Sage Weil
On Fri, 6 Nov 2015, Chen, Xiaoxi wrote:
> Can we simplify the case, given that CephFS and RGW both have dedicated
>  metadata pools?  Then we can solve this in deployment: using OSDs with the
> keyvaluestore backend (on SSD) for those pools should be a good fit.

I think that's a good approach for the current code (FileStore and/or 
KeyValueStore).

But for NewStore I'd like to solve this problem directly so that it can be 
used for both cases.  Rocksdb has a mechanism for moving lower level ssts 
to a slower device based on a total size threshold on the main device; 
hopefully this can be used so that we can give it both an ssd and hdd.
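
For reference, a hedged sketch of the rocksdb mechanism being referred to
(paths and sizes are invented; this is not NewStore code):

  #include <rocksdb/options.h>

  // rocksdb keeps SSTs on the first path until its target size is exceeded,
  // then spills lower (colder) levels to the next path.
  rocksdb::Options make_tiered_options() {
    rocksdb::Options opts;
    opts.db_paths.push_back(rocksdb::DbPath("/mnt/ssd/db",  10ULL << 30));  // ~10 GiB fast tier
    opts.db_paths.push_back(rocksdb::DbPath("/mnt/hdd/db", 500ULL << 30));  // overflow tier
    return opts;
  }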

sage

>  
> 
> Thus for New-Newstore, we just focus on data pool?
> 
>  
> 
> From: Sage Weil [mailto:s...@newdream.net]
> Sent: Friday, November 6, 2015 1:11 AM
> To: Ning Yao; Chen, Xiaoxi
> Cc: Xue, Chendi; Samuel Just; ceph-devel@vger.kernel.org
> Subject: Re: Specify omap path for filestore
> 
>  
> 
> Yes.  The hard part here in my view is the allocation of space between ssd
> and hdd when the amount of omap data can vary widely, from very little for
> rbd to the entire pool for rgw indexes or cephfs metadata.
> 
> sage
> 
>  
> 
> On November 5, 2015 11:33:48 AM GMT+01:00, Ning Yao 
> wrote:
> 
> Agreed! Actually in different use cases.
> But still not heavily loaded with SSD under small write use case, on
> this point, I may assume that newstore overlay would be much better?
> It seems that we can do more based on NewStore to let the store using
> the raw device directly based on onode_t, data_map (which can act as
> the inode in filesystem), so that we can achieve the whole HDD iops as
> real data without the interference of filesystem-journal and inode
> get/set.
> Regards
> Ning Yao
> 2015-11-04 23:19 GMT+08:00 Chen, Xiaoxi :
> 
>  Hi Ning,
>  Yes, we don't save any IO, and may even need more IO due to read amplification
> by LevelDB.  But the tradeoff is using SSD IOPS instead of HDD IOPS: IOPS/$$
> on an SSD (10K+ IOPS per $100) is two orders of magnitude cheaper than on an
> HDD (~100 IOPS per $100).
>  Some use cases:
>  1. When we have enough load, moving any load off the HDD definitely brings
> some help.  Omap is the thing that can easily be moved out to SSD; note that
> the omap workload is not intensive but random, which fits well on the SSD
> already working as journal.
>  2. We could even set max_inline_xattr to 0 to force all xattrs to omap (SSD),
> which will reduce the inode size so more inodes can be cached in memory.
> Again, the SSD is more than fast enough for this even while sharing with the
> journal.
>  3. In the RGW case, we will have some container objects with tons of omap;
> moving the omap to SSD is a clear optimization.
>  -Xiaoxi
> 
>  -Original Message-
>  From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>  ow...@vger.kernel.org] On Behalf Of Xue, Chendi
> 
> Sent: Wednesday, November 4, 2015 4:15 PM
>  To: Ning Yao
>  Cc: Samuel Just; ceph-devel@vger.kernel.org
>  Subject: RE: Specify omap path for filestore
>  Hi, Ning
>  Thanks for the advice, we did do the things you suggested in our performance
>  tuning work; actually, tuning the memory usage was the first thing we
>  tried.
>  Firstly, I should guess the omap to ssd benefit shows when we use quite
>  intensive workload, using 140 vm doing randwrite, 8 qd each, so we almost
>  drive each HDD to utility 95+%.
>  We hoped and tested on tune up the inode memory size and fd cache size,
>  since I believe if inode can be always hit in the memory which definitely
>  benefit more than using omap. Sadly our server only has 32G memory total.
>  Even when we set the xattr size to 65535 as originally configured and the fd
>  cache size to 10240, as I remember, we still gained only a little performance
>  and it could lead to OOM of the OSD, so that is why we came up with the
>  solution of moving omap out to an SSD device.
>  Another reason to move omap out is because it helps on performance
>  analysis, since omap uses keyvaluestore, and each rbd request causes one or
>  more 4k inode operation, which lead a frontend and backend throughput
>  ratio as 1: 5.8, which is not that easy to explain the 5.8.
>  Also we can get more randwrite iops if there is no seqwrite to one HDD
>  device: when an HDD handles randwrite iops plus some omap (leveldb) writes,
>  we can only get 175 write iops per HDD when util is nearly full; when the
>  HDD only handles randwrite without any omap writes, we can get 325 write
>  iops per HDD when util is nearly full.
>  System data please refer to below url
>  http://xuechendi.github.io/data/
>  omap on HDD is before mapping to other device omap on SSD is after
>  Best regards,
>  Chendi

Re: Request for Comments: Weighted Round Robin OP Queue

2015-11-06 Thread Sage Weil
On Thu, 5 Nov 2015, Robert LeBlanc wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> Thanks Gregory,
> 
> People are most likely busy and haven't had time to digest this and I
> may be expecting more excitement from it (I'm excited due to the
> results and probably also that such a large change still works). I'll
> keep working towards a PR, this was mostly proof of concept, now that
> there is some data I'll clean up the code.

I'm *very* excited about this.  This is something that almost every 
operator has problems with so it's very encouraging to see that switching 
up the queue has a big impact in your environment.

I'm just following up on this after a week of travel, so apologies if this 
is covered already, but did you compare this implementation to the 
original one with the same tunables?  I see somewhere that you had 
max_backfills=20 at some point, which is going to be bad regardless of the 
queue.

I also see that you changed the strict priority threshold from LOW to HIGH 
in OSD.cc; I'm curious how much of an impact was from this vs the queue 
implementation.
 
> I was thinking that a config option to choose the scheduler would be a
> good idea. In terms of the project what is the better approach: create
> a new template and each place the template class is instantiated
> select the queue, or perform the queue selection in the same template
> class, or something else I haven't thought of.

A config option would be nice, but I'd start by just cleaning up the code 
and putting it in a new class (WeightedRoundRobinPriorityQueue or 
whatever).  If we find that it's behaving better I'm not sure how much 
value we get from a tunable.  Note that there is one other user 
(msgr/simple/DispatchQueue) that we might also want to switch over at some 
point.. especially if this implementation is faster.
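
For illustration, a hedged sketch of the "new class behind a common interface"
idea (class and method names are approximations of the existing
PrioritizedQueue API, not actual Ceph code):

  // Both PrioritizedQueue<T,K> and a WeightedRoundRobinPriorityQueue<T,K>
  // could implement an interface like this, so the OSD can pick one at
  // startup from a config option (e.g. "osd op queue = wrr").
  template <typename T, typename K>
  class OpQueueInterface {
   public:
    virtual void enqueue(K cl, unsigned priority, unsigned cost, T item) = 0;
    virtual void enqueue_strict(K cl, unsigned priority, T item) = 0;
    virtual void enqueue_front(K cl, unsigned priority, unsigned cost, T item) = 0;
    virtual T dequeue() = 0;
    virtual unsigned length() const = 0;
    virtual bool empty() const { return length() == 0; }
    virtual ~OpQueueInterface() {}
  };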

Once it's cleaned up (remove commented out code, new class) put it up as a 
PR and we can review and get it through testing.

Thanks, Robert!
sage


> 
> Are there public teuthology-openstack systems that could be used for
> testing? I don't remember, I'll have to search back through the
> mailing list archives.
> 
> I appreciate all the direction as I've tried to figure this out.
> 
> Thanks,
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Wed, Nov 4, 2015 at 8:20 PM, Gregory Farnum  wrote:
> > On Wed, Nov 4, 2015 at 7:00 PM, Robert LeBlanc  wrote:
> >> -BEGIN PGP SIGNED MESSAGE-
> >> Hash: SHA256
> >>
> >> Thanks for your help on IRC Samuel. I think I found where I made a
> >> mistake. I'll do some more testing. So far with max_backfills=1 on
> >> spindles, the impact of setting an OSD out and in on a saturated
> >> cluster seems to be minimal. On my I/O graphs it is hard to tell where
> >> the OSD was out and in recovering. If I/O becomes blocked, it seems
> >> that they don't linger around long. All of the clients report getting
> >> about the same amount of work done with little variance so no one
> >> client is getting indefinitely blocked (or blocked for really long
> >> times) causing the results between clients to be skewed like before.
> >>
> >> So far this queue seems to be very positive. I'd hate to put a lot of
> >> work into getting this ready to merge if there is little interest in it
> >> (a lot of things to do at work and some other things I'd like to track
> >> down in the Ceph code as well). What are some of the next steps for
> >> something like this, meaning a pretty significant change to core code?
> >
> > Well, step one is to convince people it's worthwhile. Your performance
> > information and anecdotal evidence of client impact is a pretty good
> > start. For it to get merged:
> > 1) People will need to review it and verify it's not breaking anything
> > they can identify from code. Things are a bit constricted right now,
> > but this is pretty small and of high interest so I make no promises
> > for the core team but submitting a PR will be the way to start.
> > Getting positive buy-in from other contributors who are interested in
> > performance will also push it up the queue.
> > 2) There will need to be a lot of testing on something like this.
> > Everything has to pass a run of the RADOS suite. Unfortunately this is
> > a bad month for that as the lab is getting physically shipped around
> > in a few weeks, so if you can afford to make it happen with the
> > teuthology-openstack stuff that will accelerate the timeline a lot (we
> > will still need to run it ourselves but once it's passed externally we
> > can put it in a lot more test runs we expect to pass, instead of in a
> > bucket with others that will all get blocked on any one failure).
> > 3) For a new queuing system I suspect that rather than a direct merge
> > to default master, Sam will want to keep both in the code for a while
> > with a config value and run a lot of the nightlies on this one to
> > tease out any subtle races an

Re: civetweb upstream/downstream divergence

2015-11-04 Thread Sage Weil
On Wed, 4 Nov 2015, Ken Dreyer wrote:
> On Wed, Nov 4, 2015 at 1:25 PM, Ken Dreyer  wrote:
> > On Tue, Nov 3, 2015 at 4:22 AM, Sage Weil  wrote:
> >> On Tue, 3 Nov 2015, Nathan Cutler wrote:
> >>> IMHO the first step should be to get rid of the evil submodule. Arguably
> >>> the most direct path leading to this goal is to simply package up the
> >>> downstream civetweb (i.e. 1.6 plus all the downstream patches) for all
> >>> the supported distros. The resulting package would be Ceph-specific,
> >>> obviously, so it could be called "civetweb-ceph".
> >>>
> >>> Like Ken says, the upstreaming effort can continue in parallel.
> >>
> >> I'm not sure I agree.  As long as everything is not upstream and we are
> >> running a fork, what is the value of having it in a separate package?
> >> That just means all of the effort of managing the package dependency and
> >> making sure it is in all of the appropriate distros (and similar pain for
> >> those building manually) without any of the benefits (upstream bug fixes,
> >> etc.).
> >
> > I think there's value in getting the packaging bits ready ahead of
> > time and letting those "bake in" in Fedora/Ubuntu/Debian/SUSE while we
> > continue to merge Ceph's civetweb changes to Civetweb upstream.
> >
> > Now that Civetweb with RGW is mainstream, I'm looking forward to
> > eventually using a pre-built civetweb package that can shave time off
> > our Ceph Gitbuilder/Jenkins runs :)
> 
> Oh, I just re-read this, and Nathan's proposing to package up
> "civetweb-ceph" as a fork... I'm not sure that's worth it (at least,
> speaking for packaging in Fedora).
> 
> When I was talking about a "parallel effort", what I meant is that
> we'd get vanilla civetweb upstream into the distros, and we'd also
> continue to bundle civetweb in Ceph, until we can reliably use the
> upstream Civetweb package.

Ah, this sounds better to me.  There may be some work to build civetweb as 
a shared library (currently it's just a statically linked module) but 
probably not too bad.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


dm-clock queue

2015-11-04 Thread Sage Weil
Hi Gunna, Eric-

I wanted to make sure you were connected as we've talked to both of you 
independently about the new request queue in the OSD to support dm-clock 
and I want to make sure our efforts are coordinated.  I think the first 
goal is probably to implement something that works and performs well for 
just a few request classes (clients, recovery, scrub, snaptrim).  
Eventually we'll also need to determine if/how to do so for a large client 
count so that we can do client qos.  Just solving the first problem alone 
may be a big win, though: we hear lots of complaints about the effect of 
recovery on client io.

Anyway, just wanted to make sure you two were connected and kick off the 
conversation.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph encoding optimization

2015-11-04 Thread Sage Weil
On Wed, 4 Nov 2015, ??? wrote:
> hi, all:
> 
>  I am focused on the cpu usage of ceph now. I find that struct (such
> as pg_info_t, transaction and so on) encode and decode consume too
> much cpu.
> 
>  For now, we encode every member variable one by one, each ending up in
> encode_raw. When there are many members, we encode many
> times. But I think we could reduce this in some cases.
> 
>  For example, take struct A { int a; int b; int c }; ceph encodes
> int a, then int b, and finally int c. But in this case we could call
> bufferlist.append((char *)(&a), sizeof(A))
> because there are no padding bytes in this struct.
> 
>  Using the above optimization, the cpu usage of object_stat_sum_t
> encoding decreased from 0.5% to 0% (I could not see any with perf
> tools).
> 
>  This is only one case; I think we could do similar optimizations for
> other structs. We should pay attention to the padding in each
> struct.

We have to be careful because we need to ensure that the encoding 
is little-endian.  I think the way to do this is to define a struct like

struct foo_t {
  __le64 a;
  __le64 b;
  __le64 c;
  ...
};

and just make a foo_t *p that points into the buffer.  There were some 
patches that did this that came out of the Portland hackathon, but I'm not 
sure where they are... Josh, do you remember?
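
To make the pointer-into-the-buffer idea concrete, a hedged sketch using the
foo_t above (assuming bufferlist's usual append/rebuild/c_str interfaces; this
is not the hackathon patch itself):

  // Encode: append the raw little-endian struct in one shot.
  void encode_foo(const foo_t& f, ceph::buffer::list& bl) {
    bl.append(reinterpret_cast<const char*>(&f), sizeof(f));
  }

  // Decode: make the buffer contiguous and point a foo_t* at it -- no
  // per-member copies.  Only valid while bl stays alive and unmodified.
  const foo_t* decode_foo(ceph::buffer::list& bl) {
    bl.rebuild();   // coalesce into a single contiguous buffer
    return reinterpret_cast<const foo_t*>(bl.c_str());
  }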

FWIW I think pg_stat_t (and friends) is a good first start since it is 
expensive and part of the MOSDOp and MOSDRepOp.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: civetweb upstream/downstream divergence

2015-11-03 Thread Sage Weil
On Tue, 3 Nov 2015, Nathan Cutler wrote:
> IMHO the first step should be to get rid of the evil submodule. Arguably
> the most direct path leading to this goal is to simply package up the
> downstream civetweb (i.e. 1.6 plus all the downstream patches) for all
> the supported distros. The resulting package would be Ceph-specific,
> obviously, so it could be called "civetweb-ceph".
> 
> Like Ken says, the upstreaming effort can continue in parallel.

I'm not sure I agree.  As long as everything is not upstream and we are 
running a fork, what is the value of having it in a separate package?  
That just means all of the effort of managing the package dependency and 
making sure it is in all of the appropriate distros (and similar pain for 
those building manually) without any of the benefits (upstream bug fixes, 
etc.).

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


ordered writeback for rbd client cache

2015-11-02 Thread Sage Weil
Just found this:


https://www.usenix.org/conference/fast13/technical-sessions/presentation/koller

which should be helpful in constructing a persistent client-side writeback 
cache for RBD that preserves consistency.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: why we use two ObjectStore::Transaction in ReplicatedBackend::submit_transaction?

2015-11-01 Thread Sage Weil
On Sun, 1 Nov 2015, Sage Weil wrote:
> On Sun, 1 Nov 2015, ??? wrote:
> > Yes, I think so.
> > keeping them separate and pass them to
> > ObjectStore::queue_transactions() would increase the time on
> > transaction encode process and exhaust more cpu.
> > 
> > The transaction::append holds 0.8% cpu on my environment.
> > The transaction encoding is also really a bottleneck which process
> > holds 1.8% cpu on my environment.
> 
> Where is the append() caller you're looking at?  I'm not seeing it.

Oh, I see:  https://github.com/ceph/ceph/pull/6439

This makes sense to me.
sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: why we use two ObjectStore::Transaction in ReplicatedBackend::submit_transaction?

2015-11-01 Thread Sage Weil
On Sun, 1 Nov 2015, ??? wrote:
> Yes, I think so.
> keeping them separate and pass them to
> ObjectStore::queue_transactions() would increase the time on
> transaction encode process and exhaust more cpu.
> 
> The transaction::append holds 0.8% cpu on my environment.
> The transaction encoding is also really a bottleneck which process
> holds 1.8% cpu on my environment.

Where is the append() caller you're looking at?  I'm not seeing it.

sage


> 
> 2015-11-01 4:42 GMT+08:00 Sage Weil :
> > On Sat, 31 Oct 2015, Ning Yao wrote:
> >> Yeah, since issue_op is called before log_operation, we may consider
> >> to reuse op_t after sent encoded op_t to the wire. local_t.append(),
> >> at least, does copy the op_bl in op_t transaction and we may avoid
> >> this memory copy, and if we can avoid this append operation as well as
> >> in sub_op_modify_impl(), it, at least, improves the performance 1%~2%
> >> under my testing environment using ssd as Filestore backend.
> >> The only difference we find in this path is that local_t should be
> >> done first, but actually it seems that the order of the transaction is
> >> not quite important. If so, we may refactor and improve this?
> >
> > I seem to recall that in the EC case the order does matter (I had switched
> > the append order when trying to fix this before but had to revert because
> > things broke).
> >
> > And I'm a bit nervous about re-using local_t and relying on the send vs
> > submit timing.  Is it not practical to keep them separate and
> > pass them both down to ObjectStore::queue_transactions()?
> >
> > sage
> 
> 
> 
> -- 
> Regards,
> xinze
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: why we use two ObjectStore::Transaction in ReplicatedBackend::submit_transaction?

2015-10-31 Thread Sage Weil
On Sat, 31 Oct 2015, Ning Yao wrote:
> Yeah, since issue_op is called before log_operation, we may consider
> to reuse op_t after sent encoded op_t to the wire. local_t.append(),
> at least, does copy the op_bl in op_t transaction and we may avoid
> this memory copy, and if we can avoid this append operation as well as
> in sub_op_modify_impl(), it, at least, improves the performance 1%~2%
> under my testing environment using ssd as Filestore backend.
> The only difference we find in this path is that local_t should be
> done first, but actually it seems that the order of the transaction is
> not quite important. If so, we may refactor and improve this?

I seem to recall that in the EC case the order does matter (I had switched 
the append order when trying to fix this before but had to revert because 
things broke).

And I'm a bit nervous about re-using local_t and relying on the send vs 
submit timing.  Is it not practical to keep them separate and 
pass them both down to ObjectStore::queue_transactions()?

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: why we use two ObjectStore::Transaction in ReplicatedBackend::submit_transaction?

2015-10-31 Thread Sage Weil
On Sat, 31 Oct 2015, ??? wrote:
> hi, all:
> 
> There are two ObjectStore::Transaction in
> ReplicatedBackend::submit_transaction, one is op_t and the other one
> is local_t. Is there some
> critical logic we should consider?
> 
> If we could reuse variable op_t it would be great. Because it is
> expensive to call local_t.append(*op_t).
> 
> There are similar logic in ReplicatedBackend::sub_op_modify_impl.

The local_t items are only applied locally; the op_t items are encoded and 
sent over the wire to the replicas.

If append() is expensive we should just refactor to avoid that.  IIRC I 
got partway down this path but apparently didn't finish.  The ObjectStore 
interface takes a list of transactions to apply, so I think it's just a 
matter of refactoring the interfaces a bit...?
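
A hedged sketch of that refactor (the queue_transactions() overload and
argument order are approximate, and which transaction must come first is
exactly the open question above):

  #include <list>
  #include "os/ObjectStore.h"

  // Submit both transactions without merging them, avoiding the
  // local_t.append(*op_t) copy of op_t's encoded buffer.
  void submit_both(ObjectStore *store, ObjectStore::Sequencer *osr,
                   ObjectStore::Transaction *local_t,
                   ObjectStore::Transaction *op_t,
                   Context *onreadable, Context *oncommit) {
    std::list<ObjectStore::Transaction*> tls;
    tls.push_back(local_t);   // local-only items (log entries, etc.)
    tls.push_back(op_t);      // the op that is also encoded for the wire
    store->queue_transactions(osr, tls, onreadable, oncommit);
  }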

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] Ceph fix for 4.3

2015-10-30 Thread Sage Weil
Hi Linus,

Please pull the following RBD fix from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This sets the stable pages flag on the RBD block device when we have CRCs 
enabled.  (This is necessary since the default assumption for block 
devices changed in 3.9.)

Thanks!
sage



Ronny Hegewald (1):
  rbd: require stable pages if message data CRCs are enabled

 drivers/block/rbd.c | 3 +++
 1 file changed, 3 insertions(+)
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fix OP dequeuing order

2015-10-28 Thread Sage Weil
On Wed, 28 Oct 2015, Robert LeBlanc wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> I created a pull request to fix an op dequeuing order problem. I'm not
> sure if I need to mention it here.
> 
> https://github.com/ceph/ceph/pull/6417

Wow, good catch.  Have you found that this materially impacts the behavior 
in your cluster?

sage


> 
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWMVh6CRDmVDuy+mK58QAAztQP/385BOI8AH2uEJhN8pQ4
> QnAJxRy4HceWzjfAUulqNbbiD1scHZMU7LDW1GtsXfOZzmndTnJSBrR4+aHq
> F7py9zgXcxXH4uTAoILbRzkCF3rWdmkeh1/m5aY4LqmhE2N/O/LLOmDUe2BT
> XkQgZ9sROzY9pSj6pjA2vuv7k2u1SWtF3Ky14Hll3LHjqJibXoXYy+ik7lOP
> lRUoAY08Yf+c/Ag/Yy7CLGgIk/y6mdaJZPd2PCaVsKFa55NJAlYv0PHJKX0j
> XkSAY10MednMX6N+QL8XAq+yiAd//UADfCNhxHkP84YsPPCpNeS1OcoF6WGG
> g5H8uMK84kZCk37ummW/ANg9WNnO3hN2j22r9ezA+4GfxqKibT4lEMba6h88
> i5L3rQwWmM0cdpjS9plH1yUiPP2DexJV8PaiAIVVMAkw+AC0Xb/nUXKX6u5+
> YU744kSjtscN95Caf72V6HirB/uEU4sm+4lUuUBHzTcvau/r9WUHezwvmUiH
> HHL9bSU5TJ4jXvQhDEBYKbflTzLNKjXPcp1PagN2P9ZWQvNaxrQm32iB84DW
> 6jLEArFX10kE3eZ8IqoBikw5d+y3YtnuJ1oAIkfzj1ANofm37VKcQY/Wfrjw
> eke0nR4QBuN6SibbPXqIsjjIWZdo/jCgOCylNONXCFn9Qp08/7UJMQtzHk/1
> xRRp
> =g+NJ
> -END PGP SIGNATURE-
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Newstore] FIO Reads from Client causes OSD *** Caught signal (Aborted) **

2015-10-28 Thread Sage Weil
Hi Vish-

This is not too surprising, but I am inclined to ignore it for now: I'm in 
the midst of a major rewrite anyway to use a raw block device instead of 
the file system.

sage

On Wed, 28 Oct 2015, Vish (Vishwanath) Maram-SSI wrote:

> Hi,
> 
> We are observing a crash of OSD whenever we run FIO Read's from a client. 
> Setup is very simple and explained as below:
> 
> 1. One OSD with Ceph Version "ceph version 9.1.0-420-ge3921a8 
> (e3921a8396870be4a38ce1f1b6c35bc0829dbb68)", pulled the code from GIT and 
> compiled/Installed.
> 2. One Client with same version of CEPH.
> 3. FIO Version - fio-2.2.10-16-gd223
> 4. Ceph Conf as given below
> 5. Crash log details from log file as below
> 6. FIO Script as given below
> 
> CEPH Conf -
> 
> [global]
> fsid = 9eda02e2-04b7-4eed-a85a-8471ea51528d
> mon_initial_members = msl-dsma-spoc08
> mon_host = 10.10.10.190
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> auth_supported = none
> 
> #Needed for Newstore 
> osd_objectstore = newstore
> enable experimental unrecoverable data corrupting features = newstore, rocksdb
> newstore_backend = rocksdb
> 
> #Debug - Start Removed for  now to debug
> #newstore_max_dir_size = 4096
> #newstore_sync_io = true
> #newstore_sync_transaction = true
> #newstore_sync_submit_transaction = true
> #newstore_sync_wal_apply = true
> #newstore_overlay_max = 0
> #Debug - End
> 
> #Needed for Newstore
> 
> filestore_xattr_use_omap = true
> 
> osd pool default size = 1
> rbd cache = false
> 
> 
> debug_lockdep = 0/0
> debug_context = 0/0
> debug_crush = 0/0
> debug_buffer = 0/0
> debug_timer = 0/0
> debug_filer = 0/0
> debug_objecter = 0/0
> debug_rados = 0/0
> debug_rbd = 0/0
> debug_journaler = 0/0
> debug_objectcatcher = 0/0
> debug_client = 0/0
> debug_osd = 0/0
> debug_optracker = 0/0
> debug_objclass = 0/0
> debug_filestore = 0/0
> debug_journal = 0/0
> debug_ms = 0/0
> debug_monc = 0/0
> debug_tp = 0/0
> debug_auth = 0/0
> debug_finisher = 0/0
> debug_heartbeatmap = 0/0
> debug_perfcounter = 0/0
> debug_asok = 0/0
> debug_throttle = 0/0
> debug_mon = 0/0
> debug_paxos = 0/0
> debug_rgw = 0/0
> osd_op_threads = 5
> osd_op_num_threads_per_shard = 1
> osd_op_num_shards = 25
> #osd_op_num_sharded_pool_threads = 25
> filestore_op_threads = 4
> ms_nocrc = true
> filestore_fd_cache_size = 64
> filestore_fd_cache_shards = 32
> cephx sign messages = false
> cephx require signatures = false
> ms_dispatch_throttle_bytes = 0
> throttler_perf_counter = false
> 
> [osd]
> osd_client_message_size_cap = 0
> osd_client_message_cap = 0
> osd_enable_op_tracker = false 
> 
> Crash details from the log:
>   -194> 2015-10-28 10:54:40.792957 7f15862e8700  2 
> newstore(/var/lib/ceph/osd/ceph-0) _do_wal_transaction prepared aio 
> 0x7f15ba915510
>   -193> 2015-10-28 10:54:40.792959 7f15862e8700  2 
> newstore(/var/lib/ceph/osd/ceph-0) _do_wal_transaction prepared aio 
> 0x7f15ba916590
>   -192> 2015-10-28 10:54:40.792962 7f15862e8700  2 
> newstore(/var/lib/ceph/osd/ceph-0) _do_wal_transaction prepared aio 
> 0x7f15ba914990
>   -191> 2015-10-28 10:54:40.792965 7f15862e8700  2 
> newstore(/var/lib/ceph/osd/ceph-0) _do_wal_transaction prepared aio 
> 0x7f15ba916490
>   -190> 2015-10-28 10:54:40.792968 7f15862e8700  2 
> newstore(/var/lib/ceph/osd/ceph-0) _do_wal_transaction prepared aio 
> 0x7f15ba916090
>   -189> 2015-10-28 10:54:40.792971 7f15862e8700  2 
> newstore(/var/lib/ceph/osd/ceph-0) _do_wal_transaction prepared aio 
> 0x7f15ba915c10
>   -188> 2015-10-28 10:54:40.792975 7f15862e8700  2 
> newstore(/var/lib/ceph/osd/ceph-0) _do_wal_transaction prepared aio 
> 0x7f15ba917190
>   -187> 2015-10-28 10:54:40.792977 7f15862e8700  2 
> newstore(/var/lib/ceph/osd/ceph-0) _do_wal_transaction prepared aio 
> 0x7f15ba916810
>   -186> 2015-10-28 10:54:40.792980 7f15862e8700  2 
> newstore(/var/lib/ceph/osd/ceph-0) _do_wal_transaction prepared aio 
> 0x7f15ba914790
>   -185> 2015-10-28 10:54:40.792983 7f15862e8700  2 
> newstore(/var/lib/ceph/osd/ceph-0) _do_wal_transaction prepared aio 
> 0x7f15ba915e10
>   -184> 2015-10-28 10:54:40.792986 7f15862e8700  2 
> newstore(/var/lib/ceph/osd/ceph-0) _do_wal_transaction prepared aio 
> 0x7f15ba915f10
>   -183> 2015-10-28 10:54:40.792988 7f15862e8700  2 
> newstore(/var/lib/ceph/osd/ceph-0) _do_wal_transaction prepared aio 
> 0x7f15ba915f90
>  -182> 2015-10-28 10:54:40.792992 7f15862e8700  2 
> newstore(/var/lib/ceph/osd/ceph-0) _do_wal_transaction prepared aio 
> 0x7f15ba914510
>  ...
>  
>    -10> 2015-10-28 10:55:45.240480 7f1577366700  5 
> newstore(/var/lib/ceph/osd/ceph-0) queue_transactions existing 0x7f15a0ac1180 
> osr(1.b1 0x7f15a025acf0)
>     -9> 2015-10-28 10:55:45.240830 7f1577366700  5 
> newstore(/var/lib/ceph/osd/ceph-0) queue_transactions existing 0x7f15a0ac1180 
> osr(1.b1 0x7f15a025acf0)
>     -8> 2015-10-28 10:55:45.241135 7f1577366700  5 
> newstore(/var/lib/ceph/osd/ceph-0) queue_transactions existing 0x7f15a0ac1180 
> osr(1.b1 0x7f15a025acf0)
>

Re: pg scrub check problem

2015-10-28 Thread Sage Weil
On Wed, 28 Oct 2015, changtao381 wrote:
> Hi,
> 
> I'm testing the deep-scrub function of ceph.  The test steps are below:
> 
> 1) I put an object on ceph using the command:
>  rados put test.txt test.txt -p testpool
> 
> The size of testpool is 3, so there are three replicas on three osds:
> 
> osd.0:   /data1/ceph_data/osd.0/current/1.0_head/test.txt__head_8B0B6108__1
> osd.1:   /data2/ceph_data/osd.1/current/1.0_head/test.txt__head_8B0B6108__1
> osd.2:   /data3/ceph_data/osd.2/current/1.0_head/test.txt__head_8B0B6108__1
> 
> 2) I modified the content of one replica on osd.0 using vim editor directly 
> on disk
> 
> 3) I run the command:
>  ceph pg deep-scrub 1.0
> 
> and expect it to detect the inconsistency, but it fails: it doesn't
> find the error.
> Why?

Because you *just* wrote the object, and the FileStore caches open file
handles.  Vim renames a new inode over the old one so the open inode is 
untouched.

If you restart the osd and then scrub you'll see the error.

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: values of "ceph daemon osd.x perf dump objecters " are zero

2015-10-28 Thread Sage Weil
Objecter is the client side, but you're dumping stats on the osd.  The 
only time it is used as a client there is with cache tiering.

sage

On Wed, 28 Oct 2015, Libin Wu wrote:

> Hi, all
> 
> As I understand it, the command "ceph daemon osd.x perf dump objecters" should
> output the perf data of osdc (librados). But when I use this command, why are
> all of the values zero except map_epoch and map_inc? Following is
> the result (an fio test with the rbd ioengine is running against the cluster):
> 
> 
> $ sudo ceph daemon osd.10 perf dump objecter
> {
>     "objecter": {
>         "op_active": 0,
>         "op_laggy": 0,
>         "op_send": 0,
>         "op_send_bytes": 0,
>         "op_resend": 0,
>         "op_ack": 0,
>         "op_commit": 0,
>         "op": 0,
>         "op_r": 0,
>         "op_w": 0,
>         "op_rmw": 0,
>         "op_pg": 0,
>         "osdop_stat": 0,
>         "osdop_create": 0,
>         "osdop_read": 0,
>         "osdop_write": 0,
>         "osdop_writefull": 0,
>         "osdop_append": 0,
>         "osdop_zero": 0,
>         "osdop_truncate": 0,
>         "osdop_delete": 0,
>         "osdop_mapext": 0,
>         "osdop_sparse_read": 0,
>         "osdop_clonerange": 0,
>         "osdop_getxattr": 0,
>         "osdop_setxattr": 0,
>         "osdop_cmpxattr": 0,
>         "osdop_rmxattr": 0,
>         "osdop_resetxattrs": 0,
>         "osdop_tmap_up": 0,
>         "osdop_tmap_put": 0,
>         "osdop_tmap_get": 0,
>         "osdop_call": 0,
>         "osdop_watch": 0,
>         "osdop_notify": 0,
>         "osdop_src_cmpxattr": 0,
>         "osdop_pgls": 0,
>         "osdop_pgls_filter": 0,
>         "osdop_other": 0,
>         "linger_active": 0,
>         "linger_send": 0,
>         "linger_resend": 0,
>         "linger_ping": 0,
>         "poolop_active": 0,
>         "poolop_send": 0,
>         "poolop_resend": 0,
>         "poolstat_active": 0,
>         "poolstat_send": 0,
>         "poolstat_resend": 0,
>         "statfs_active": 0,
>         "statfs_send": 0,
>         "statfs_resend": 0,
>         "command_active": 0,
>         "command_send": 0,
>         "command_resend": 0,
>         "map_epoch": 2180,
>         "map_full": 0,
>         "map_inc": 83,
>         "osd_sessions": 0,
>         "osd_session_open": 0,
>         "osd_session_close": 0,
>         "osd_laggy": 0
>     }
> }
> 
> Anyone could tell why?
> 
> Thanks!
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


v0.94.5 Hammer released

2015-10-26 Thread Sage Weil
This Hammer point release fixes a critical regression in librbd that can 
cause Qemu/KVM to crash when caching is enabled on images that have been 
cloned.

All v0.94.4 Hammer users are strongly encouraged to upgrade.

Notable Changes
===============

* librbd: potential assertion failure during cache read (#13559, Jason 
  Dillaman)
* osd: osd/ReplicatedPG: remove stray debug line (#13455, Sage Weil)
* tests: qemu workunit refers to apt-mirror.front.sepia.ceph.com (#13420, 
  Yuan Zhou)

For the complete changelog, see

  http://docs.ceph.com/docs/master/_downloads/v0.94.5.txt

Getting Ceph


* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-0.94.5.tar.gz
* For packages, see http://ceph.com/docs/master/install/get-packages
* For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: why package ceph-fuse needs packages ceph?

2015-10-26 Thread Sage Weil
On Mon, 26 Oct 2015, Jaze Lee wrote:
> Hello,
> I think ceph-fuse is just a client, so why does it need the ceph package?
> I found that when I install ceph-fuse, it also installs the ceph package.
> But when I install ceph-common, it does not install the ceph package.
> 
> Maybe ceph-fuse is not just a ceph client?
> 

It is, and the Debian packaging works as expected.  This is a simple 
error in the spec file.  I'll submit a patch.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

