Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Sage Weil
On Tue, 26 Sep 2017, Nigel Williams wrote:
> On 26 September 2017 at 08:11, Mark Nelson  wrote:
> > The WAL should never grow larger than the size of the buffers you've
> > specified.  It's the DB that can grow and is difficult to estimate both
> > because different workloads will cause different numbers of extents and
> > objects, but also because rocksdb itself causes a certain amount of
> > space-amplification due to a variety of factors.
> 
> Ok, I was confused about whether both types could spill. Within Bluestore,
> does it simply block if the WAL hits 100%?

It never blocks; it will always just spill over onto the next fastest 
device (wal -> db -> main).  Note that there is no value to a db partition 
if it is on the same device as the main partition.
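
For reference, a minimal check of where an OSD's BlueStore devices actually
live (default Luminous paths; osd.0 is only a placeholder):

ls -l /var/lib/ceph/osd/ceph-0/block*
# block, block.db and (if present) block.wal are symlinks; if block.db
# resolves to the same physical device as block, the separate db partition
# adds nothing, per the note above.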

> Would a drastic (quick) action to correct a too-small-DB-partition
> (impacting performance) be to destroy the OSD and rebuild it with a
> larger DB partition?

That's the easiest!
sage
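
A rough sketch of that rebuild with Luminous-era tooling (the OSD id and
device names are placeholders, not taken from the thread):

ceph osd out 12
systemctl stop ceph-osd@12
ceph osd purge 12 --yes-i-really-mean-it     # drops the OSD from CRUSH, auth and the osdmap
ceph-volume lvm zap /dev/sdb                 # wipe the old data device
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
# recreate with a larger db partition; let backfill finish before the next OSD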


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Nigel Williams
On 26 September 2017 at 08:11, Mark Nelson  wrote:
> The WAL should never grow larger than the size of the buffers you've
> specified.  It's the DB that can grow and is difficult to estimate both
> because different workloads will cause different numbers of extents and
> objects, but also because rocksdb itself causes a certain amount of
> space-amplification due to a variety of factors.

Ok, I was confused about whether both types could spill. Within Bluestore,
does it simply block if the WAL hits 100%?

Would a drastic (quick) action to correct a too-small-DB-partition
(impacting performance) be to destroy the OSD and rebuild it with a
larger DB partition?


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Mark Nelson



On 09/25/2017 05:02 PM, Nigel Williams wrote:

On 26 September 2017 at 01:10, David Turner  wrote:

If they are on separate
devices, then you need to make it as big as you need to ensure that it
won't spill over (or if it does that you're ok with the degraded performance
while the db partition is full).  I haven't come across an equation to judge
what size should be used for either partition yet.


Is it the case that only the WAL will spill if there is a backlog
clearing entries into the DB partition? So the WAL's fill-mark
oscillates but the DB is going to steadily grow (depending on the
previously mentioned factors of "...extents, checksums, RGW bucket
indices, and potentially other random stuff").


The WAL should never grow larger than the size of the buffers you've 
specified.  It's the DB that can grow and is difficult to estimate both 
because different workloads will cause different numbers of extents and 
objects, but also because rocksdb itself causes a certain amount of 
space-amplification due to a variety of factors.




Is there an indicator that can be monitored to show that a spill is occurring?


I think there's a message in the logs, but beyond that I don't remember 
if we added any kind of indication in the user tools.  At one point I 
think I remember Sage mentioning he wanted to add something to ceph df.
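
One place this is already visible (an observation, not something confirmed in
the thread) is the bluefs section of the OSD perf counters; a non-zero
slow_used_bytes suggests RocksDB has spilled onto the main device. osd.0 is a
placeholder:

ceph daemon osd.0 perf dump | python -m json.tool | grep -E 'db_used_bytes|slow_used_bytes'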





Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Nigel Williams
On 26 September 2017 at 01:10, David Turner  wrote:
> If they are on separate
> devices, then you need to make it as big as you need to ensure that it
> won't spill over (or if it does that you're ok with the degraded performance
> while the db partition is full).  I haven't come across an equation to judge
> what size should be used for either partition yet.

Is it the case that only the WAL will spill if there is a backlog
clearing entries into the DB partition? So the WAL's fill-mark
oscillates but the DB is going to steadily grow (depending on the
previously mentioned factors of "...extents, checksums, RGW bucket
indices, and potentially other random stuff").

Is there an indicator that can be monitored to show that a spill is occurring?


[ceph-users] question regarding filestore on Luminous

2017-09-25 Thread Alan Johnson
I am trying to compare FileStore performance against BlueStore. With Luminous 
12.2.0, BlueStore is working fine but if I try to create a FileStore volume 
with a separate journal using Jewel-like syntax - "ceph-deploy osd create 
:sdb:nvme0n1", device nvme0n1 is ignored and it sets up two partitions 
(similar to BlueStore) as shown below:
Number  Start   End     Size    File system  Name        Flags
 1      1049kB  106MB   105MB   xfs          ceph data
 2      106MB   6001GB  6001GB               ceph block

Is this expected behavior or is FileStore no longer supported with Luminous?



Re: [ceph-users] CephFS Luminous | MDS frequent "replicating dir" message in log

2017-09-25 Thread Gregory Farnum
This is supposed to indicate that the directory is hot and being replicated
to another active MDS to spread the load.

But skimming the code it looks like maybe there's a bug and this is not
blocked on the multiple-active stuff it's supposed to be. (Though I don't
anticipate any issues for you.) Patrick, Zheng, any thoughts?
-Greg

On Mon, Sep 25, 2017 at 4:59 AM David  wrote:

> Hi All
>
> Since upgrading a cluster from Jewel to Luminous I'm seeing a lot of the
> following line in my ceph-mds log (path name changed by me - the messages
> refer to different dirs)
>
> 2017-09-25 12:47:23.073525 7f06df730700  0 mds.0.bal replicating dir [dir
> 0x1003e5b /path/to/dir/ [2,head] auth v=50477 cv=50465/50465 ap=0+3+4
> state=1610612738|complete f(v0 m2017-03-27 11:04:17.935529 51=19+32)
> n(v3297 rc2017-09-25 12:46:13.379651 b14050737379 13086=10218+2868)/n(v3297
> rc2017-09-25 12:46:13.052651 b14050862881 13083=10215+2868) hs=51+0,ss=0+0
> dirty=1 | child=1 dirty=1 waiter=0 authpin=0 0x7f0707298000] pop 13139 ..
> rdp 191 adj 0
>
> I've not had any issues reported, just interested to know why I'm suddenly
> seeing a lot of these messages, the client versions and workload hasn't
> changed. Anything to be concerned about?
>
> Single MDS with standby-replay
> Luminous 12.2.0
> Kernel clients: 3.10.0-514.2.2.el7.x86_64
>
> Thanks,
> David
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] can't figure out why I have HEALTH_WARN in luminous

2017-09-25 Thread Michael Kuriger
Thanks!!  I did see that warning, but it never occurred to me I need to disable 
it.

 
Mike Kuriger 
Sr. Unix Systems Engineer 
T: 818-649-7235 M: 818-434-6195 
 
 

On 9/23/17, 5:52 AM, "John Spray"  wrote:

On Fri, Sep 22, 2017 at 6:48 PM, Michael Kuriger  wrote:
> I have a few running ceph clusters.  I built a new cluster using luminous,
> and I also upgraded a cluster running hammer to luminous.  In both cases, I
> have a HEALTH_WARN that I can't figure out.  The cluster appears healthy
> except for the HEALTH_WARN in overall status.  For now, I’m monitoring
> health from the “status” instead of “overall_status” until I can find out
> what the issue is.
>
>
>
> Any ideas?  Thanks!

There is a setting called mon_health_preluminous_compat_warning (true
by default), that forces the old overall_status field to WARN, to
create the awareness that your script is using the old health output.

If you do a "ceph health detail -f json" you'll see an explanatory message.

We should probably have made that visible in "status" too (or wherever
we output the overall_status as warning like this) -

https://github.com/ceph/ceph/pull/17930

John
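
A small sketch of both workarounds mentioned above; the jq invocation is
illustrative and assumes your monitoring can read the new-style "status" field:

ceph status -f json | jq -r '.health.status'    # reads "status" rather than "overall_status"
# and, to silence the compat warning, on the monitors (ceph.conf, then restart the mons):
# [mon]
# mon_health_preluminous_compat_warning = false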

>
>
> # ceph health detail
>
> HEALTH_OK
>
>
>
> # ceph -s
>
>   cluster:
>
> id: 11d436c2-1ae3-4ea4-9f11-97343e5c673b
>
> health: HEALTH_OK
>
>
>
> # ceph -s --format json-pretty
>
>
>
> {
>
> "fsid": "11d436c2-1ae3-4ea4-9f11-97343e5c673b",
>
> "health": {
>
> "checks": {},
>
> "status": "HEALTH_OK",
>
> "overall_status": "HEALTH_WARN"
>
>
>
> 
>
>
>
>
>
>
>
> Mike Kuriger
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>




Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread David Turner
db/wal partitions are per OSD.  DB partitions need to be made as big as you
need them.  If they run out of space, they will fall back to the block
device.  If the DB and block are on the same device, then there's no reason
to partition them and figure out the best size.  If they are on separate
devices, then you need to make it as big as you need to ensure that it
won't spill over (or if it does that you're ok with the degraded
performance while the db partition is full).  I haven't come across an
equation to judge what size should be used for either partition yet.
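
If you do want ceph-disk to carve fixed-size partitions for you, the sizes can
be pinned in ceph.conf before the OSDs are created (the values below are
placeholders, not a recommendation; the defaults are quoted further down the
thread):

[osd]
bluestore_block_db_size  = 32212254720   # 30 GiB db partition per OSD
bluestore_block_wal_size = 1073741824    # 1 GiB wal partition per OSD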

On Mon, Sep 25, 2017 at 10:53 AM Dietmar Rieder 
wrote:

> On 09/25/2017 02:59 PM, Mark Nelson wrote:
> > On 09/25/2017 03:31 AM, TYLin wrote:
> >> Hi,
> >>
> >> To my understanding, the bluestore write workflow is
> >>
> >> For normal big write
> >> 1. Write data to block
> >> 2. Update metadata to rocksdb
> >> 3. Rocksdb write to memory and block.wal
> >> 4. Once reach threshold, flush entries in block.wal to block.db
> >>
> >> For overwrite and small write
> >> 1. Write data and metadata to rocksdb
> >> 2. Apply the data to block
> >>
> >> Seems we don’t have a formula or suggestion to the size of block.db.
> >> It depends on the object size and number of objects in your pool. You
> >> can just give big partition to block.db to ensure all the database
> >> files are on that fast partition. If block.db full, it will use block
> >> to put db files, however, this will slow down the db performance. So
> >> give db size as much as you can.
> >
> > This is basically correct.  What's more, it's not just the object size,
> > but the number of extents, checksums, RGW bucket indices, and
> > potentially other random stuff.  I'm skeptical how well we can estimate
> > all of this in the long run.  I wonder if we would be better served by
> > just focusing on making it easy to understand how the DB device is being
> > used, how much is spilling over to the block device, and make it easy to
> > upgrade to a new device once it gets full.
> >
> >>
> >> If you want to put wal and db on same ssd, you don’t need to create
> >> block.wal. It will implicitly use block.db to put wal. The only case
> >> you need block.wal is that you want to separate wal to another disk.
> >
> > I always make explicit partitions, but only because I (potentially
> > illogically) like it that way.  There may actually be some benefits to
> > using a single partition for both if sharing a single device.
>
> is this "Single db/wal partition" then to be used for all OSDs on a node
> or do you need to create a separate "Single db/wal partition" for each
> OSD  on the node?
>
> >
> >>
> >> I’m also studying bluestore, this is what I know so far. Any
> >> correction is welcomed.
> >>
> >> Thanks
> >>
> >>
> >>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
> >>>  wrote:
> >>>
> >>> I asked the same question a couple of weeks ago. No response I got
> >>> contradicted the documentation but nobody actively confirmed the
> >>> documentation was correct on this subject, either; my end state was
> >>> that I was relatively confident I wasn't making some horrible mistake
> >>> by simply specifying a big DB partition and letting bluestore work
> >>> itself out (in my case, I've just got HDDs and SSDs that were
> >>> journals under filestore), but I could not be sure there wasn't some
> >>> sort of performance tuning I was missing out on by not specifying
> >>> them separately.
> >>>
> >>> Rich
> >>>
> >>> On 21/09/17 20:37, Benjeman Meekhof wrote:
>  Some of this thread seems to contradict the documentation and confuses
>  me.  Is the statement below correct?
> 
>  "The BlueStore journal will always be placed on the fastest device
>  available, so using a DB device will provide the same benefit that the
>  WAL device would while also allowing additional metadata to be stored
>  there (if it will fit)."
> 
> 
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
> 
> 
>   it seems to be saying that there's no reason to create separate WAL
>  and DB partitions if they are on the same device.  Specifying one
>  large DB partition per OSD will cover both uses.
> 
>  thanks,
>  Ben
> 
>  On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
>   wrote:
> > On 09/21/2017 05:03 PM, Mark Nelson wrote:
> >>
> >> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
> >>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>  On 2017-09-21 07:56, Lazuardi Nasution wrote:
> 
> > Hi,
> >
> > I'm still looking for the answer of these questions. Maybe
> > someone can
> > share their thought on these. Any comment will be helpful too.
> >
> > Best regards,
> >
> > On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> > 

Re: [ceph-users] Updating ceps client - what will happen to services like NFS on clients

2017-09-25 Thread David Turner
It depends a bit on how you have the RBDs mapped.  If you're mapping them
using krbd, then they don't need to be updated to use the new rbd-fuse or
rbd-nbd code.  If you're using one of the latter, then you should schedule
a time to restart the mounts so that they're mapped with the new Ceph
version.

In general RBDs are not affected by upgrades as long as you don't take down
too much of the cluster at once and are properly doing a rolling upgrade.
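
A quick way to tell which case applies on a client (both commands exist in
Jewel/Luminous builds; output formats vary by version):

rbd showmapped         # kernel (krbd) mappings, e.g. /dev/rbd0 - no client remap needed
rbd-nbd list-mapped    # rbd-nbd mappings, e.g. /dev/nbd0 - remap these after upgrading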

On Mon, Sep 25, 2017 at 8:07 AM David  wrote:

> Hi Götz
>
> If you did a rolling upgrade, RBD clients shouldn't have experienced
> interrupted IO and therefore IO to NFS exports shouldn't have been affected.
> However, in the past when using kernel NFS over kernel RBD, I did have some
> lockups when OSDs went down in the cluster so that's something to watch out
> for.
>
>
> On Mon, Sep 25, 2017 at 8:38 AM, Götz Reinicke <
> goetz.reini...@filmakademie.de> wrote:
>
>> Hi,
>>
>> I updated our ceph OSD/MON Nodes from 10.2.7 to 10.2.9 and everything
>> looks good so far.
>>
>> Now I was wondering (as I may have forgotten how this works) what will
>> happen to a  NFS server which has the nfs shares on a ceph rbd ? Will the
>> update interrupt any access to the NFS share or is it that smooth that e.g.
>> clients accessing the NFS share will not notice?
>>
>> Thanks for some lecture on managing ceph and regards . Götz
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Dietmar Rieder
On 09/25/2017 02:59 PM, Mark Nelson wrote:
> On 09/25/2017 03:31 AM, TYLin wrote:
>> Hi,
>>
>> To my understanding, the bluestore write workflow is
>>
>> For normal big write
>> 1. Write data to block
>> 2. Update metadata to rocksdb
>> 3. Rocksdb write to memory and block.wal
>> 4. Once reach threshold, flush entries in block.wal to block.db
>>
>> For overwrite and small write
>> 1. Write data and metadata to rocksdb
>> 2. Apply the data to block
>>
>> Seems we don’t have a formula or suggestion to the size of block.db.
>> It depends on the object size and number of objects in your pool. You
>> can just give big partition to block.db to ensure all the database
>> files are on that fast partition. If block.db full, it will use block
>> to put db files, however, this will slow down the db performance. So
>> give db size as much as you can.
> 
> This is basically correct.  What's more, it's not just the object size,
> but the number of extents, checksums, RGW bucket indices, and
> potentially other random stuff.  I'm skeptical how well we can estimate
> all of this in the long run.  I wonder if we would be better served by
> just focusing on making it easy to understand how the DB device is being
> used, how much is spilling over to the block device, and make it easy to
> upgrade to a new device once it gets full.
> 
>>
>> If you want to put wal and db on same ssd, you don’t need to create
>> block.wal. It will implicitly use block.db to put wal. The only case
>> you need block.wal is that you want to separate wal to another disk.
> 
> I always make explicit partitions, but only because I (potentially
> illogically) like it that way.  There may actually be some benefits to
> using a single partition for both if sharing a single device.

is this "Single db/wal partition" then to be used for all OSDs on a node
or do you need to create a separate "Single db/wal partition" for each
OSD  on the node?

> 
>>
>> I’m also studying bluestore, this is what I know so far. Any
>> correction is welcomed.
>>
>> Thanks
>>
>>
>>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
>>>  wrote:
>>>
>>> I asked the same question a couple of weeks ago. No response I got
>>> contradicted the documentation but nobody actively confirmed the
>>> documentation was correct on this subject, either; my end state was
>>> that I was relatively confident I wasn't making some horrible mistake
>>> by simply specifying a big DB partition and letting bluestore work
>>> itself out (in my case, I've just got HDDs and SSDs that were
>>> journals under filestore), but I could not be sure there wasn't some
>>> sort of performance tuning I was missing out on by not specifying
>>> them separately.
>>>
>>> Rich
>>>
>>> On 21/09/17 20:37, Benjeman Meekhof wrote:
 Some of this thread seems to contradict the documentation and confuses
 me.  Is the statement below correct?

 "The BlueStore journal will always be placed on the fastest device
 available, so using a DB device will provide the same benefit that the
 WAL device would while also allowing additional metadata to be stored
 there (if it will fit)."

 http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices


  it seems to be saying that there's no reason to create separate WAL
 and DB partitions if they are on the same device.  Specifying one
 large DB partition per OSD will cover both uses.

 thanks,
 Ben

 On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
  wrote:
> On 09/21/2017 05:03 PM, Mark Nelson wrote:
>>
>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
 On 2017-09-21 07:56, Lazuardi Nasution wrote:

> Hi,
>
> I'm still looking for the answer of these questions. Maybe
> someone can
> share their thought on these. Any comment will be helpful too.
>
> Best regards,
>
> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> > wrote:
>
>     Hi,
>
>     1. Is it possible configure use osd_data not as small
> partition on
>     OSD but a folder (ex. on root disk)? If yes, how to do that
> with
>     ceph-disk and any pros/cons of doing that?
>     2. Is WAL & DB size calculated based on OSD size or expected
>     throughput like on journal device of filestore? If no, what
> is the
>     default value and pro/cons of adjusting that?
>     3. Is partition alignment matter on Bluestore, including
> WAL & DB
>     if using separate device for them?
>
>     Best regards,
>
>
> ___
> ceph-users mailing list
> 

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Mark Nelson

On 09/25/2017 03:31 AM, TYLin wrote:

Hi,

To my understanding, the bluestore write workflow is

For normal big write
1. Write data to block
2. Update metadata to rocksdb
3. Rocksdb write to memory and block.wal
4. Once reach threshold, flush entries in block.wal to block.db

For overwrite and small write
1. Write data and metadata to rocksdb
2. Apply the data to block

Seems we don’t have a formula or suggestion to the size of block.db. It depends 
on the object size and number of objects in your pool. You can just give big 
partition to block.db to ensure all the database files are on that fast 
partition. If block.db full, it will use block to put db files, however, this 
will slow down the db performance. So give db size as much as you can.


This is basically correct.  What's more, it's not just the object size, 
but the number of extents, checksums, RGW bucket indices, and 
potentially other random stuff.  I'm skeptical how well we can estimate 
all of this in the long run.  I wonder if we would be better served by 
just focusing on making it easy to understand how the DB device is being 
used, how much is spilling over to the block device, and make it easy to 
upgrade to a new device once it gets full.




If you want to put wal and db on same ssd, you don’t need to create block.wal. 
It will implicitly use block.db to put wal. The only case you need block.wal is 
that you want to separate wal to another disk.


I always make explicit partitions, but only because I (potentially 
illogically) like it that way.  There may actually be some benefits to 
using a single partition for both if sharing a single device.




I’m also studying bluestore, this is what I know so far. Any correction is 
welcomed.

Thanks



On Sep 22, 2017, at 5:27 PM, Richard Hesketh  
wrote:

I asked the same question a couple of weeks ago. No response I got contradicted 
the documentation but nobody actively confirmed the documentation was correct 
on this subject, either; my end state was that I was relatively confident I 
wasn't making some horrible mistake by simply specifying a big DB partition and 
letting bluestore work itself out (in my case, I've just got HDDs and SSDs that 
were journals under filestore), but I could not be sure there wasn't some sort 
of performance tuning I was missing out on by not specifying them separately.

Rich

On 21/09/17 20:37, Benjeman Meekhof wrote:

Some of this thread seems to contradict the documentation and confuses
me.  Is the statement below correct?

"The BlueStore journal will always be placed on the fastest device
available, so using a DB device will provide the same benefit that the
WAL device would while also allowing additional metadata to be stored
there (if it will fit)."

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices

 it seems to be saying that there's no reason to create separate WAL
and DB partitions if they are on the same device.  Specifying one
large DB partition per OSD will cover both uses.

thanks,
Ben

On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
 wrote:

On 09/21/2017 05:03 PM, Mark Nelson wrote:


On 09/21/2017 03:17 AM, Dietmar Rieder wrote:

On 09/21/2017 09:45 AM, Maged Mokhtar wrote:

On 2017-09-21 07:56, Lazuardi Nasution wrote:


Hi,

I'm still looking for the answer of these questions. Maybe someone can
share their thought on these. Any comment will be helpful too.

Best regards,

On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> wrote:

Hi,

1. Is it possible configure use osd_data not as small partition on
OSD but a folder (ex. on root disk)? If yes, how to do that with
ceph-disk and any pros/cons of doing that?
2. Is WAL & DB size calculated based on OSD size or expected
throughput like on journal device of filestore? If no, what is the
default value and pro/cons of adjusting that?
3. Is partition alignment matter on Bluestore, including WAL & DB
if using separate device for them?

Best regards,


___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



I am also looking for recommendations on wal/db partition sizes. Some
hints:

ceph-disk defaults used in case it does not find
bluestore_block_wal_size or bluestore_block_db_size in config file:

wal =  512MB

db = if bluestore_block_size (data size) is in config file it uses 1/100
of it else it uses 1G.

There is also a presentation by Sage back in March, see page 16:

https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in


wal: 512 MB

db: "a few" GB

the wal size is probably not debatable, it will be like a journal for
small block sizes which are constrained by iops hence 512 MB is more
than enough. Probably we will see more on the db size in 

Re: [ceph-users] RBD features(kernel client) with kernel version

2017-09-25 Thread Ilya Dryomov
On Sat, Sep 23, 2017 at 12:07 AM, Muminul Islam Russell
 wrote:
> Hi Ilya,
>
> Hope you are doing great.
> Sorry for bugging you. I did not find enough resources for my question. It
> would really help me if you could reply. My questions are in red
> colour.
>
>  - layering: layering support:
> Kernel: 3.10 and plus, right?

Yes.

>  - striping: striping v2 support:
> What kernel is supporting this feature?

Only the default striping v2 pattern (i.e. stripe unit == object size
and stripe count == 1) is supported.

> - exclusive-lock: exclusive locking support:
> It's supposed to be 4.9. Right?

Yes.

>
>
> Are the rest of the features below under development, or are any of them
> available in any recent kernel?
>   - object-map: object map support (requires exclusive-lock):
>   - fast-diff: fast diff calculations (requires object-map):
>   - deep-flatten: snapshot flatten support:
>   - journaling: journaled IO support (requires exclusive-lock):

The former; none of these are available in the latest kernels.

A separate data pool feature (rbd create --data-pool ) is
supported since 4.11.

Thanks,

Ilya
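
As an illustration of the feature split above (pool and image names are
placeholders):

# krbd on a >= 4.9 kernel: layering and exclusive-lock are usable; object-map,
# fast-diff, deep-flatten, journaling and non-default striping are not.
rbd create rbd/test-img --size 10G --image-feature layering,exclusive-lock
# a separate data pool needs kernel >= 4.11:
rbd create rbd/test-img2 --size 10G --data-pool my-ec-pool --image-feature layering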


Re: [ceph-users] Updating ceps client - what will happen to services like NFS on clients

2017-09-25 Thread David
Hi Götz

If you did a rolling upgrade, RBD clients shouldn't have experienced
interrupted IO and therefore IO to NFS exports shouldn't have been affected.
However, in the past when using kernel NFS over kernel RBD, I did have some
lockups when OSDs went down in the cluster so that's something to watch out
for.


On Mon, Sep 25, 2017 at 8:38 AM, Götz Reinicke <
goetz.reini...@filmakademie.de> wrote:

> Hi,
>
> I updated our ceph OSD/MON Nodes from 10.2.7 to 10.2.9 and everything
> looks good so far.
>
> Now I was wondering (as I may have forgotten how this works) what will
> happen to a  NFS server which has the nfs shares on a ceph rbd ? Will the
> update interrupt any access to the NFS share or is it that smooth that e.g.
> clients accessing the NFS share will not notice?
>
> Thanks for some lecture on managing ceph and regards . Götz
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


[ceph-users] CephFS Luminous | MDS frequent "replicating dir" message in log

2017-09-25 Thread David
Hi All

Since upgrading a cluster from Jewel to Luminous I'm seeing a lot of the
following line in my ceph-mds log (path name changed by me - the messages
refer to different dirs)

2017-09-25 12:47:23.073525 7f06df730700  0 mds.0.bal replicating dir [dir
0x1003e5b /path/to/dir/ [2,head] auth v=50477 cv=50465/50465 ap=0+3+4
state=1610612738|complete f(v0 m2017-03-27 11:04:17.935529 51=19+32)
n(v3297 rc2017-09-25 12:46:13.379651 b14050737379 13086=10218+2868)/n(v3297
rc2017-09-25 12:46:13.052651 b14050862881 13083=10215+2868) hs=51+0,ss=0+0
dirty=1 | child=1 dirty=1 waiter=0 authpin=0 0x7f0707298000] pop 13139 ..
rdp 191 adj 0

I've not had any issues reported; I'm just interested to know why I'm suddenly
seeing a lot of these messages, as the client versions and workload haven't
changed. Anything to be concerned about?

Single MDS with standby-replay
Luminous 12.2.0
Kernel clients: 3.10.0-514.2.2.el7.x86_64

Thanks,
David


Re: [ceph-users] librmb: Mail storage on RADOS with Dovecot

2017-09-25 Thread Danny Al-Gaaf
Am 25.09.2017 um 10:00 schrieb Marc Roos:
>  
> 
> But from the looks of this dovecot mailinglist post, you didn’t start 
> your project with talking to the dovecot guys, or have an ongoing 
> communication with them during the development. I would think with that 
> their experience could be a valuable asset. I am not talking about just 
> giving some files at the end.

This look may be misleading.

We discussed the idea of an open source Ceph/RADOS implementation for
dovecot with the dovecot guys before we started. The outcome of the
discussions, from our side, was to bring this project alive and
sponsor it to get a generic solution.

We discussed a generic librmb approach also e.g. with Sage and others in
the Ceph community (and there was the tracker item #12430) so that this
library can be used also in other email server projects.

Anyway: we invite everybody, especially the dovecot community, to
participate and contribute to make this project successful. The goal is
not to have a new project besides dovecot. We are more than happy to
contribute the code to the corresponding projects and then close/remove
this repo from github. This is the final goal, what you see now is
hopefully only an intermediate step.

> Ps. Is there some index of these slides? I have problems browsing back 
> to a specific one constantly.

With ESC you should get an overview or you use 'm' to get to the menu.

Danny


Re: [ceph-users] Ceph release cadence

2017-09-25 Thread Joao Eduardo Luis

I am happy with this branch of the thread!

I'm guessing this would start post-Mimic though, if no one objects and 
if we want to target a March release?


  -Joao

On 09/23/2017 02:58 AM, Sage Weil wrote:

On Fri, 22 Sep 2017, Gregory Farnum wrote:

On Fri, Sep 22, 2017 at 3:28 PM, Sage Weil  wrote:

Here is a concrete proposal for everyone to summarily shoot down (or
heartily endorse, depending on how your friday is going):

- 9 month cycle
- enforce a predictable release schedule with a freeze date and
   a release date.  (The actual .0 release of course depends on no blocker
   bugs being open; not sure how zealous 'train' style projects do
   this.)


Train projects basically commit to a feature freeze enough in advance
of the expected release date that it's feasible, and don't let people
fake it by rushing in stuff they "finished" the day before. I'm not
sure if every-9-month LTSes will be more conducive to that or not — if
we do scheduled releases, we still fundamentally need to be able to
say "nope, that feature we've been saying for 9 months we hope to have
out in this LTS won't make it until the next one". And we seem pretty
bad at that.


I'll be the first to say I'm no small part of the "we" there.  But I'm
also suggesting that's not a reason not to try to do better.  As I
said I think this will be easier than in the past because we don't
have as many headline features we're trying to wedge in.

In any case, is there an alternative way to get to the much-desired
regular cadence?


- no more even/odd pattern; all stable releases are created equal.
- support upgrades from up to 3 releases back.

This shortens the cycle a bit to relieve the "this feature must go in"
stress, without making it so short as to make the release pointless (e.g.,
infernalis, kraken).  (I also think that the feature pressure is much
lower now than it has been in the past.)

This creates more work for the developers because there are more upgrade
paths to consider: we no longer have strict "choke points" (like all
upgrades must go through luminous).  We could reserve the option to pick
specific choke point releases in the future, perhaps taking care to make
sure these are the releases that go into downstream distros.  We'll need
to be more systematic about the upgrade testing.


This sounds generally good to me — we did multiple-release upgrades
for a long time, and stuff is probably more complicated now but I
don't think it will actually be that big a deal.

3 releases back might be a bit much though — that's 27 months! (For
luminous, the beginning of 2015. Hammer.)


I'm *much* happier with 2 :) so no complaint from me.  I just heard a lot
of "2 years" and 2 releases (18 months) doesn't quite cover it.  Maybe
it's best to start with that, though?  It's still an improvement over the
current ~12 months.


Somewhat separately, several people expressed concern about having stable
releases to develop against.  This is somewhat orthogonal to what users
need.  To that end, we can do a dev checkpoint every 1 or 2 months
(preferences?), where we fork a 'next' branch and stabilize all of the
tests before moving on.  This is good practice anyway to avoid
accumulating low-frequency failures in the test suite that have to be
squashed at the end.


So this sounds like a fine idea to me, but how do we distinguish this
from the intermediate stable releases?

By which I mean, are we *really* going to do a stabilization branch
that will never get seen by users? What kind of testing and bug fixing
are we going to commit to doing against it, and how do we balance that
effort with feature work?

It seems like the same conflict we have now, only since the dev
checkpoints are less important they'll lose more often. Then we'll end
up having 9 months of scheduled work to debug for a user release
instead of 5 months that slipped to 7 or 8...


What if we frame this stabilization period in terms of stability of the
test suite.  That gives us something concrete to aim for, lets us move on
when we reach some threshold, and aligns perfectly with the thing that
makes it hard to safely land new code (noisy test results)...

sage





Re: [ceph-users] [RGW] SignatureDoesNotMatch using curl

2017-09-25 Thread Дмитрий Глушенок
You must use a triple "\n" after GET in stringToSign (the empty Content-MD5 and 
Content-Type lines still need their newlines). See 
http://docs.aws.amazon.com/AmazonS3/latest/dev/RESTAuthentication.html
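
Applied to the script quoted below, the only change is the stringToSign line
(verb, empty Content-MD5, empty Content-Type, date, resource):

stringToSign="GET\n\n\n${dateValue}\n${resource}"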

> 18 сент. 2017 г., в 12:23, junho_k...@tmax.co.kr  
> написал(а):
> 
> I’m trying to use Ceph Object Storage in CLI.
> I used curl to make a request to the RGW with S3 way.
>  
> When I use a python library, which is boto, all things work fine, but when I 
> tried to make same request using curl, I always got error 
> “SignatureDoesNotMatch”
> I don’t know what goes wrong.
>  
> Here is my script when I tried to make a request using curl
> ---
> #!/bin/bash
>  
> resource="/my-new-bucket/"
> dateValue=`date -Ru`
> S3KEY="MY_KEY"
> S3SECRET="MY_SECRET_KEY"
> stringToSign="GET\n\n${dateValue}\n${resource}"
> signature=`echo -en ${stringToSign} | openssl sha1 -hmac ${S3SECRET} -binary 
> | base64`
>  
> curl -X GET \
> -H "authorization: AWS ${S3KEY}:${signature}"\
> -H "date: ${dateValue}"\
> -H "host: 10.0.2.15:7480"\
> http://10.0.2.15:7480/my-new-bucket  
> --verbose
>  
> 
>  
> The result 
> 
> <?xml version="1.0" encoding="UTF-8"?><Error><Code>SignatureDoesNotMatch</Code><RequestId>tx00019-0059bf7de0-5e25-default</RequestId><HostId>5e25-default-default</HostId></Error>
> 
>  
> Ceph log in /var/log/ceph/ceph-client.rgw.node0.log is
> 
> 2017-09-18 16:51:50.922935 7fc996fa5700  1 == starting new request 
> req=0x7fc996f9f7e0 =
> 2017-09-18 16:51:50.923135 7fc996fa5700  1 == req done req=0x7fc996f9f7e0 
> op status=0 http_status=403 ==
> 2017-09-18 16:51:50.923156 7fc996fa5700  1 civetweb: 0x7fc9cc00d0c0: 
> 10.0.2.15 - - [18/Sep/2017:16:51:50 +0900] "GET /my-new-bucket HTTP/1.1" 403 
> 0 - curl/7.47.0
> 
>  
> Many Thanks
> -Juno
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
--
Dmitry Glushenok
Jet Infosystems



Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread TYLin
Hi, 

To my understanding, the bluestore write workflow is

For normal big write
1. Write data to block
2. Update metadata to rocksdb
3. Rocksdb write to memory and block.wal
4. Once reach threshold, flush entries in block.wal to block.db

For overwrite and small write
1. Write data and metadata to rocksdb
2. Apply the data to block 
 
It seems we don’t have a formula or suggestion for the size of block.db. It depends 
on the object size and the number of objects in your pool. You can just give a big 
partition to block.db to ensure all the database files are on that fast 
partition. If block.db is full, it will use block to put db files; however, this 
will slow down the db performance. So give the db as much size as you can.

If you want to put the wal and db on the same SSD, you don’t need to create block.wal. 
The wal will implicitly live inside block.db. The only case where you need block.wal is 
when you want to separate the wal onto another disk.
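
For example, with Luminous ceph-disk (device names are placeholders):

ceph-disk prepare --bluestore /dev/sdb --block.db /dev/nvme0n1
# db on the fast device, wal implicitly inside block.db
ceph-disk prepare --bluestore /dev/sdb --block.db /dev/nvme0n1 --block.wal /dev/nvme1n1
# wal explicitly split out onto a third device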

I’m also studying bluestore, this is what I know so far. Any correction is 
welcomed.

Thanks


> On Sep 22, 2017, at 5:27 PM, Richard Hesketh  
> wrote:
> 
> I asked the same question a couple of weeks ago. No response I got 
> contradicted the documentation but nobody actively confirmed the 
> documentation was correct on this subject, either; my end state was that I 
> was relatively confident I wasn't making some horrible mistake by simply 
> specifying a big DB partition and letting bluestore work itself out (in my 
> case, I've just got HDDs and SSDs that were journals under filestore), but I 
> could not be sure there wasn't some sort of performance tuning I was missing 
> out on by not specifying them separately.
> 
> Rich
> 
> On 21/09/17 20:37, Benjeman Meekhof wrote:
>> Some of this thread seems to contradict the documentation and confuses
>> me.  Is the statement below correct?
>> 
>> "The BlueStore journal will always be placed on the fastest device
>> available, so using a DB device will provide the same benefit that the
>> WAL device would while also allowing additional metadata to be stored
>> there (if it will fit)."
>> 
>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
>> 
>>  it seems to be saying that there's no reason to create separate WAL
>> and DB partitions if they are on the same device.  Specifying one
>> large DB partition per OSD will cover both uses.
>> 
>> thanks,
>> Ben
>> 
>> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
>>  wrote:
>>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
 
 On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>> 
>>> Hi,
>>> 
>>> I'm still looking for the answer of these questions. Maybe someone can
>>> share their thought on these. Any comment will be helpful too.
>>> 
>>> Best regards,
>>> 
>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>>> > wrote:
>>> 
>>> Hi,
>>> 
>>> 1. Is it possible configure use osd_data not as small partition on
>>> OSD but a folder (ex. on root disk)? If yes, how to do that with
>>> ceph-disk and any pros/cons of doing that?
>>> 2. Is WAL & DB size calculated based on OSD size or expected
>>> throughput like on journal device of filestore? If no, what is the
>>> default value and pro/cons of adjusting that?
>>> 3. Is partition alignment matter on Bluestore, including WAL & DB
>>> if using separate device for them?
>>> 
>>> Best regards,
>>> 
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> I am also looking for recommendations on wal/db partition sizes. Some
>> hints:
>> 
>> ceph-disk defaults used in case it does not find
>> bluestore_block_wal_size or bluestore_block_db_size in config file:
>> 
>> wal =  512MB
>> 
>> db = if bluestore_block_size (data size) is in config file it uses 1/100
>> of it else it uses 1G.
>> 
>> There is also a presentation by Sage back in March, see page 16:
>> 
>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>> 
>> 
>> wal: 512 MB
>> 
>> db: "a few" GB
>> 
>> the wal size is probably not debatable, it will be like a journal for
>> small block sizes which are constrained by iops hence 512 MB is more
>> than enough. Probably we will see more on the db size in the future.
> This is what I understood so far.
> I wonder if it makes sense to set the db size as big as possible and
> divide the entire db device size by the number of OSDs it will serve.
> 
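
A rough sketch of that division (numbers and device are placeholders, not a
recommendation from the thread): a 400 GB NVMe shared by 10 OSDs would give
each OSD a 40 GB block.db partition, e.g.

sgdisk --new=1:0:+40G --change-name=1:"osd-0 block.db" /dev/nvme0n1
# repeat with partition numbers 2..10 for the remaining OSDs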

Re: [ceph-users] erasure code profile

2017-09-25 Thread Vincent Godin
If you have at least 2 hosts per room, you can use k=3 and m=3 and
place 2 shards per room (one on each host). So you'll need 3 shards to
read the data: you can lose a room and one host in the two other
rooms and still get your data. It covers double faults, which is
better.
It will take more space: your proposal uses 1.5 x Data; this one uses 2 x Data.
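
A rough sketch of what that could look like (profile and rule names are made
up; the rule assumes a CRUSH map with room and host buckets and is added via
the usual crushtool decompile/edit/compile cycle):

ceph osd erasure-code-profile set ec-k3m3 k=3 m=3 crush-failure-domain=host

# custom CRUSH rule: 3 rooms, 2 hosts per room -> 6 shards, 2 per room
rule ec_k3m3_rooms {
        id 10
        type erasure
        min_size 6
        max_size 6
        step set_chooseleaf_tries 5
        step take default
        step choose indep 3 type room
        step chooseleaf indep 2 type host
        step emit
}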


Re: [ceph-users] librmb: Mail storage on RADOS with Dovecot

2017-09-25 Thread Marc Roos
 

But from the looks of this dovecot mailinglist post, you didn’t start 
your project by talking to the dovecot guys, or have ongoing 
communication with them during the development. I would think that 
their experience could be a valuable asset. I am not talking about just 
giving them some files at the end.

Ps. Is there some index of these slides? I have problems browsing back 
to a specific one constantly.


-Original Message-
From: Danny Al-Gaaf [mailto:danny.al-g...@bisect.de] 
Sent: maandag 25 september 2017 9:37
To: Marc Roos; ceph-users
Subject: Re: [ceph-users] librmb: Mail storage on RADOS with Dovecot

Am 25.09.2017 um 09:00 schrieb Marc Roos:
>  
> From the looks of it, too bad the efforts could not be
> combined/coordinated; that seems to be an issue with many open source 
> initiatives.

That's not right. The plan is to contribute the librmb code to the Ceph 
project and the Dovecot part back to the Dovecot project (as described 
in the slides) as soon as we know that it will work with real live load.

We simply needed a place to start with it, then we split the code into 
parts to move it to the corresponding projects.

Danny




[ceph-users] Updating ceps client - what will happen to services like NFS on clients

2017-09-25 Thread Götz Reinicke
Hi,

I updated our ceph OSD/MON Nodes from 10.2.7 to 10.2.9 and everything looks 
good so far.

Now I was wondering (as I may have forgotten how this works) what will happen 
to an NFS server which has its NFS shares on a ceph rbd? Will the update 
interrupt any access to the NFS share, or is it smooth enough that e.g. clients 
accessing the NFS share will not notice?

Thanks for some lecture on managing ceph and regards . Götz



Re: [ceph-users] librmb: Mail storage on RADOS with Dovecot

2017-09-25 Thread Danny Al-Gaaf
Am 25.09.2017 um 09:00 schrieb Marc Roos:
>  
> From the looks of it, too bad the efforts could not be 
> combined/coordinated; that seems to be an issue with many open source 
> initiatives.

That's not right. The plan is to contribute the librmb code to the Ceph
project and the Dovecot part back to the Dovecot project (as described
in the slides) as soon as we know that it will work with real live load.

We simply needed a place to start with it, then we split the code into
parts to move it to the corresponding projects.

Danny


Re: [ceph-users] librmb: Mail storage on RADOS with Dovecot

2017-09-25 Thread Wido den Hollander

> Op 22 september 2017 om 23:56 schreef Gregory Farnum :
> 
> 
> On Fri, Sep 22, 2017 at 2:49 PM, Danny Al-Gaaf  
> wrote:
> > Am 22.09.2017 um 22:59 schrieb Gregory Farnum:
> > [..]
> >> This is super cool! Is there anything written down that explains this
> >> for Ceph developers who aren't familiar with the workings of Dovecot?
> >> I've got some questions I see going through it, but they may be very
> >> dumb.
> >>
> >> *) Why are indexes going on CephFS? Is this just about wanting a local
> >> cache, or about the existing Dovecot implementations, or something
> >> else? Almost seems like you could just store the whole thing in a
> >> CephFS filesystem if that's safe. ;)
> >
> > This is, if everything works as expected, only an intermediate step. An
> > idea is
> > (https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/#/status-3)
> > be to use omap to store the index/meta data.
> >
> > We chose a step-by-step approach and since we are currently not sure if
> > using omap would work performance wise, we use CephFS (also since this
> > requires no changes in Dovecot). Currently we put our focus on the
> > development of the first version of librmb, but the code to use omap is
> > already there. It needs integration, testing, and performance tuning to
> > verify if it would work with our requirements.
> >
> >> *) It looks like each email is getting its own object in RADOS, and I
> >> assume those are small messages, which leads me to
> >
> > The mail distribution looks like this:
> > https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/#/mailplatform-mails-dist
> >
> >
> > Yes, the majority of the mails are under 500k, but the most objects are
> > around 50k. Not so many very small objects.
> 
> Ah, that slide makes more sense with that context — I was paging
> through it in bed last night and thought it was about the number of
> emails per user or something weird.
> 
> So those mail objects are definitely bigger than I expected; interesting.
> 
> >
> >>   *) is it really cost-acceptable to not use EC pools on email data?
> >
> > We will use EC pools for the mail objects and replication for CephFS.
> >
> > But even without EC there would be a cost case compared to the current
> > system. We will save a large amount of IOPs in the new platform since
> > the (NFS) POSIX layer is removed from the IO path (at least for the mail
> > objects). And we expect with Ceph and commodity hardware we can compete
> > with a traditional enterprise NAS/NFS anyway.
> >
> >>   *) isn't per-object metadata overhead a big cost compared to the
> >> actual stored data?
> >
> > I assume not. The metadata/index is not so much compared to the size of
> > the mails (currently with NFS around 10% I would say). In the classic
> > NFS based dovecot the number of index/cache/metadata files is an issue
> > anyway. With 6.7 billion mails we have 1.2 billion index/cache/metadata
> > files
> > (https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/#/mailplatform-mails-nums).
> 
> I was unclear; I meant the RADOS metadata cost of storing an object. I
> haven't quantified that in a while but it was big enough to make 4KB
> objects pretty expensive, which I was incorrectly assuming would be
> the case for most emails.
> EC pools have the same issue; if you want to erasure-code a 40KB
> object into 5+3 then you pay the metadata overhead for each 8KB
> (40KB/5) of data, but again that's more on the practical side of
> things than my initial assumptions placed it.

Yes, it is. But combining objects isn't easy either. RGW also has this 
limitation where objects are striped in RADOS and the EC overhead can become 
large.

At this moment the price/GB (correct me if needed, Danny!) isn't the biggest 
problem. It could be that all mails will be stored on a replicated pool.

There also might be some overhead in BlueStore per object, but the numbers from 
Deutsche Telekom show that mails usually aren't 4kb. Only a small portion of 
e-mails is 4kb.

We will see how this turns out.

Wido

> 
> This is super cool!
> -Greg


Re: [ceph-users] librmb: Mail storage on RADOS with Dovecot

2017-09-25 Thread Marc Roos
 
From the looks of it, too bad the efforts could not be 
combined/coordinated; that seems to be an issue with many open source 
initiatives.


-Original Message-
From: mj [mailto:li...@merit.unu.edu] 
Sent: zondag 24 september 2017 16:37
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] librmb: Mail storage on RADOS with Dovecot

Hi,

I forwarded your announcement to the dovecot mailinglist. The following 
reply to it was posted there by Timo Sirainen. I'm forwarding it 
here, as you might not be reading the dovecot mailinglist.

Wido:
> First, the Github link:
> https://github.com/ceph-dovecot/dovecot-ceph-plugin
> 
> I am not going to repeat everything which is on Github, put a short 
summary:
> 
> - CephFS is used for storing Mailbox Indexes
> - E-Mails are stored directly as RADOS objects
> - It's a Dovecot plugin
> 
> We would like everybody to test librmb and report back issues on 
Github so that further development can be done.
> 
> It's not finalized yet, but all the help is welcome to make librmb the 
best solution for storing your e-mails on Ceph with Dovecot.

Timo:
It would have been nicer if RADOS support was implemented as a lib-fs 
driver, and the fs-API had been used all over the place elsewhere. So 1) 
LibRadosMailBox wouldn't have been relying so much on RADOS specifically 
and 2) fs-rados could have been used for other purposes. There are 
already fs-dict and dict-fs drivers, so the RADOS dict driver may not 
have been necessary to implement if fs-rados was implemented instead 
(although I didn't check it closely enough to verify). (We've had 
fs-rados on our TODO list for a while also.)

BTW. We've also been planning on open sourcing some of the obox pieces, 
mainly fs-drivers (e.g. fs-s3). The obox format maybe too, but without 
the "metacache" piece. The current obox code is a bit too much married 
into the metacache though to make open sourcing it easy. (The metacache 
is about storing the Dovecot index files in object storage and 
efficiently caching them on local filesystem, which isn't planned to be 
open sourced in near future. That's pretty much the only difficult piece 
of the obox plugin, with Cassandra integration coming as a good second. 
I wish there had been a better/easier geo-distributed key-value database 
to use - tombstones are annoyingly troublesome.)

And using rmb-mailbox format, my main worries would be:
  * doesn't store index files (= message flags) - not necessarily a 
problem, as long as you don't want geo-replication
  * index corruption means rebuilding them, which means rescanning list 
of mail files, which means rescanning the whole RADOS namespace, which 
practically means  rescanning the RADOS pool. That most likely is a very 
very slow operation, which you want to avoid unless it's absolutely 
necessary. Need to be very careful to avoid that happening, and in 
general to avoid losing mails in case of crashes or other bugs.
  * I think copying/moving mails physically copies the full data on disk
  * Each IMAP/POP3/LMTP/etc process connects to RADOS separately from 
each others - some connection pooling would likely help here



[ceph-users] A new monitor can not be added to the Luminous cluster

2017-09-25 Thread Alexander Khodnev

Hi

I have a Luminous cluster of 5 servers: from server1 to server5.
The operating system is Devuan.
There are three monitors: server1, server3, server5.
I'm trying to add a new monitor to server2 manually:

mkdir /var/lib/ceph/mon/ceph-server2

ceph auth get mon. -o mon.keyring

ceph mon getmap -o mon.map

ceph-mon -i server2 --mkfs --monmap mon.map --keyring mon.keyring

touch /var/lib/ceph/mon/ceph-server2/done

touch /var/lib/ceph/mon/ceph-server2/sysvinit

chown -R ceph:ceph /var/lib/ceph/mon/

After the monitor is started on server2, the cluster loses one of 
the monitors in the quorum and elections begin. The new monitor is not 
added.
Until I stop the new monitor service, one of the monitors is always 
down and elections are continuously triggered.

It seems to me that the new monitor is trying to replace the existing one.
NTP looks good.

I can show the level 20 monitor logs and any additional information 
about the cluster.


I will be grateful for any thoughts that will help me understand and 
solve this problem.


Regards,
Alexander
