[ceph-users] Using multisite to migrate data between bucket data pools.

2019-10-30 Thread David Turner
This is a tangent on Paul Emmerich's response to "[ceph-users] Correct
Migration Workflow Replicated -> Erasure Code". I've tried Paul's method
before to migrate between two data pools; however, I ran into some issues.

The first issue seems like a bug in RGW: the RGW for the new zone was
able to pull data directly from the data pool of the original zone after
the metadata had been sync'd. Because the metadata showed the object
exists, the new zone's RGW went ahead and grabbed it from the pool backing
the other zone. I worked around that somewhat by using cephx to restrict
which pools each RGW user could access, but that gives a permission denied
error instead of a file not found error. This happens on buckets that are
set not to replicate as well as buckets that failed to sync properly. It
seems like a bit of a security threat, but not a super common situation at all.
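
For reference, the cephx workaround looked roughly like this (the client name
and pool names below are placeholders for the actual second-zone key and
pools):

  ceph auth caps client.rgw.zone2 \
    mon 'allow rw' \
    osd 'allow rwx pool=zone2.rgw.buckets.data, allow rwx pool=zone2.rgw.buckets.index, allow rwx pool=zone2.rgw.log, allow rwx pool=zone2.rgw.control, allow rwx pool=zone2.rgw.meta, allow rwx pool=.rgw.root'

Scoped like that, the second zone's RGW can no longer read the first zone's
data pool directly, which is what turns the cross-zone read into the
permission denied error mentioned above.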

The second issue, I think, has to do with corrupt index files in my index
pool. Some of the buckets I don't need any more, so I went to delete them
for simplicity, but the command failed to delete them. I've set those
aside for now, since the ones I don't need can simply be set not to
replicate at the bucket level. That works for most things, but there are
a few buckets that I do need to migrate, and when I set them to start
replicating the data sync between zones gets stuck. Does anyone have any
ideas on how to clean up the bucket indexes to make these operations
possible?

At this point I've disabled multisite and cleared up the new zone so I can
run operations on these buckets without dealing with multisite and
replication. I've tried a few things and can get some additional
information on my specific errors tomorrow at work.
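
For anyone curious, the sort of thing I've been poking at is along these
lines (the bucket name is a placeholder; the read-only check runs before
anything with --fix):

  radosgw-admin bucket check --bucket=<bucket-name>
  radosgw-admin bucket check --bucket=<bucket-name> --check-objects --fix
  radosgw-admin bi list --bucket=<bucket-name> | head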


-- Forwarded message -
From: Paul Emmerich 
Date: Wed, Oct 30, 2019 at 4:32 AM
Subject: [ceph-users] Re: Correct Migration Workflow Replicated -> Erasure
Code
To: Konstantin Shalygin 
Cc: Mac Wynkoop , ceph-users 


We've solved this off-list (because I already had access to the cluster).

For the list:

Copying at the rados level is possible, but it requires shutting down
radosgw to get a consistent copy. That wasn't feasible here due to the
size and performance requirements.
We've instead added a second zone, whose placement maps to an EC pool,
to the zonegroup, and it's currently copying over data. We'll then make
the second zone master and default and ultimately delete the first one.
This allows for a migration without downtime.

Another possibility would be using a Transition lifecycle rule, but
that's not ideal because it doesn't actually change the bucket.

I don't think it would be too complicated to add a native bucket
migration mechanism that works similar to "bucket rewrite" (which is
intended for something similar but different).

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph balancer do not start

2019-10-22 Thread David Turner
Off the top of my head, I'd say your cluster might have the wrong tunables
for crush-compat. I know I ran into that when I first set up the balancer
and nothing obviously said that was the problem; it took some digging to
find it.

My real question, though, is why aren't you using upmap? It is
significantly better than crush-compat. Unless you have clients on really
old kernels that can't update or that are on pre-luminous Ceph versions
that can't update, there's really no reason not to use upmap.
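
Switching is only a couple of commands once every client in `ceph features`
reports luminous or newer (a rough sketch; check the output of the first
command before running the rest):

  ceph features
  ceph osd set-require-min-compat-client luminous
  ceph balancer mode upmap
  ceph balancer on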

On Mon, Oct 21, 2019, 8:08 AM Jan Peters  wrote:

> Hello,
>
> I use ceph 12.2.12 and would like to activate the ceph balancer.
>
> unfortunately no redistribution of the PGs is started:
>
> ceph balancer status
> {
> "active": true,
> "plans": [],
> "mode": "crush-compat"
> }
>
> ceph balancer eval
> current cluster score 0.023776 (lower is better)
>
>
> ceph config-key dump
> {
> "initial_mon_keyring":
> "AQBLchlbABAA+5CuVU+8MB69xfc3xAXkjQ==",
> "mgr/balancer/active": "1",
> "mgr/balancer/max_misplaced:": "0.01",
> "mgr/balancer/mode": "crush-compat"
> }
>
>
> What am I not doing correctly?
>
> best regards
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Decreasing the impact of reweighting osds

2019-10-22 Thread David Turner
Most of the time you are better served with simpler settings like
osd_recovery_sleep, which has 3 variants if you have multiple types of OSDs
in your cluster (osd_recovery_sleep_hdd, osd_recovery_sleep_ssd,
osd_recovery_sleep_hybrid).
Using those you can tweak a specific type of OSD that might be having
problems during recovery/backfill while allowing the others to continue to
backfill at regular speeds.
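
For example, to throttle just the spinners at runtime (the values here are
only a starting point, tune them for your hardware):

  ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.1'
  ceph tell osd.* injectargs '--osd_recovery_sleep_hybrid 0.05'
  ceph tell osd.* injectargs '--osd_recovery_sleep_ssd 0'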

Additionally you mentioned reweighting OSDs, but it sounded like you do
this manually. The balancer module, especially in upmap mode, can be
configured quite well to minimize client IO impact while balancing. You can
specify times of day that it can move data (only in UTC, it ignores local
timezones), a threshold of misplaced data that it will stop moving PGs at,
the increment size it will change weights with per operation, how many
weights it will adjust with each pass, etc.
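
As a rough example of that configuration (key names from memory, so verify
them against the balancer module docs for your release; times are UTC):

  ceph config-key set mgr/balancer/begin_time 2300
  ceph config-key set mgr/balancer/end_time 0600
  ceph config-key set mgr/balancer/max_misplaced 0.01
  ceph config-key set mgr/balancer/sleep_interval 60
  ceph mgr fail <active-mgr>   # may be needed for the module to re-read the settings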

On Tue, Oct 22, 2019, 6:07 PM Mark Kirkwood 
wrote:

> Thanks - that's a good suggestion!
>
> However I'd still like to know the answers to my 2 questions.
>
> regards
>
> Mark
>
> On 22/10/19 11:22 pm, Paul Emmerich wrote:
> > getting rid of filestore solves most latency spike issues during
> > recovery because they are often caused by random XFS hangs (splitting
> > dirs or just xfs having a bad day)
> >
> >
> > Paul
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot delete bucket

2019-06-27 Thread David Turner
I'm still going at 452M incomplete uploads. There are guides online for
manually deleting buckets kinda at the RADOS level that tend to leave data
stranded. That doesn't work for what I'm trying to do so I'll keep going
with this and wait for that PR to come through and hopefully help with
bucket deletion.

On Thu, Jun 27, 2019 at 2:58 PM Sergei Genchev  wrote:

> @David Turner
> Did your bucket delete ever finish? I am up to 35M incomplete uploads,
> and I doubt that I actually had that many upload attempts. I could be
> wrong though.
> Is there a way to force bucket deletion, even at the cost of not
> cleaning up space?
>
> On Tue, Jun 25, 2019 at 12:29 PM J. Eric Ivancich 
> wrote:
> >
> > On 6/24/19 1:49 PM, David Turner wrote:
> > > It's aborting incomplete multipart uploads that were left around. First
> > > it will clean up the cruft like that and then it should start actually
> > > deleting the objects visible in stats. That's my understanding of it
> > > anyway. I'm in the middle of cleaning up some buckets right now doing
> > > this same thing. I'm up to `WARNING : aborted 108393000 incomplete
> > > multipart uploads`. This bucket had a client uploading to it constantly
> > > with a very bad network connection.
> >
> > There's a PR to better deal with this situation:
> >
> > https://github.com/ceph/ceph/pull/28724
> >
> > Eric
> >
> > --
> > J. Eric Ivancich
> > he/him/his
> > Red Hat Storage
> > Ann Arbor, Michigan, USA
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot delete bucket

2019-06-24 Thread David Turner
It's aborting incomplete multipart uploads that were left around. First it
will clean up the cruft like that and then it should start actually
deleting the objects visible in stats. That's my understanding of it
anyway. I'm in the middle of cleaning up some buckets right now doing this
same thing. I'm up to `WARNING : aborted 108393000 incomplete multipart
uploads`. This bucket had a client uploading to it constantly with a very
bad network connection.

On Fri, Jun 21, 2019 at 1:13 PM Sergei Genchev  wrote:

>  Hello,
> Trying to delete bucket using radosgw-admin, and failing. Bucket has
> 50K objects but all of them are large. This is what I get:
> $ radosgw-admin bucket rm --bucket=di-omt-mapupdate --purge-objects
> --bypass-gc
> 2019-06-21 17:09:12.424 7f53f621f700  0 WARNING : aborted 1000
> incomplete multipart uploads
> 2019-06-21 17:09:19.966 7f53f621f700  0 WARNING : aborted 2000
> incomplete multipart uploads
> 2019-06-21 17:09:26.819 7f53f621f700  0 WARNING : aborted 3000
> incomplete multipart uploads
> 2019-06-21 17:09:33.430 7f53f621f700  0 WARNING : aborted 4000
> incomplete multipart uploads
> 2019-06-21 17:09:40.304 7f53f621f700  0 WARNING : aborted 5000
> incomplete multipart uploads
>
> Looks like it is trying to delete objects 1000 at a time, as it
> should, but failing. Bucket stats do not change.
>  radosgw-admin bucket stats --bucket=di-omt-mapupdate |jq .usage
> {
>   "rgw.main": {
> "size": 521929247648,
> "size_actual": 521930674176,
> "size_utilized": 400701129125,
> "size_kb": 509696531,
> "size_kb_actual": 509697924,
> "size_kb_utilized": 391309697,
> "num_objects": 50004
>   },
>   "rgw.multimeta": {
> "size": 0,
> "size_actual": 0,
> "size_utilized": 0,
> "size_kb": 0,
> "size_kb_actual": 0,
> "size_kb_utilized": 0,
> "num_objects": 32099
>   }
> }
> How can I get this bucket deleted?
> Thanks!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing the release cadence

2019-06-17 Thread David Turner
This was a little long to respond with on Twitter, so I thought I'd share
my thoughts here. I love the idea of a 12 month cadence. I like October
because admins aren't upgrading production within the first few months of a
new release. It gives it plenty of time to be stable for the OS distros as
well as giving admins something low-key to work on over the holidays with
testing the new releases in stage/QA.

On Mon, Jun 17, 2019 at 12:22 PM Sage Weil  wrote:

> On Wed, 5 Jun 2019, Sage Weil wrote:
> > That brings us to an important decision: what time of year should we
> > release?  Once we pick the timing, we'll be releasing at that time
> *every
> > year* for each release (barring another schedule shift, which we want to
> > avoid), so let's choose carefully!
>
> I've put up a twitter poll:
>
> https://twitter.com/liewegas/status/1140655233430970369
>
> Thanks!
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Default Pools

2019-04-23 Thread David Turner
You should be able to see all pools in use in a RGW zone from the
radosgw-admin command. This [1] is probably overkill for most, but I deal
with multi-realm clusters so I generally think like this when dealing with
RGW.  Running this as is will create a file in your current directory for
each zone in your deployment (likely to be just one file).  My rough guess
for what you would find in that file based on your pool names would be this
[2].

If you identify any pools not listed from the zone get command, then you
can rename [3] the pool to see if it is being created and/or used by rgw
currently.  The process here would be to stop all RGW daemons, rename the
pools, start a RGW daemon, stop it again, and see which pools were
recreated.  Clean up the pools that were freshly made and rename the
original pools back into place before starting your RGW daemons again.
Please note that .rgw.root is a required pool in every RGW deployment and
will not be listed in the zones themselves.


[1]
for realm in $(radosgw-admin realm list --format=json | jq '.realms[]' -r); do
  for zonegroup in $(radosgw-admin --rgw-realm=$realm zonegroup list --format=json | jq '.zonegroups[]' -r); do
    for zone in $(radosgw-admin --rgw-realm=$realm --rgw-zonegroup=$zonegroup zone list --format=json | jq '.zones[]' -r); do
      echo $realm.$zonegroup.$zone.json
      radosgw-admin --rgw-realm=$realm --rgw-zonegroup=$zonegroup --rgw-zone=$zone zone get > $realm.$zonegroup.$zone.json
    done
  done
done

[2] default.default.default.json
{
"id": "{{ UUID }}",
"name": "default",
"domain_root": "default.rgw.meta",
"control_pool": "default.rgw.control",
"gc_pool": ".rgw.gc",
"log_pool": "default.rgw.log",
"user_email_pool": ".users.email",
"user_uid_pool": ".users.uid",
"system_key": {
},
"placement_pools": [
{
"key": "default-placement",
"val": {
"index_pool": "default.rgw.buckets.index",
"data_pool": "default.rgw.buckets.data",
"data_extra_pool": "default.rgw.buckets.non-ec",
"index_type": 0,
"compression": ""
}
}
],
"metadata_heap": "",
"tier_config": [],
"realm_id": "{{ UUID }}"
}

[3] ceph osd pool rename <current-pool-name> <new-pool-name>

On Thu, Apr 18, 2019 at 10:46 AM Brent Kennedy  wrote:

> Yea, that was a cluster created during firefly...
>
> Wish there was a good article on the naming and use of these, or perhaps a
> way I could make sure they are not used before deleting them.  I know RGW
> will recreate anything it uses, but I don’t want to lose data because I
> wanted a clean system.
>
> -Brent
>
> -Original Message-
> From: Gregory Farnum 
> Sent: Monday, April 15, 2019 5:37 PM
> To: Brent Kennedy 
> Cc: Ceph Users 
> Subject: Re: [ceph-users] Default Pools
>
> On Mon, Apr 15, 2019 at 1:52 PM Brent Kennedy  wrote:
> >
> > I was looking around the web for the reason for some of the default
> pools in Ceph and I cant find anything concrete.  Here is our list, some
> show no use at all.  Can any of these be deleted ( or is there an article
> my googlefu failed to find that covers the default pools?
> >
> > We only use buckets, so I took out .rgw.buckets, .users and
> > .rgw.buckets.index…
> >
> > Name
> > .log
> > .rgw.root
> > .rgw.gc
> > .rgw.control
> > .rgw
> > .users.uid
> > .users.email
> > .rgw.buckets.extra
> > default.rgw.control
> > default.rgw.meta
> > default.rgw.log
> > default.rgw.buckets.non-ec
>
> All of these are created by RGW when you run it, not by the core Ceph
> system. I think they're all used (although they may report sizes of 0, as
> they mostly make use of omap).
>
> > metadata
>
> Except this one used to be created-by-default for CephFS metadata, but
> that hasn't been true in many releases. So I guess you're looking at an old
> cluster? (In which case it's *possible* some of those RGW pools are also
> unused now but were needed in the past; I haven't kept good track of them.)
> -Greg
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Osd update from 12.2.11 to 12.2.12

2019-04-22 Thread David Turner
Do you perhaps have anything in the ceph.conf files on the servers with
those OSDs that would attempt to tell the daemon that they are filestore
osds instead of bluestore?  I'm sure you know that the second part [1] of
the output in both cases only shows up after an OSD has been rebooted.  I'm
sure this too could be cleaned up by adding that line to the ceph.conf file.

[1] rocksdb_separate_wal_dir = 'false' (not observed, change may require
restart)
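
i.e. something along these lines in ceph.conf on those hosts (a sketch,
assuming the common [osd] section is used):

  [osd]
  osd_objectstore = bluestore

A quick way to check whether some hosts already override it differently:

  grep -ri objectstore /etc/ceph/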

On Sun, Apr 21, 2019 at 8:32 AM Marc Roos  wrote:

>
>
> Just updated luminous, and setting max_scrubs value back. Why do I get
> osd's reporting differently
>
>
> I get these:
> osd.18: osd_max_scrubs = '1' (not observed, change may require restart)
> osd_objectstore = 'bluestore' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.19: osd_max_scrubs = '1' (not observed, change may require restart)
> osd_objectstore = 'bluestore' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.20: osd_max_scrubs = '1' (not observed, change may require restart)
> osd_objectstore = 'bluestore' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.21: osd_max_scrubs = '1' (not observed, change may require restart)
> osd_objectstore = 'bluestore' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.22: osd_max_scrubs = '1' (not observed, change may require restart)
> osd_objectstore = 'bluestore' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
>
>
> And I get osd's reporting like this:
> osd.23: osd_max_scrubs = '1' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.24: osd_max_scrubs = '1' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.25: osd_max_scrubs = '1' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.26: osd_max_scrubs = '1' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.27: osd_max_scrubs = '1' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
> osd.28: osd_max_scrubs = '1' (not observed, change may require restart)
> rocksdb_separate_wal_dir = 'false' (not observed, change may require
> restart)
>
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd pg-upmap-items not working

2019-03-15 Thread David Turner
Why do you think that it can't resolve this by itself?  You just said that
the balancer was able to provide an optimization, but then that the
distribution isn't perfect.  When there are no further optimizations,
running `ceph balancer optimize plan` won't create a plan with any
changes.  Possibly the active mgr needs a kick.  When my cluster isn't
balancing when it's supposed to, I just run `ceph mgr fail {active mgr}`
and within a minute or so the cluster is moving PGs around.
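
For completeness, the manual sequence looks roughly like this (the plan name
is arbitrary):

  ceph balancer eval
  ceph balancer optimize myplan
  ceph balancer show myplan
  ceph balancer eval myplan
  ceph balancer execute myplan
  ceph mgr fail <active-mgr>   # only if the module seems stuck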

On Sat, Mar 9, 2019 at 8:05 PM Kári Bertilsson 
wrote:

> Thanks
>
> I did apply https://github.com/ceph/ceph/pull/26179.
>
> Running manual upmap commands work now. I did run "ceph balancer optimize
> new"and It did add a few upmaps.
>
> But now another issue. Distribution is far from perfect but the balancer
> can't find further optimization.
> Specifically OSD 23 is getting way more pg's than the other 3tb OSD's.
>
> See https://pastebin.com/f5g5Deak
>
> On Fri, Mar 1, 2019 at 10:25 AM  wrote:
>
>> > Backports should be available in v12.2.11.
>>
>> s/v12.2.11/ v12.2.12/
>>
>> Sorry for the typo.
>>
>>
>>
>>
>> Original message
>> *From:* Xie Xingguo 10072465
>> *To:* d...@vanderster.com ;
>> *Cc:* ceph-users@lists.ceph.com ;
>> *Date:* 2019-03-01 17:09
>> *Subject:* *Re: [ceph-users] ceph osd pg-upmap-items not working*
>>
>> See https://github.com/ceph/ceph/pull/26179
>>
>> Backports should be available in v12.2.11.
>>
>> Or you can manually do it by simply adopting
>> https://github.com/ceph/ceph/pull/26127 if you are eager to get out of
>> the trap right now.
>>
>> *From:* Dan van der Ster 
>> *To:* Kári Bertilsson ;
>> *Cc:* ceph-users ; Xie Xingguo 10072465;
>> *Date:* 2019-03-01 14:48
>> *Subject:* *Re: [ceph-users] ceph osd pg-upmap-items not working*
>> It looks like that somewhat unusual crush rule is confusing the new
>> upmap cleaning.
>> (debug_mon 10 on the active mon should show those cleanups).
>>
>>
>> I'm copying Xie Xingguo, and probably you should create a tracker for this.
>>
>> -- dan
>>
>>
>>
>>
>> On Fri, Mar 1, 2019 at 3:12 AM Kári Bertilsson > > wrote:
>> >
>> > This is the pool
>>
>> > pool 41 'ec82_pool' erasure size 10 min_size 8 crush_rule 1 object_hash 
>> > rjenkins pg_num 512 pgp_num 512 last_change 63794 lfor 21731/21731 flags 
>> > hashpspool,ec_overwrites stripe_width 32768 application cephfs
>> >removed_snaps [1~5]
>> >
>> > Here is the relevant crush rule:
>>
>> > rule ec_pool { id 1 type erasure min_size 3 max_size 10 step 
>> > set_chooseleaf_tries 5 step set_choose_tries 100 step take default class 
>> > hdd step choose indep 5 type host step choose indep 2 type osd step emit }
>> >
>>
>> > Both OSD 23 and 123 are in the same host. So this change should be 
>> > perfectly acceptable by the rule set.
>>
>> > Something must be blocking the change, but i can't find anything about it 
>> > in any logs.
>> >
>> > - Kári
>> >
>> > On Thu, Feb 28, 2019 at 8:07 AM Dan van der Ster > > wrote:
>> >>
>> >> Hi,
>> >>
>> >> pg-upmap-items became more strict in v12.2.11 when validating upmaps.
>> >> E.g., it now won't let you put two PGs in the same rack if the crush
>> >> rule doesn't allow it.
>> >>
>>
>> >> Where are OSDs 23 and 123 in your cluster? What is the relevant crush 
>> >> rule?
>> >>
>> >> -- dan
>> >>
>> >>
>> >> On Wed, Feb 27, 2019 at 9:17 PM Kári Bertilsson > > wrote:
>> >> >
>> >> > Hello
>> >> >
>>
>> >> > I am trying to diagnose why upmap stopped working where it was 
>> >> > previously working fine.
>> >> >
>> >> > Trying to move pg 41.1 to 123 has no effect and seems to be ignored.
>> >> >
>> >> > # ceph osd pg-upmap-items 41.1 23 123
>> >> > set 41.1 pg_upmap_items mapping to [23->123]
>> >> >
>>
>> >> > No rebalacing happens and if i run it again it shows the same output 
>> >> > every time.
>> >> >
>> >> > I have in config
>> >> > debug mgr = 4/5
>> >> > debug mon = 4/5
>> >> >
>> >> > Paste from mon & mgr logs. Also output from "ceph osd dump"
>> >> > https://pastebin.com/9VrT4YcU
>> >> >
>> >> >
>>
>> >> > I have run "ceph osd set-require-min-compat-client luminous" long time 
>> >> > ago. And all servers running ceph have been rebooted numerous times 
>> >> > since then.
>>
>> >> > But someho

Re: [ceph-users] OpenStack with Ceph RDMA

2019-03-11 Thread David Turner
I can't speak to the rdma portion. But to clear up what each of these
does... the cluster network carries only traffic between the osds:
replication writes, EC reads, and backfill/recovery io. Mons, mds, rgw,
and osds talking with clients all happen on the public network. The
general consensus has been to not split the two networks, except maybe by
vlans for separate statistics and graphing. Even if you were running out
of bandwidth, just upgrade the dual interfaces instead of segregating the
networks physically.

On Sat, Mar 9, 2019, 11:10 AM Lazuardi Nasution 
wrote:

> Hi,
>
> I'm looking for information about where is the RDMA messaging of Ceph
> happen, on cluster network, public network or both (it seem both, CMIIW)?
> I'm talking about configuration of ms_type, ms_cluster_type and
> ms_public_type.
>
> In case of OpenStack integration with RBD, which of above three is
> possible? In this case, should I still separate cluster network and public
> network?
>
> Best regards,
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] priorize degraged objects than misplaced

2019-03-11 Thread David Turner
Ceph has been getting better and better about prioritizing this sort of
recovery, but few of those optimizations are in Jewel, which has been out
of the support cycle for about a year. You should look into upgrading to
Mimic, where you should see a pretty good improvement in this sort of
prioritization.

On Sat, Mar 9, 2019, 3:10 PM Fabio Abreu  wrote:

> HI Everybody,
>
> I have a doubt about degraded objects in the Jewel 10.2.7 version, can I
> priorize the degraded objects than misplaced?
>
> I asking this because I try simulate a disaster recovery scenario.
>
>
> Thanks and best regards,
> Fabio Abreu Reis
> http://fajlinux.com.br
> *Tel : *+55 21 98244-0161
> *Skype : *fabioabreureis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH ISCSI Gateway

2019-03-11 Thread David Turner
The problem with clients on osd nodes applies to kernel clients only. That's
true of krbd and the kernel client for cephfs. The only other reason not to
run other Ceph daemons on the same node as osds is resource contention
if you're running at high CPU and memory utilization.

On Sat, Mar 9, 2019, 10:15 PM Mike Christie  wrote:

> On 03/07/2019 09:22 AM, Ashley Merrick wrote:
> > Been reading into the gateway, and noticed it’s been mentioned a few
> > times it can be installed on OSD servers.
> >
> > I am guessing therefore there be no issues like is sometimes mentioned
> > when using kRBD on a OSD node apart from the extra resources required
> > from the hardware.
> >
>
> That is correct. You might have a similar issue if you were to run the
> iscsi gw/target, OSD and then also run the iscsi initiator that logs
> into the iscsi gw/target all on the same node. I don't think any use
> case like that has ever come up though.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd unmap fails with error: rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy

2019-03-01 Thread David Turner
True, but not before you unmap it from the previous server. It's like
physically connecting a hard drive to two servers at the same time: neither
knows what the other is doing to it, and that can corrupt your data. You
should always make sure to unmap an rbd before mapping it to another server.
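
Roughly, the safe hand-off looks like this (the mount point is a placeholder;
the image name is the one from this thread):

  # on the old server
  umount /mnt/backup
  rbd unmap /dev/rbd0

  # on the new server
  rbd map hdb-backup/ld2110     # prints the device it mapped to
  mount /dev/rbd0 /mnt/backup   # use the device printed above

The exclusive-lock image feature can also help guard against two clients
writing at the same time.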

On Fri, Mar 1, 2019, 6:28 PM solarflow99  wrote:

> It has to be mounted from somewhere, if that server goes offline, you need
> to mount it from somewhere else right?
>
>
> On Thu, Feb 28, 2019 at 11:15 PM David Turner 
> wrote:
>
>> Why are you mapping the same rbd to multiple servers?
>>
>> On Wed, Feb 27, 2019, 9:50 AM Ilya Dryomov  wrote:
>>
>>> On Wed, Feb 27, 2019 at 12:00 PM Thomas <74cmo...@gmail.com> wrote:
>>> >
>>> > Hi,
>>> > I have noticed an error when writing to a mapped RBD.
>>> > Therefore I unmounted the block device.
>>> > Then I tried to unmap it w/o success:
>>> > ld2110:~ # rbd unmap /dev/rbd0
>>> > rbd: sysfs write failed
>>> > rbd: unmap failed: (16) Device or resource busy
>>> >
>>> > The same block device is mapped on another client and there are no
>>> issues:
>>> > root@ld4257:~# rbd info hdb-backup/ld2110
>>> > rbd image 'ld2110':
>>> > size 7.81TiB in 2048000 objects
>>> > order 22 (4MiB objects)
>>> > block_name_prefix: rbd_data.3cda0d6b8b4567
>>> > format: 2
>>> > features: layering
>>> > flags:
>>> > create_timestamp: Fri Feb 15 10:53:50 2019
>>> > root@ld4257:~# rados -p hdb-backup  listwatchers
>>> rbd_data.3cda0d6b8b4567
>>> > error listing watchers hdb-backup/rbd_data.3cda0d6b8b4567: (2) No such
>>> > file or directory
>>> > root@ld4257:~# rados -p hdb-backup  listwatchers
>>> rbd_header.3cda0d6b8b4567
>>> > watcher=10.76.177.185:0/1144812735 client.21865052 cookie=1
>>> > watcher=10.97.206.97:0/4023931980 client.18484780
>>> > cookie=18446462598732841027
>>> >
>>> >
>>> > Question:
>>> > How can I force to unmap the RBD on client ld2110 (= 10.76.177.185)?
>>>
>>> Hi Thomas,
>>>
>>> It appears that /dev/rbd0 is still open on that node.
>>>
>>> Was the unmount successful?  Which filesystem (ext4, xfs, etc)?
>>>
>>> What is the output of "ps aux | grep rbd" on that node?
>>>
>>> Try lsof, fuser, check for LVM volumes and multipath -- these have been
>>> reported to cause this issue previously:
>>>
>>>   http://tracker.ceph.com/issues/12763
>>>
>>> Thanks,
>>>
>>> Ilya
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic 13.2.4 rbd du slowness

2019-02-28 Thread David Turner
Have you used strace on the du command to see what it's spending its time
doing?
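
Something like this would at least show whether the time is going to network
round-trips or to local work (pool/image names are placeholders):

  strace -f -T -tt -o /tmp/rbd-du.trace rbd du rbd/bigimage
  strace -f -c rbd du rbd/bigimage    # per-syscall count/time summary

If the -c summary shows very little time spent in syscalls, that points back
at the client-side CPU you're already seeing.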

On Thu, Feb 28, 2019, 8:45 PM Glen Baars 
wrote:

> Hello Wido,
>
> The cluster layout is as follows:
>
> 3 x Monitor hosts ( 2 x 10Gbit bonded )
> 9 x OSD hosts (
> 2 x 10Gbit bonded,
> LSI cachecade and write cache drives set to single,
> All HDD in this pool,
> no separate DB / WAL. With the write cache and the SSD read cache on the
> LSI card it seems to perform well.
> 168 OSD disks
>
> No major increase in OSD disk usage or CPU usage. The RBD DU process uses
> 100% of a single 2.4Ghz core while running - I think that is the limiting
> factor.
>
> I have just tried removing most of the snapshots for that volume ( from 14
> snapshots down to 1 snapshot ) and the rbd du command now takes around 2-3
> minutes.
>
> Kind regards,
> Glen Baars
>
> -Original Message-
> From: Wido den Hollander 
> Sent: Thursday, 28 February 2019 5:05 PM
> To: Glen Baars ; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Mimic 13.2.4 rbd du slowness
>
>
>
> On 2/28/19 9:41 AM, Glen Baars wrote:
> > Hello Wido,
> >
> > I have looked at the libvirt code and there is a check to ensure that
> fast-diff is enabled on the image and only then does it try to get the real
> disk usage. The issue for me is that even with fast-diff enabled it takes
> 25min to get the space usage for a 50TB image.
> >
> > I had considered turning off fast-diff on the large images to get
> > around to issue but I think that will hurt my snapshot removal times (
> > untested )
> >
>
> Can you tell a bit more about the Ceph cluster? HDD? SSD? DB and WAL on
> SSD?
>
> Do you see OSDs spike in CPU or Disk I/O when you do a 'rbd du' on these
> images?
>
> Wido
>
> > I can't see in the code any other way of bypassing the disk usage check
> but I am not that familiar with the code.
> >
> > ---
> > if (volStorageBackendRBDUseFastDiff(features)) {
> > VIR_DEBUG("RBD image %s/%s has fast-diff feature enabled. "
> >   "Querying for actual allocation",
> >   def->source.name, vol->name);
> >
> > if (virStorageBackendRBDSetAllocation(vol, image, &info) < 0)
> > goto cleanup;
> > } else {
> > vol->target.allocation = info.obj_size * info.num_objs; }
> > --
> >
> > Kind regards,
> > Glen Baars
> >
> > -Original Message-
> > From: Wido den Hollander 
> > Sent: Thursday, 28 February 2019 3:49 PM
> > To: Glen Baars ;
> > ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Mimic 13.2.4 rbd du slowness
> >
> >
> >
> > On 2/28/19 2:59 AM, Glen Baars wrote:
> >> Hello Ceph Users,
> >>
> >> Has anyone found a way to improve the speed of the rbd du command on
> large rbd images? I have object map and fast diff enabled - no invalid
> flags on the image or it's snapshots.
> >>
> >> We recently upgraded our Ubuntu 16.04 KVM servers for Cloudstack to
> Ubuntu 18.04. The upgrades libvirt to version 4. When libvirt 4 adds an rbd
> pool it discovers all images in the pool and tries to get their disk usage.
> We are seeing a 50TB image take 25min. The pool has over 300TB of images in
> it and takes hours for libvirt to start.
> >>
> >
> > This is actually a pretty bad thing imho. As a lot of images people will
> be using do not have fast-diff enabled (images from the past) and that will
> kill their performance.
> >
> > Isn't there a way to turn this off in libvirt?
> >
> > Wido
> >
> >> We can replicate the issue without libvirt by just running a rbd du on
> the large images. The limiting factor is the cpu on the rbd du command, it
> uses 100% of a single core.
> >>
> >> Our cluster is completely bluestore/mimic 13.2.4. 168 OSDs, 12 Ubuntu
> 16.04 hosts.
> >>
> >> Kind regards,
> >> Glen Baars
> >> This e-mail is intended solely for the benefit of the addressee(s) and
> any other named recipient. It is confidential and may contain legally
> privileged or confidential information. If you are not the recipient, any
> use, distribution, disclosure or copying of this e-mail is prohibited. The
> confidentiality and legal privilege attached to this communication is not
> waived or lost by reason of the mistaken transmission or delivery to you.
> If you have received this e-mail in error, please notify us immediately.
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > This e-mail is intended solely for the benefit of the addressee(s) and
> any other named recipient. It is confidential and may contain legally
> privileged or confidential information. If you are not the recipient, any
> use, distribution, disclosure or copying of this e-mail is prohibited. The
> confidentiality and legal privilege attached to this communication is not
> waived or lost by reason of the mistaken transmission or delivery to you.
> If you have received this e-mail

Re: [ceph-users] rbd unmap fails with error: rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy

2019-02-28 Thread David Turner
Why are you mapping the same rbd to multiple servers?

On Wed, Feb 27, 2019, 9:50 AM Ilya Dryomov  wrote:

> On Wed, Feb 27, 2019 at 12:00 PM Thomas <74cmo...@gmail.com> wrote:
> >
> > Hi,
> > I have noticed an error when writing to a mapped RBD.
> > Therefore I unmounted the block device.
> > Then I tried to unmap it w/o success:
> > ld2110:~ # rbd unmap /dev/rbd0
> > rbd: sysfs write failed
> > rbd: unmap failed: (16) Device or resource busy
> >
> > The same block device is mapped on another client and there are no
> issues:
> > root@ld4257:~# rbd info hdb-backup/ld2110
> > rbd image 'ld2110':
> > size 7.81TiB in 2048000 objects
> > order 22 (4MiB objects)
> > block_name_prefix: rbd_data.3cda0d6b8b4567
> > format: 2
> > features: layering
> > flags:
> > create_timestamp: Fri Feb 15 10:53:50 2019
> > root@ld4257:~# rados -p hdb-backup  listwatchers rbd_data.3cda0d6b8b4567
> > error listing watchers hdb-backup/rbd_data.3cda0d6b8b4567: (2) No such
> > file or directory
> > root@ld4257:~# rados -p hdb-backup  listwatchers
> rbd_header.3cda0d6b8b4567
> > watcher=10.76.177.185:0/1144812735 client.21865052 cookie=1
> > watcher=10.97.206.97:0/4023931980 client.18484780
> > cookie=18446462598732841027
> >
> >
> > Question:
> > How can I force to unmap the RBD on client ld2110 (= 10.76.177.185)?
>
> Hi Thomas,
>
> It appears that /dev/rbd0 is still open on that node.
>
> Was the unmount successful?  Which filesystem (ext4, xfs, etc)?
>
> What is the output of "ps aux | grep rbd" on that node?
>
> Try lsof, fuser, check for LVM volumes and multipath -- these have been
> reported to cause this issue previously:
>
>   http://tracker.ceph.com/issues/12763
>
> Thanks,
>
> Ilya
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG Calculations Issue

2019-02-28 Thread David Turner
Those numbers look right for a pool only containing 10% of your data. Now
continue to calculate the pg counts for the remaining 90% of your data.
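
Roughly, assuming the remaining ~90% of the data lives in a single pool, the
64 OSD case would work out to:

  (100 * 64 * 0.90) / 3 = 1920  ->  nearest power of 2 = 2048 PGs

and 2048 plus the 32-64 PGs for the small pool, spread across 64 OSDs at
size 3, lands right around the 100 PGs-per-OSD target, which is the point of
the calculator.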

On Wed, Feb 27, 2019, 12:17 PM Krishna Venkata 
wrote:

> Greetings,
>
>
> I am having issues in the way PGs are calculated in
> https://ceph.com/pgcalc/ [Ceph PGs per Pool Calculator ] and the formulae
> mentioned in the site.
>
> Below are my findings
>
> The formula to calculate PGs as mentioned in the https://ceph.com/pgcalc/
>  :
>
> 1.  Need to pick the highest value from either of the formulas
>
> *(( Target PGs per OSD ) x ( OSD # ) x ( %Data ))/(size)*
>
> Or
>
> *( OSD# ) / ( Size )*
>
> 2.  The output value is then rounded to the nearest power of 2
>
>1. If the nearest power of 2 is more than 25% below the original
>value, the next higher power of 2 is used.
>
>
>
> Based on the above procedure, we calculated PGs for 25, 32 and 64 OSDs
>
> *Our Dataset:*
>
> *%Data:* 0.10
>
> *Target PGs per OSD:* 100
>
> *OSDs* 25, 32 and 64
>
>
>
> *For 25 OSDs*
>
>
>
> (100*25* (0.10/100))/(3) = 0.833
>
>
>
> ( 25 ) / ( 3 ) = 8.33
>
>
>
> 1. Raw pg num 8.33  ( Since we need to pick the highest of (0.833, 8.33))
>
> 2. max pg 16 ( For, 8.33 the nearest power of 2 is 16)
>
> 3. 16 > 2.08  ( 25 % of 8.33 is 2.08 which is more than 25% the power of 2)
>
>
>
> So 16 PGs
>
> ✓  GUI Calculator gives the same value and matches with Formula.
>
>
>
> *For 32 OSD*
>
>
>
> (100*32*(0.10/100))/3 = 1.066
>
> ( 32 ) / ( 3 ) = 10.66
>
>
>
> 1. Raw pg num 10.66 ( Since we need to pick the highest of (1.066, 10.66))
>
> 2. max pg 16 ( For, 10.66 the nearest power of 2 is 16)
>
> 3.  16 > 2.655 ( 25 % of 10.66 is 2.655 which is more than 25% the power
> of 2)
>
>
>
> So 16 PGs
>
> ✗  GUI Calculator gives different value (32 PGs) which doesn’t match with
> Formula.
>
>
>
> *For 64 OSD*
>
>
>
> (100 * 64 * (0.10/100))/3 = 2.133
>
> ( 64 ) / ( 3 ) 21.33
>
>
>
> 1. Raw pg num 21.33 ( Since we need to pick the highest of (2.133, 21.33))
>
> 2. max pg 32 ( For, 21.33 the nearest power of 2 is 32)
>
> 3. 32 > 5.3325 ( 25 % of 21.33 is 5.3325 which is more than 25% the power
> of 2)
>
>
>
> So 32 PGs
>
> ✗  GUI Calculator gives different value (64 PGs) which doesn’t match with
> Formula.
>
>
>
> We checked the PG calculator logic from [
> https://ceph.com/pgcalc_assets/pgcalc.js ] which is not matching from
> above formulae.
>
>
>
> Can someone Guide/reference us to correct formulae to calculate PGs.
>
>
>
> Thanks in advance.
>
>
>
> Regards,
>
> Krishna Venkata
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] redirect log to syslog and disable log to stderr

2019-02-28 Thread David Turner
You can always set it in your ceph.conf file and restart the mgr daemon.
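
Something along these lines works (a sketch; the mgr instance name is usually
the short hostname of the box it runs on):

  # /etc/ceph/ceph.conf on the mgr host
  [mgr]
  log to stderr = false

  # then restart that mgr
  systemctl restart ceph-mgr@$(hostname -s)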

On Tue, Feb 26, 2019, 1:30 PM Alex Litvak 
wrote:

> Dear Cephers,
>
> In mimic 13.2.2
> ceph tell mgr.* injectargs --log-to-stderr=false
> Returns an error (no valid command found ...).  What is the correct way to
> inject mgr configuration values?
>
> The same command works on mon
>
> ceph tell mon.* injectargs --log-to-stderr=false
>
>
> Thank you in advance,
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Right way to delete OSD from cluster?

2019-02-28 Thread David Turner
The reason is that an osd still contributes to the host weight in the crush
map even while it is marked out. When you out and then purge, the purging
operation removes the osd from the map and changes the weight of the host,
which changes the crush map and moves data again. By weighting the osd to
0.0 first, the host's weight is already the same as it will be when you
purge the osd. Weighting to 0.0 is definitely the best option for removing
storage if you can trust the data on the osd being removed.
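
The full sequence ends up being something like this (X is the OSD id; wait
for recovery to finish and the cluster to be healthy before the purge):

  ceph osd crush reweight osd.X 0
  # ... wait for the rebalance to complete (watch ceph -s) ...
  ceph osd out X
  systemctl stop ceph-osd@X
  ceph osd purge X --yes-i-really-mean-it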

On Tue, Feb 26, 2019, 3:19 AM Fyodor Ustinov  wrote:

> Hi!
>
> Thank you so much!
>
> I do not understand why, but your variant really causes only one rebalance
> compared to the "osd out".
>
> - Original Message -
> From: "Scottix" 
> To: "Fyodor Ustinov" 
> Cc: "ceph-users" 
> Sent: Wednesday, 30 January, 2019 20:31:32
> Subject: Re: [ceph-users] Right way to delete OSD from cluster?
>
> I generally have gone the crush reweight 0 route
> This way the drive can participate in the rebalance, and the rebalance
> only happens once. Then you can take it out and purge.
>
> If I am not mistaken this is the safest.
>
> ceph osd crush reweight  0
>
> On Wed, Jan 30, 2019 at 7:45 AM Fyodor Ustinov  wrote:
> >
> > Hi!
> >
> > But unless after "ceph osd crush remove" I will not got the undersized
> objects? That is, this is not the same thing as simply turning off the OSD
> and waiting for the cluster to be restored?
> >
> > - Original Message -
> > From: "Wido den Hollander" 
> > To: "Fyodor Ustinov" , "ceph-users" <
> ceph-users@lists.ceph.com>
> > Sent: Wednesday, 30 January, 2019 15:05:35
> > Subject: Re: [ceph-users] Right way to delete OSD from cluster?
> >
> > On 1/30/19 2:00 PM, Fyodor Ustinov wrote:
> > > Hi!
> > >
> > > I thought I should first do "ceph osd out", wait for the end
> relocation of the misplaced objects and after that do "ceph osd purge".
> > > But after "purge" the cluster starts relocation again.
> > >
> > > Maybe I'm doing something wrong? Then what is the correct way to
> delete the OSD from the cluster?
> > >
> >
> > You are not doing anything wrong, this is the expected behavior. There
> > are two CRUSH changes:
> >
> > - Marking it out
> > - Purging it
> >
> > You could do:
> >
> > $ ceph osd crush remove osd.X
> >
> > Wait for all good
> >
> > $ ceph osd purge X
> >
> > The last step should then not initiate any data movement.
> >
> > Wido
> >
> > > WBR,
> > > Fyodor.
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> T: @Thaumion
> IG: Thaumion
> scot...@gmail.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Usenix Vault 2019

2019-02-24 Thread David Turner
There is a scheduled birds of a feather for Ceph tomorrow night, but I also
noticed that there are only trainings tomorrow. Unless you are paying more
for those, you likely don't have much to do on Monday. That's the boat I'm
in. Is anyone interested in getting together tomorrow in Boston during the
training day?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Configuration about using nvme SSD

2019-02-24 Thread David Turner
One thing that's worked for me to get more out of nvmes with Ceph is to
create multiple partitions on the nvme with an osd on each partition. That
way you get more osd processes and CPU per nvme device. I've heard of
people using up to 4 partitions like this.
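
On newer ceph-volume releases the easy way to do that is the batch
subcommand (assuming it is available on your version; otherwise partition
the device manually and run one `ceph-volume lvm create --data <partition>`
per piece):

  ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1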

On Sun, Feb 24, 2019, 10:25 AM Vitaliy Filippov  wrote:

> > We can get 513558 IOPS in 4K read per nvme by fio but only 45146 IOPS
> > per OSD.by rados.
>
> Don't expect Ceph to fully utilize NVMe's, it's software and it's slow :)
> some colleagues tell that SPDK works out of the box, but almost doesn't
> increase performance, because the userland-kernel interaction isn't the
> bottleneck currently, it's Ceph code itself. I also tried once, but I
> couldn't make it work. When I have some spare NVMe's I'll make another
> attempt.
>
> So... try it and share your results here :) we're all interested.
>
> --
> With best regards,
>Vitaliy Filippov
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Doubts about backfilling performance

2019-02-23 Thread David Turner
Jewel is really limited in the settings you can tweak for backfilling [1].
Luminous and Mimic have a few more knobs. One option you do have, though,
is osd_crush_initial_weight, found here [2]. With this setting the initial
crush weight for new osds is 0.0, and you gradually increase it to what you
want it to be. This doesn't help with already-added osds, but it can help
in the future.


[1]
http://docs.ceph.com/docs/jewel/rados/configuration/osd-config-ref/#backfilling
[2] http://docs.ceph.com/docs/jewel/rados/configuration/pool-pg-config-ref/
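
A sketch of how that looks in practice (the OSD id and the intermediate
weights are placeholders; the final weight is normally the drive size in
TiB):

  # ceph.conf on the OSD hosts, set before creating the new OSDs
  [osd]
  osd_crush_initial_weight = 0

  # then raise the weight in steps, letting recovery settle in between
  ceph osd crush reweight osd.X 0.5
  ceph osd crush reweight osd.X 1.5
  ceph osd crush reweight osd.X 3.64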

On Sat, Feb 23, 2019, 6:08 AM Fabio Abreu  wrote:

> Hello everybody,
>
> I try to improve the backfilling proccess without impact my client I/O,
> that is a painfull thing  when i putted a new osd in my environment.
>
> I look some options like osd backfill scan max , Can I improve the
> performance if I reduce this ?
>
> Someome recommend parameter to study in my scenario.
>
> My environment is jewel 10.2.7 .
>
> Best Regards,
> Fabio Abreu
> --
> Atenciosamente,
> Fabio Abreu Reis
> http://fajlinux.com.br
> *Tel : *+55 21 98244-0161
> *Skype : *fabioabreureis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread David Turner
Mon disks don't have journals, they're just a folder on a filesystem on a
disk.

On Fri, Feb 22, 2019, 6:40 AM M Ranga Swami Reddy 
wrote:

> ceph mons looks fine during the recovery.  Using  HDD with SSD
> journals. with recommeded CPU and RAM numbers.
>
> On Fri, Feb 22, 2019 at 4:40 PM David Turner 
> wrote:
> >
> > What about the system stats on your mons during recovery? If they are
> having a hard time keeping up with requests during a recovery, I could see
> that impacting client io. What disks are they running on? CPU? Etc.
> >
> > On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy 
> wrote:
> >>
> >> Debug setting defaults are using..like 1/5 and 0/5 for almost..
> >> Shall I try with 0 for all debug settings?
> >>
> >> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius 
> wrote:
> >> >
> >> > Hello,
> >> >
> >> >
> >> > Check your CPU usage when you are doing those kind of operations. We
> >> > had a similar issue where our CPU monitoring was reporting fine < 40%
> >> > usage, but our load on the nodes was high mid 60-80. If it's possible
> >> > try disabling ht and see the actual cpu usage.
> >> > If you are hitting CPU limits you can try disabling crc on messages.
> >> > ms_nocrc
> >> > ms_crc_data
> >> > ms_crc_header
> >> >
> >> > And setting all your debug messages to 0.
> >> > If you haven't done you can also lower your recovery settings a
> little.
> >> > osd recovery max active
> >> > osd max backfills
> >> >
> >> > You can also lower your file store threads.
> >> > filestore op threads
> >> >
> >> >
> >> > If you can also switch to bluestore from filestore. This will also
> >> > lower your CPU usage. I'm not sure that this is bluestore that does
> >> > it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
> >> > compared to filestore + leveldb .
> >> >
> >> >
> >> > On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
> >> >  wrote:
> >> > >
> >> > > Thats expected from Ceph by design. But in our case, we are using
> all
> >> > > recommendation like rack failure domain, replication n/w,etc, still
> >> > > face client IO performance issues during one OSD down..
> >> > >
> >> > > On Tue, Feb 19, 2019 at 10:56 PM David Turner <
> drakonst...@gmail.com> wrote:
> >> > > >
> >> > > > With a RACK failure domain, you should be able to have an entire
> rack powered down without noticing any major impact on the clients.  I
> regularly take down OSDs and nodes for maintenance and upgrades without
> seeing any problems with client IO.
> >> > > >
> >> > > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy <
> swamire...@gmail.com> wrote:
> >> > > >>
> >> > > >> Hello - I have a couple of questions on ceph cluster stability,
> even
> >> > > >> we follow all recommendations as below:
> >> > > >> - Having separate replication n/w and data n/w
> >> > > >> - RACK is the failure domain
> >> > > >> - Using SSDs for journals (1:4ratio)
> >> > > >>
> >> > > >> Q1 - If one OSD down, cluster IO down drastically and customer
> Apps impacted.
> >> > > >> Q2 - what is stability ratio, like with above, is ceph cluster
> >> > > >> workable condition, if one osd down or one node down,etc.
> >> > > >>
> >> > > >> Thanks
> >> > > >> Swami
> >> > > >> ___
> >> > > >> ceph-users mailing list
> >> > > >> ceph-users@lists.ceph.com
> >> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> > > ___
> >> > > ceph-users mailing list
> >> > > ceph-users@lists.ceph.com
> >> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] REQUEST_SLOW across many OSDs at the same time

2019-02-22 Thread David Turner
Can you correlate the times to scheduled tasks inside any of the VMs? For
instance, if you have several Linux VMs with the updatedb command installed,
by default they will all scan their disks at the same time each day to
index files. Other common culprits could be scheduled
backups, db cleanup, etc. Do you track cluster io at all? When I first
configured a graphing tool on my home cluster I found the updatedb/locate
command happening with a drastic io spike at the same time every day. I
also found a spike when a couple Windows VMs were checking for updates
automatically.

On Fri, Feb 22, 2019, 4:28 AM mart.v  wrote:

> Hello everyone,
>
> I'm experiencing a strange behaviour. My cluster is relatively small (43
> OSDs, 11 nodes), running Ceph 12.2.10 (and Proxmox 5). Nodes are connected
> via 10 Gbit network (Nexus 6000). Cluster is mixed (SSD and HDD), but with
> different pools. Descibed error is only on the SSD part of the cluster.
>
> I noticed that few times a day the cluster slows down a bit and I have
> discovered this in logs:
>
> 2019-02-22 08:21:20.064396 mon.node1 mon.0 172.16.254.101:6789/0 1794159
> : cluster [WRN] Health check failed: 27 slow requests are blocked > 32 sec.
> Implicated osds 10,22,33 (REQUEST_SLOW)
> 2019-02-22 08:21:26.589202 mon.node1 mon.0 172.16.254.101:6789/0 1794169
> : cluster [WRN] Health check update: 199 slow requests are blocked > 32
> sec. Implicated osds 0,4,5,6,7,8,9,10,12,16,17,19,20,21,22,25,26,33,41
> (REQUEST_SLOW)
> 2019-02-22 08:21:32.655671 mon.node1 mon.0 172.16.254.101:6789/0 1794183
> : cluster [WRN] Health check update: 448 slow requests are blocked > 32
> sec. Implicated osds
> 0,3,4,5,6,7,8,9,10,12,15,16,17,19,20,21,22,24,25,26,33,41 (REQUEST_SLOW)
> 2019-02-22 08:21:38.744210 mon.node1 mon.0 172.16.254.101:6789/0 1794210
> : cluster [WRN] Health check update: 388 slow requests are blocked > 32
> sec. Implicated osds 4,8,10,16,24,33 (REQUEST_SLOW)
> 2019-02-22 08:21:42.790346 mon.node1 mon.0 172.16.254.101:6789/0 1794214
> : cluster [INF] Health check cleared: REQUEST_SLOW (was: 18 slow requests
> are blocked > 32 sec. Implicated osds 8,16)
>
> "ceph health detail" shows nothing more
>
> It is happening through the whole day and the times can't be linked to any
> read or write intensive task (e.g. backup). I also tried to disable
> scrubbing, but it kept on going. These errors were not there since
> beginning, but unfortunately I cannot track the day they started (it is
> beyond my logs).
>
> Any ideas?
>
> Thank you!
> Martin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread David Turner
What about the system stats on your mons during recovery? If they are
having a hard time keeping up with requests during a recovery, I could see
that impacting client io. What disks are they running on? CPU? Etc.

On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy 
wrote:

> Debug setting defaults are using..like 1/5 and 0/5 for almost..
> Shall I try with 0 for all debug settings?
>
> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius 
> wrote:
> >
> > Hello,
> >
> >
> > Check your CPU usage when you are doing those kind of operations. We
> > had a similar issue where our CPU monitoring was reporting fine < 40%
> > usage, but our load on the nodes was high mid 60-80. If it's possible
> > try disabling ht and see the actual cpu usage.
> > If you are hitting CPU limits you can try disabling crc on messages.
> > ms_nocrc
> > ms_crc_data
> > ms_crc_header
> >
> > And setting all your debug messages to 0.
> > If you haven't done you can also lower your recovery settings a little.
> > osd recovery max active
> > osd max backfills
> >
> > You can also lower your file store threads.
> > filestore op threads
> >
> >
> > If you can also switch to bluestore from filestore. This will also
> > lower your CPU usage. I'm not sure that this is bluestore that does
> > it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
> > compared to filestore + leveldb .
> >
> >
> > On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
> >  wrote:
> > >
> > > Thats expected from Ceph by design. But in our case, we are using all
> > > recommendation like rack failure domain, replication n/w,etc, still
> > > face client IO performance issues during one OSD down..
> > >
> > > On Tue, Feb 19, 2019 at 10:56 PM David Turner 
> wrote:
> > > >
> > > > With a RACK failure domain, you should be able to have an entire
> rack powered down without noticing any major impact on the clients.  I
> regularly take down OSDs and nodes for maintenance and upgrades without
> seeing any problems with client IO.
> > > >
> > > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy <
> swamire...@gmail.com> wrote:
> > > >>
> > > >> Hello - I have a couple of questions on ceph cluster stability, even
> > > >> we follow all recommendations as below:
> > > >> - Having separate replication n/w and data n/w
> > > >> - RACK is the failure domain
> > > >> - Using SSDs for journals (1:4ratio)
> > > >>
> > > >> Q1 - If one OSD down, cluster IO down drastically and customer Apps
> impacted.
> > > >> Q2 - what is stability ratio, like with above, is ceph cluster
> > > >> workable condition, if one osd down or one node down,etc.
> > > >>
> > > >> Thanks
> > > >> Swami
> > > >> ___
> > > >> ceph-users mailing list
> > > >> ceph-users@lists.ceph.com
> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] faster switch to another mds

2019-02-20 Thread David Turner
If I'm not mistaken, when you stop both at the same time during a reboot of
a node with both mds and mon, the mons might receive the MDS's shutdown
message but wait to finish their own election before doing anything about
it.  If you're trying to keep optimal uptime for your mds, then stopping it
first and on its own makes sense.

On Wed, Feb 20, 2019 at 3:46 PM Patrick Donnelly 
wrote:

> On Tue, Feb 19, 2019 at 11:39 AM Fyodor Ustinov  wrote:
> >
> > Hi!
> >
> > From documentation:
> >
> > mds beacon grace
> > Description:The interval without beacons before Ceph declares an MDS
> laggy (and possibly replace it).
> > Type:   Float
> > Default:15
> >
> > I do not understand, 15 - are is seconds or beacons?
>
> seconds
>
> > And an additional misunderstanding - if we gently turn off the MDS (or
> MON), why it does not inform everyone interested before death - "I am
> turned off, no need to wait, appoint a new active server"
>
> The MDS does inform the monitors if it has been shutdown. If you pull
> the plug or SIGKILL, it does not. :)
>
>
> --
> Patrick Donnelly
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] faster switch to another mds

2019-02-19 Thread David Turner
It's also been mentioned a few times that when the MDS and MON are on the
same host, the downtime for the MDS is longer when both daemons stop at
about the same time.  It's been suggested to stop the MDS daemon, wait for
`ceph mds stat` to reflect the change, and then restart the rest of the server.
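For anyone who wants the concrete steps, a minimal sketch (assuming
systemd-managed daemons and an MDS id equal to the node's short hostname --
adjust the unit name to your deployment):

systemctl stop ceph-mds@$(hostname -s)    # stop only the MDS first
ceph mds stat                             # repeat until a standby has taken over
reboot                                    # then restart the rest of the node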
HTH.

On Mon, Feb 11, 2019 at 3:55 PM Gregory Farnum  wrote:

> You can't tell from the client log here, but probably the MDS itself was
> failing over to a new instance during that interval. There's not much
> experience with it, but you could experiment with faster failover by
> reducing the mds beacon and grace times. This may or may not work
> reliably...
>
> On Sat, Feb 9, 2019 at 10:52 AM Fyodor Ustinov  wrote:
>
>> Hi!
>>
>> I have ceph cluster with 3 nodes with mon/mgr/mds servers.
>> I reboot one node and see this in client log:
>>
>> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon2 10.5.105.40:6789 socket
>> closed (con state OPEN)
>> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon2 10.5.105.40:6789 session
>> lost, hunting for new mon
>> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon0 10.5.105.34:6789 session
>> established
>> Feb 09 20:29:22 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state OPEN)
>> Feb 09 20:29:23 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state CONNECTING)
>> Feb 09 20:29:24 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state CONNECTING)
>> Feb 09 20:29:24 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
>> closed (con state CONNECTING)
>> Feb 09 20:29:53 ceph-nfs1 kernel: ceph: mds0 reconnect start
>> Feb 09 20:29:53 ceph-nfs1 kernel: ceph: mds0 reconnect success
>> Feb 09 20:30:05 ceph-nfs1 kernel: ceph: mds0 recovery completed
>>
>> As I understand it, the following has happened:
>> 1. Client detects the link with the mon server is broken and quickly
>> switches to another mon (less than 1 second).
>> 2. Client detects the link with the mds server is broken, tries to
>> reconnect 3 times (unsuccessfully), waits, and reconnects to the same mds
>> after 30 seconds of downtime.
>>
>> I have 2 questions:
>> 1. Why?
>> 2. How to reduce switching time to another mds?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS overwrite/truncate performance hit

2019-02-19 Thread David Turner
If your client needs to be able to handle the writes like that on its own,
RBDs might be the more appropriate use case.  You lose the ability to have
multiple clients accessing the data as easily as with CephFS, but you would
gain the features you're looking for.

On Tue, Feb 12, 2019 at 1:43 PM Gregory Farnum  wrote:

>
>
> On Tue, Feb 12, 2019 at 5:10 AM Hector Martin 
> wrote:
>
>> On 12/02/2019 06:01, Gregory Farnum wrote:
>> > Right. Truncates and renames require sending messages to the MDS, and
>> > the MDS committing to RADOS (aka its disk) the change in status, before
>> > they can be completed. Creating new files will generally use a
>> > preallocated inode so it's just a network round-trip to the MDS.
>>
>> I see. Is there a fundamental reason why these kinds of metadata
>> operations cannot be buffered in the client, or is this just the current
>> way they're implemented?
>>
>
> It's pretty fundamental, at least to the consistency guarantees we hold
> ourselves to. What happens if the client has buffered an update like that,
> performs writes to the data with those updates in mind, and then fails
> before they're flushed to the MDS? A local FS doesn't need to worry about a
> different node having a different lifetime, and can control the write order
> of its metadata and data updates on belated flush a lot more precisely than
> we can. :(
> -Greg
>
>
>>
>> e.g. on a local FS these kinds of writes can just stick around in the
>> block cache unflushed. And of course for CephFS I assume file extension
>> also requires updating the file size in the MDS, yet that doesn't block
>> while truncation does.
>>
>> > Going back to your first email, if you do an overwrite that is confined
>> > to a single stripe unit in RADOS (by default, a stripe unit is the size
>> > of your objects which is 4MB and it's aligned from 0), it is guaranteed
>> > to be atomic. CephFS can only tear writes across objects, and only if
>> > your client fails before the data has been flushed.
>>
>> Great! I've implemented this in a backwards-compatible way, so that gets
>> rid of this bottleneck. It's just a 128-byte flag file (formerly
>> variable length, now I just pad it to the full 128 bytes and rewrite it
>> in-place). This is good information to know for optimizing things :-)
>>
>> --
>> Hector Martin (hec...@marcansoft.com)
>> Public Key: https://mrcn.st/pub
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: client hangs

2019-02-19 Thread David Turner
You're attempting to use mismatching client name and keyring.  You want to
use matching name and keyring.  For your example, you would want to either
use `--keyring /etc/ceph/ceph.client.admin.keyring --name client.admin` or
`--keyring /etc/ceph/ceph.client.cephfs.keyring --name client.cephfs`.
Mixing and matching does not work.  Treat them like username and password.
You wouldn't try to log into your computer under your account with the
admin password.
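Using the command from the message below with a matching pair would look
like:

ceph-fuse --keyring /etc/ceph/ceph.client.cephfs.keyring --name client.cephfs -m 192.168.1.17:6789 /mnt/cephfs

(or keep the admin keyring and use --name client.admin instead.)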

On Tue, Feb 19, 2019 at 12:58 PM Hennen, Christian <
christian.hen...@uni-trier.de> wrote:

> > Sounds like a network issue. Are there firewalls/NAT between nodes?
> No, there is currently no firewall in place. Nodes and clients are on the
> same network. MTUs match, ports are opened according to nmap.
>
> > Try running ceph-fuse on the node that runs the mds, and check if it works
> properly.
> When I try to run ceph-fuse on either a client or cephfiler1
> (MON,MGR,MDS,OSDs) I get
> - "operation not permitted" when using the client keyring
> - "invalid argument" when using the admin keyring
> - "ms_handle_refused" when using the admin keyring and connecting to
> 127.0.0.1:6789
>
> ceph-fuse --keyring /etc/ceph/ceph.client.admin.keyring --name
> client.cephfs -m 192.168.1.17:6789 /mnt/cephfs
>
> -Ursprüngliche Nachricht-
> Von: Yan, Zheng 
> Gesendet: Dienstag, 19. Februar 2019 11:31
> An: Hennen, Christian 
> Cc: ceph-users@lists.ceph.com
> Betreff: Re: [ceph-users] CephFS: client hangs
>
> On Tue, Feb 19, 2019 at 5:10 PM Hennen, Christian <
> christian.hen...@uni-trier.de> wrote:
> >
> > Hi!
> >
> > >mon_max_pg_per_osd = 400
> > >
> > >In the ceph.conf and then restart all the services / or inject the
> > >config into the running admin
> >
> > I restarted each server (MONs and OSDs weren’t enough) and now the
> health warning is gone. Still no luck accessing CephFS though.
> >
> >
> > > MDS show a client got evicted. Nothing else looks abnormal.  Do new
> > > cephfs clients also get evicted quickly?
> >
> > Aside from the fact that evicted clients don’t show up in ceph -s, we
> observe other strange things:
> >
> > ·   Setting max_mds has no effect
> >
> > ·   Ceph osd blacklist ls sometimes lists cluster nodes
> >
>
> Sounds like a network issue. Are there firewalls/NAT between nodes?
>
> > The only client that is currently running is ‚master1‘. It also hosts a
> MON and a MGR. Its syslog (https://gitlab.uni-trier.de/snippets/78) shows
> messages like:
> >
> > Feb 13 06:40:33 master1 kernel: [56165.943008] libceph: wrong peer,
> > want 192.168.1.17:6800/-2045158358, got 192.168.1.17:6800/1699349984
> >
> > Feb 13 06:40:33 master1 kernel: [56165.943014] libceph: mds1
> > 192.168.1.17:6800 wrong peer at address
> >
> > The other day I did the update from 12.2.8 to 12.2.11, which can also be
> seen in the logs. Again, there appeared these messages. I assume that’s
> normal operations since ports can change and daemons have to find each
> other again? But what about Feb 13 in the morning? I didn’t do any restarts
> then.
> >
> > Also, clients are printing messages like the following on the console:
> >
> > [1026589.751040] ceph: handle_cap_import: mismatched seq/mseq: ino
> > (1994988.fffe) mds0 seq1 mseq 15 importer mds1 has
> > peer seq 2 mseq 15
> >
> > [1352658.876507] ceph: build_path did not end path lookup where
> > expected, namelen is 23, pos is 0
> >
> > Oh, and btw, the ceph nodes are running on Ubuntu 16.04, clients are on
> 14.04 with kernel 4.4.0-133.
> >
>
> Try running ceph-fuse on the node that runs the mds, and check if it works properly.
>
>
> > For reference:
> >
> > > Cluster details: https://gitlab.uni-trier.de/snippets/77
> >
> > > MDS log:
> > > https://gitlab.uni-trier.de/snippets/79?expanded=true&viewer=simple)
> >
> >
> > Kind regards
> > Christian Hennen
> >
> > Project Manager Infrastructural Services ZIMK University of Trier
> > Germany
> >
> > Von: Ashley Merrick 
> > Gesendet: Montag, 18. Februar 2019 16:53
> > An: Hennen, Christian 
> > Cc: ceph-users@lists.ceph.com
> > Betreff: Re: [ceph-users] CephFS: client hangs
> >
> > Correct yes from my expirence OSD’s aswel.
> >
> > On Mon, 18 Feb 2019 at 11:51 PM, Hennen, Christian <
> christian.hen...@uni-trier.de> wrote:
> >
> > Hi!
> >
> > >mon_max_pg_per_osd = 400
> > >
> > >In the ceph.conf and then restart all the services / or inject the
> > >config into the running admin
> >
> > I restarted all MONs, but I assume the OSDs need to be restarted as well?
> >
> > > MDS show a client got evicted. Nothing else looks abnormal.  Do new
> > > cephfs clients also get evicted quickly?
> >
> > Yeah, it seems so. But strangely there is no indication of it in 'ceph
> > -s' or 'ceph health detail'. And they don't seem to be evicted
> > permanently? Right now, only 1 client is connected. The others are shut
> down since last week.
> > 'ceph osd blacklist ls' shows 0 entries.
> >
> >
> > Kind regards
> > Christian Hennen
> >
> > Project Manager Infrastructural Services ZI

Re: [ceph-users] crush map has straw_calc_version=0 and legacy tunables on luminous

2019-02-19 Thread David Turner
[1] Here is a really cool set of slides from Ceph Day Berlin where Dan van
der Ster uses the mgr balancer module with upmap to gradually change the
tunables of a cluster without causing major client impact.  The down side
for you is that upmap requires all luminous or newer clients, but if you
upgrade your kernel clients to 4.13+, then you can enable upmap in the
cluster and utilize the balancer module to upgrade your cluster tunables.
As stated [2] here, those kernel versions still report as Jewel clients,
but only because they are missing some non-essential luminous client
features, even though they are fully compatible with upmap and the other
required features.

As a side note on the mgr balancer in upmap mode, it balances your
cluster in such a way that it attempts to distribute all PGs for a pool
evenly across all OSDs.  So if you have 3 different pools, the PGs for
those pools should each be within 1 or 2 PG totals on every OSD in your
cluster... it's really cool.  The slides discuss how to get your cluster to
that point as well, in case you have modified your weights or reweights at
all.


[1]
https://www.slideshare.net/Inktank_Ceph/ceph-day-berlin-mastering-ceph-operations-upmap-and-the-mgr-balancer
[2]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-November/031206.html
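For reference, once every client is luminous-capable, switching the
balancer to upmap mode is only a few commands (a sketch -- double check the
`ceph features` output first):

ceph features                                      # confirm no truly pre-luminous clients remain
ceph osd set-require-min-compat-client luminous    # add --yes-i-really-mean-it if kernel clients misreport as jewel
ceph balancer mode upmap
ceph balancer on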

On Mon, Feb 4, 2019 at 6:31 PM Shain Miley  wrote:

> For future reference I found these 2 links which answer most of the
> questions:
>
> http://docs.ceph.com/docs/master/rados/operations/crush-map/
>
>
> https://www.openstack.org/assets/presentation-media/Advanced-Tuning-and-Operation-guide-for-Block-Storage-using-Ceph-Boston-2017-final.pdf
>
>
>
> We have about 250TB (x3) in our cluster so I am leaning toward not
> changing things at this point because it sounds like there will be a
> significant amount of data movement involved for not a lot in return.
>
>
>
> If anyone knows of a strong reason I should change the tunables profile
> away from what I have…then please let me know so I don’t end up running the
> cluster in a sub-optimal state for no reason.
>
>
>
> Thanks,
>
> Shain
>
>
>
> --
>
> Shain Miley | Manager of Systems and Infrastructure, Digital Media |
> smi...@npr.org | 202.513.3649
>
>
>
> *From: *ceph-users  on behalf of Shain
> Miley 
> *Date: *Monday, February 4, 2019 at 3:03 PM
> *To: *"ceph-users@lists.ceph.com" 
> *Subject: *[ceph-users] crush map has straw_calc_version=0 and legacy
> tunables on luminous
>
>
>
> Hello,
>
> I just upgraded our cluster to 12.2.11 and I have a few questions around
> straw_calc_version and tunables.
>
> Currently ceph status shows the following:
>
> crush map has straw_calc_version=0
>
> crush map has legacy tunables (require argonaut, min is firefly)
>
>
>
>1. Will setting tunables to optimal also change the straw_calc_version
>or do I need to set that separately?
>
>
>2. Right now I have a set of rbd kernel clients connecting using
>kernel version 4.4.  The ‘ceph daemon mon.id sessions’ command shows
>that this client is still connecting using the hammer feature set (and a
>few others on jewel as well):
>
>"MonSession(client.113933130 10.35.100.121:0/3425045489 is open allow
>*, features 0x7fddff8ee8cbffb (jewel))",  “MonSession(client.112250505
>10.35.100.99:0/4174610322 is open allow *, features 0x106b84a842a42
>(hammer))",
>
>My question is what is the minimum kernel version I would need to
>upgrade the 4.4 kernel server to in order to get to jewel or luminous?
>
>
>
>3. Will setting the tunables to optimal on luminous prevent jewel and
>hammer clients from connecting?  I want to make sure I don’t do anything
>that will prevent my existing clients from connecting to the cluster.
>
>
>
>
> Thanks in advance,
>
> Shain
>
>
>
> --
>
> Shain Miley | Manager of Systems and Infrastructure, Digital Media |
> smi...@npr.org | 202.513.3649
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-19 Thread David Turner
With a RACK failure domain, you should be able to have an entire rack
powered down without noticing any major impact on the clients.  I regularly
take down OSDs and nodes for maintenance and upgrades without seeing any
problems with client IO.

On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
wrote:

> Hello - I have a couple of questions on ceph cluster stability, even
> though we follow all the recommendations below:
> - Having separate replication n/w and data n/w
> - RACK is the failure domain
> - Using SSDs for journals (1:4 ratio)
>
> Q1 - If one OSD goes down, cluster IO drops drastically and customer
> apps are impacted.
> Q2 - What is the stability ratio, i.e. with the above, is the ceph
> cluster in a workable condition if one osd or one node goes down, etc.?
>
> Thanks
> Swami
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook

2019-02-19 Thread David Turner
Have you ever seen an example of a Ceph cluster being run and managed by
Rook?  It's a really cool idea and takes care of containerizing mons, rgw,
mds, etc that I've been thinking about doing anyway.  Having those
containerized means that you can upgrade all of the mon services before
any of your other daemons are even aware of a new Ceph version, even if
they're running on the same server.  There are some recent upgrade bugs for
small clusters with mons and osds on the same node that would have been
mitigated with containerized Ceph versions.  For putting OSDs in
containers, have you ever needed to run a custom compiled version of Ceph
for a few OSDs to get past a bug that was causing you some troubles?  With
OSDs in containers, you could do that without worrying about that version
of Ceph being used by any other OSDs.

On top of all of that, I keep feeling like a dinosaur for not understanding
Kubernetes better and have been really excited since seeing Rook
orchestrating a Ceph cluster in K8s.  I spun up a few VMs to start testing
configuring a Kubernetes cluster.  The Rook Slack channel recommended using
kubeadm to set up K8s to manage Ceph.

On Mon, Feb 18, 2019 at 11:50 AM Marc Roos  wrote:

>
> Why not just keep it bare metal? Especially with future ceph
> upgrading/testing. I am running CentOS 7 with luminous and am running
> libvirt on the nodes as well. If you configure them with a tls/ssl
> connection, you can even nicely migrate a vm from one host/ceph node to
> the other.
> The next thing I am testing is mesos, to use the ceph nodes to run
> containers. I am still testing this on some vm's, but it looks like you
> only have to install a few rpms (maybe around 300MB) and 2 extra
> services on the nodes to get this up and running as well. (But keep in
> mind that the help on their mailing list is not as good as here ;))
>
>
>
> -Original Message-
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: 18 February 2019 17:31
> To: ceph-users
> Subject: [ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook
>
> I'm getting some "new" (to me) hardware that I'm going to upgrade my
> home Ceph cluster with.  Currently it's running a Proxmox cluster
> (Debian) which precludes me from upgrading to Mimic.  I am thinking
> about taking the opportunity to convert most of my VMs into containers
> and migrate my cluster into a K8s + Rook configuration now that Ceph is
> [1] stable on Rook.
>
> I haven't ever configured a K8s cluster and am planning to test this out
> on VMs before moving to it with my live data.  Has anyone done a
> migration from a baremetal Ceph cluster into K8s + Rook?  Additionally
> what is a good way for a K8s beginner to get into managing a K8s
> cluster.  I see various places recommend either CoreOS or kubeadm for
> starting up a new K8s cluster but I don't know the pros/cons for either.
>
> As far as migrating the Ceph services into Rook, I would assume that the
> process would be pretty simple to add/create new mons, mds, etc into
> Rook with the baremetal cluster details.  Once those are active and
> working just start decommissioning the services on baremetal.  For me,
> the OSD migration should be similar since I don't have any multi-device
> OSDs so I only need to worry about migrating individual disks between
> nodes.
>
>
> [1]
> https://blog.rook.io/rook-v0-9-new-storage-backends-in-town-ab952523ec53
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-19 Thread David Turner
I don't know that there's anything that can be done to resolve this yet
without rebuilding the OSD.  Based on a Nautilus tool being able to resize
the DB device, I'm assuming that Nautilus is also capable of migrating the
DB/WAL between devices.  That functionality would allow anyone to migrate
their DB back off of their spinner which is what's happening to you.  I
don't believe that sort of tooling exists yet, though, without compiling
the Nautilus Beta tooling for yourself.
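For anyone finding this later: the tooling I'm referring to eventually
surfaced as ceph-bluestore-tool's bluefs-bdev-migrate.  Something along
these lines, run against a stopped OSD, should pull the spilled-over DB
back onto the fast device -- treat the exact flags as an assumption and
check the man page of your release first:

systemctl stop ceph-osd@73
ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-73 \
    --devs-source /var/lib/ceph/osd/ceph-73/block --dev-target /var/lib/ceph/osd/ceph-73/block.db
systemctl start ceph-osd@73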

On Tue, Feb 19, 2019 at 12:03 AM Konstantin Shalygin  wrote:

> On 2/18/19 9:43 PM, David Turner wrote:
> > Do you have historical data from these OSDs to see when/if the DB used
> > on osd.73 ever filled up?  To account for this OSD using the slow
> > storage for DB, all we need to do is show that it filled up the fast
> > DB at least once.  If that happened, then something spilled over to
> > the slow storage and has been there ever since.
>
> Yes, I have. Also I checked my JIRA records for what I did at those times
> and marked them on the timeline: [1]
>
> Another graph compares osd.(33|73) over the last year: [2]
>
>
> [1] https://ibb.co/F7smCxW
>
> [2] https://ibb.co/dKWWDzW
>
> k
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade Luminous to mimic on Ubuntu 18.04

2019-02-18 Thread David Turner
Everybody is just confused that you don't have a newer version of Ceph
available. Are you running `apt-get dist-upgrade` to upgrade ceph? Do you
have any packages being held back? There is no reason that Ubuntu 18.04
shouldn't be able to upgrade to 12.2.11.
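A quick way to check is something like:

apt-cache policy ceph-osd    # shows which repo and version is the install candidate
apt-mark showhold            # lists any packages that are being held back
sudo apt-get update && sudo apt-get dist-upgrade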

On Mon, Feb 18, 2019, 4:38 PM  wrote:
> Hello people,
>
> On 11 February 2019 12:47:36 CET, c...@elchaka.de wrote:
> >Hello Ashley,
> >
> >On 9 February 2019 17:30:31 CET, Ashley Merrick
> > wrote:
> >>What does the output of apt-get update look like on one of the nodes?
> >>
> >>You can just list the lines that mention CEPH
> >>
> >
> >... .. .
> >Get:6 Https://Download.ceph.com/debian-luminous bionic InRelease [8393
> >B]
> >... .. .
> >
> >The Last available is 12.2.8.
>
> Any advice or recommendations on how to proceed to be able to update to
> mimic/(nautilus)?
>
> - Mehmet
> >
> >- Mehmet
> >
> >>Thanks
> >>
> >>On Sun, 10 Feb 2019 at 12:28 AM,  wrote:
> >>
> >>> Hello Ashley,
> >>>
> >>> Thank you for this fast response.
> >>>
> >>> I can't prove this yet but I am already using ceph's own repo for
> >>> Ubuntu 18.04 and 12.2.7/8 is the latest available there...
> >>>
> >>> - Mehmet
> >>>
> >>> On 9 February 2019 17:21:32 CET, Ashley Merrick <
> >>> singap...@amerrick.co.uk> wrote:
> >>> >Around available versions, are you using the Ubuntu repos or the CEPH
> >>> >18.04 repo?
> >>> >
> >>> >The updates will always be slower to reach you if you're waiting for
> >>> >them to hit the Ubuntu repo vs adding CEPH's own.
> >>> >
> >>> >
> >>> >On Sun, 10 Feb 2019 at 12:19 AM,  wrote:
> >>> >
> >>> >> Hello m8s,
> >>> >>
> >>> >> I'm curious how we should do an upgrade of our ceph cluster on
> >>> >> Ubuntu 16/18.04, as (at least on our 18.04 nodes) we only have 12.2.7
> >>> >> (or .8?).
> >>> >>
> >>> >> For an upgrade to mimic we should first update to the latest version,
> >>> >> actually 12.2.11 (iirc), which is not possible on 18.04.
> >>> >>
> >>> >> Is there an update path from 12.2.7/8 to the actual mimic release, or
> >>> >> better, the upcoming nautilus?
> >>> >>
> >>> >> Any advice?
> >>> >>
> >>> >> - Mehmet___
> >>> >> ceph-users mailing list
> >>> >> ceph-users@lists.ceph.com
> >>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>> >>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >___
> >ceph-users mailing list
> >ceph-users@lists.ceph.com
> >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] IRC channels now require registered and identified users

2019-02-18 Thread David Turner
Is this still broken in the 1-way direction where Slack users' comments do
not show up in IRC?  That would explain why nothing I ever type (whether
helping someone or asking a question) ever gets a response.

On Tue, Dec 18, 2018 at 6:50 AM Joao Eduardo Luis  wrote:

> On 12/18/2018 11:22 AM, Joao Eduardo Luis wrote:
> > On 12/18/2018 11:18 AM, Dan van der Ster wrote:
> >> Hi Joao,
> >>
> >> Has that broken the Slack connection? I can't tell if its broken or
> >> just quiet... last message on #ceph-devel was today at 1:13am.
> >
> > Just quiet, it seems. Just tested it and the bridge is still working.
>
> Okay, turns out the ceph-ircslackbot user is not identified, and that
> makes it unable to send messages to the channel. This means the bridge
> is working in one direction only (irc to slack), and will likely break
> when/if the user leaves the channel (as it won't be able to get back in).
>
> I will figure out just how this works today. In the mean time, I've
> relaxed the requirement for registered/identified users so that the bot
> works again. It will be reactivated once this is addressed.
>
>   -Joao
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook

2019-02-18 Thread David Turner
I'm getting some "new" (to me) hardware that I'm going to upgrade my home
Ceph cluster with.  Currently it's running a Proxmox cluster (Debian) which
precludes me from upgrading to Mimic.  I am thinking about taking the
opportunity to convert most of my VMs into containers and migrate my
cluster into a K8s + Rook configuration now that Ceph is [1] stable on Rook.

I haven't ever configured a K8s cluster and am planning to test this out on
VMs before moving to it with my live data.  Has anyone done a migration
from a baremetal Ceph cluster into K8s + Rook?  Additionally, what is a good
way for a K8s beginner to get into managing a K8s cluster?  I see various
places recommend either CoreOS or kubeadm for starting up a new K8s cluster
but I don't know the pros/cons for either.

As far as migrating the Ceph services into Rook, I would assume that the
process would be pretty simple to add/create new mons, mds, etc into Rook
with the baremetal cluster details.  Once those are active and working just
start decommissioning the services on baremetal.  For me, the OSD migration
should be similar since I don't have any multi-device OSDs so I only need
to worry about migrating individual disks between nodes.


[1] https://blog.rook.io/rook-v0-9-new-storage-backends-in-town-ab952523ec53
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems causing dead disks

2019-02-18 Thread David Turner
We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk
(partitioned), 3 disks per node, 5 nodes per cluster.  The clusters are
12.2.4 running CephFS and RBDs.  So in total we have 15 NVMe's per cluster
and 30 NVMe's in total.  They were all built at the same time and were
running firmware version QDV10130.  On this firmware version we early on
had 2 disk failures, a few months later we had 1 more, and then a month
after that (just a few weeks ago) we had 7 disk failures in 1 week.

The failures are such that the disk is no longer visible to the OS.  This
holds true beyond server reboots as well as placing the failed disks into a
new server.  With a firmware upgrade tool we got an error that pretty much
said there's no way to get data back and to RMA the disk.  We upgraded all
of our remaining disks' firmware to QDV101D1 and haven't had any problems
since then.  Most of our failures happened while rebalancing the cluster
after replacing dead disks and we tested rigorously around that use case
after upgrading the firmware.  This firmware version seems to have resolved
whatever the problem was.

We have about 100 more of these scattered among database servers and other
servers that have never had this problem while running the
QDV10130 firmware as well as firmwares between this one and the one we
upgraded to.  Bluestore on Ceph is the only use case we've had so far with
this sort of failure.

Has anyone else come across this issue before?  Our current theory is that
Bluestore is accessing the disk in a way that is triggering a bug in the
older firmware version that isn't triggered by more traditional
filesystems.  We have a scheduled call with Intel to discuss this, but
their preliminary searches into the bugfixes and known problems between
firmware versions didn't indicate the bug that we triggered.  It would be
good to have some more information about what those differences for disk
accessing might be to hopefully get a better answer from them as to what
the problem is.
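As a side note for anyone comparing notes, the firmware revision is easy to
check with nvme-cli (assuming the package is installed):

nvme list    # prints model, serial, and firmware revision for each NVMe device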


[1]
https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Placing replaced disks to correct buckets.

2019-02-18 Thread David Turner
Also what commands did you run to remove the failed HDDs and the commands
you have so far run to add their replacements back in?
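For reference, the usual form for placing an OSD back into a CRUSH bucket
looks like the following (the id, weight, and host here are made-up
placeholders -- substitute your own):

ceph osd crush add osd.12 1.819 host=osd-node-3    # for an item not yet in the CRUSH map
ceph osd crush set osd.12 1.819 host=osd-node-3    # to move/reweight an existing item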

On Sat, Feb 16, 2019 at 9:55 PM Konstantin Shalygin  wrote:

> I recently replaced failed HDDs and removed them from their respective
> buckets as per procedure.
>
> But I’m now facing an issue when trying to place new ones back into the
> buckets. I’m getting an error of ‘osd nr not found’ OR ‘file or
> directory not found’ OR command syntax error.
>
> I have been using the commands below:
>
> ceph osd crush set   
> ceph osd crush  set   
>
> I do however find the OSD number when i run command:
>
> ceph osd find 
>
> Your assistance/response to this will be highly appreciated.
>
> Regards
> John.
>
>
> Please, paste your `ceph osd tree`, your version, and exactly what error
> you get, including the osd number.
>
> Less obfuscation is better in this, perhaps, simple case.
>
>
> k
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-18 Thread David Turner
Do you have historical data from these OSDs to see when/if the DB used on
osd.73 ever filled up?  To account for this OSD using the slow storage for
DB, all we need to do is show that it filled up the fast DB at least once.
If that happened, then something spilled over to the slow storage and has
been there ever since.
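If you don't have graphs going back that far, the current split can at
least be checked on the OSD's node via the admin socket (assuming default
socket paths):

ceph daemon osd.73 perf dump | grep -E '"(db|slow)_(total|used)_bytes"'
# a non-zero slow_used_bytes means BlueFS has spilled onto the slow device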

On Sat, Feb 16, 2019 at 1:50 AM Konstantin Shalygin  wrote:

> On 2/16/19 12:33 AM, David Turner wrote:
> > The answer is probably going to be in how big your DB partition is vs
> > how big your HDD disk is.  From your output it looks like you have a
> > 6TB HDD with a 28GB Blocks.DB partition.  Even though the DB used size
> > isn't currently full, I would guess that at some point since this OSD
> > was created that it did fill up and what you're seeing is the part of
> > the DB that spilled over to the data disk. This is why the official
> > recommendation (that is quite cautious, but cautious because some use
> > cases will use this up) for a blocks.db partition is 4% of the data
> > drive.  For your 6TB disks that's a recommendation of 240GB per DB
> > partition.  Of course the actual size of the DB needed is dependent on
> > your use case.  But pretty much every use case for a 6TB disk needs a
> > bigger partition than 28GB.
>
>
> My current db size of osd.33 is 7910457344 bytes, and osd.73 is
> 2013265920+4685037568 bytes. 7544Mbyte (24.56% of db_total_bytes) vs
> 6388Mbyte (6.69% of db_total_bytes).
>
> Why is osd.33 not using slow storage in this case?
>
>
>
> k
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-15 Thread David Turner
The answer is probably going to be in how big your DB partition is vs how
big your HDD disk is.  From your output it looks like you have a 6TB HDD
with a 28GB Blocks.DB partition.  Even though the DB used size isn't
currently full, I would guess that at some point since this OSD was created
that it did fill up and what you're seeing is the part of the DB that
spilled over to the data disk.  This is why the official recommendation
(that is quite cautious, but cautious because some use cases will use this
up) for a blocks.db partition is 4% of the data drive.  For your 6TB disks
that's a recommendation of 240GB per DB partition.  Of course the actual
size of the DB needed is dependent on your use case.  But pretty much every
use case for a 6TB disk needs a bigger partition than 28GB.
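As a sketch of what that looks like at OSD creation time (the device names
are examples, and the NVMe partition/LV would be sized at roughly 240GB for
a 6TB data disk):

ceph-volume lvm create --bluestore --data /dev/sdl --block.db /dev/nvme0n1p1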

On Thu, Feb 14, 2019 at 11:58 PM Konstantin Shalygin  wrote:

> Wrong metadata paste of osd.73 in previous message.
>
>
> {
>
>  "id": 73,
>  "arch": "x86_64",
>  "back_addr": "10.10.10.6:6804/175338",
>  "back_iface": "vlan3",
>  "bluefs": "1",
>  "bluefs_db_access_mode": "blk",
>  "bluefs_db_block_size": "4096",
>  "bluefs_db_dev": "259:22",
>  "bluefs_db_dev_node": "nvme2n1",
>  "bluefs_db_driver": "KernelDevice",
>  "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
>  "bluefs_db_partition_path": "/dev/nvme2n1p11",
>  "bluefs_db_rotational": "0",
>  "bluefs_db_serial": "CVFT4324002Q400BGN  ",
>  "bluefs_db_size": "30064771072",
>  "bluefs_db_type": "nvme",
>  "bluefs_single_shared_device": "0",
>  "bluefs_slow_access_mode": "blk",
>  "bluefs_slow_block_size": "4096",
>  "bluefs_slow_dev": "8:176",
>  "bluefs_slow_dev_node": "sdl",
>  "bluefs_slow_driver": "KernelDevice",
>  "bluefs_slow_model": "TOSHIBA HDWE160 ",
>  "bluefs_slow_partition_path": "/dev/sdl2",
>  "bluefs_slow_rotational": "1",
>  "bluefs_slow_size": "6001069199360",
>  "bluefs_slow_type": "hdd",
>  "bluefs_wal_access_mode": "blk",
>  "bluefs_wal_block_size": "4096",
>  "bluefs_wal_dev": "259:22",
>  "bluefs_wal_dev_node": "nvme2n1",
>  "bluefs_wal_driver": "KernelDevice",
>  "bluefs_wal_model": "INTEL SSDPEDMD400G4 ",
>  "bluefs_wal_partition_path": "/dev/nvme2n1p12",
>  "bluefs_wal_rotational": "0",
>  "bluefs_wal_serial": "CVFT4324002Q400BGN  ",
>  "bluefs_wal_size": "1073741824",
>  "bluefs_wal_type": "nvme",
>  "bluestore_bdev_access_mode": "blk",
>  "bluestore_bdev_block_size": "4096",
>  "bluestore_bdev_dev": "8:176",
>  "bluestore_bdev_dev_node": "sdl",
>  "bluestore_bdev_driver": "KernelDevice",
>  "bluestore_bdev_model": "TOSHIBA HDWE160 ",
>  "bluestore_bdev_partition_path": "/dev/sdl2",
>  "bluestore_bdev_rotational": "1",
>  "bluestore_bdev_size": "6001069199360",
>  "bluestore_bdev_type": "hdd",
>  "ceph_version": "ceph version 12.2.10
> (177915764b752804194937482a39e95e0ca3de94) luminous (stable)",
>  "cpu": "Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz",
>  "default_device_class": "hdd",
>  "distro": "centos",
>  "distro_description": "CentOS Linux 7 (Core)",
>  "distro_version": "7",
>  "front_addr": "172.16.16.16:6803/175338",
>  "front_iface": "vlan4",
>  "hb_back_addr": "10.10.10.6:6805/175338",
>  "hb_front_addr": "172.16.16.16:6805/175338",
>  "hostname": "ceph-osd5",
>  "journal_rotational": "0",
>  "kernel_description": "#1 SMP Tue Aug 14 21:49:04 UTC 2018",
>  "kernel_version": "3.10.0-862.11.6.el7.x86_64",
>  "mem_swap_kb": "0",
>  "mem_total_kb": "65724256",
>  "os": "Linux",
>  "osd_data": "/var/lib/ceph/osd/ceph-73",
>  "osd_objectstore": "bluestore",
>  "rotational": "1"
> }
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel10.2.11 EC pool out a osd, its PGs remap to the osds in the same host

2019-02-15 Thread David Turner
I'm leaving the response on the CRUSH rule for Gregory, but you have
another problem you're running into that is causing more of this data to
stay on this node than you intend.  While you `out` the OSD it is still
contributing to the Host's weight.  So the host is still set to receive
that amount of data and distribute it among the disks inside of it.  This
is the default behavior (even if you `destroy` the OSD) to minimize the
data movement for losing the disk and again for adding it back into the
cluster after you replace the device.  If you are really strapped for
space, though, then you might consider fully purging the OSD which will
reduce the Host weight to what the other OSDs are.  However if you do have
a problem in your CRUSH rule, then doing this won't change anything for you.
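On Jewel, fully removing the failed osd.2 so that the host weight drops
would look roughly like this (Luminous+ wraps the same steps into `ceph osd
purge`):

ceph osd crush remove osd.2
ceph auth del osd.2
ceph osd rm 2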

On Thu, Feb 14, 2019 at 11:15 PM hnuzhoulin2  wrote:

> Thanks. I read your reply in
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg48717.html
> so using indep will cause less data remapping when an osd fails.
> using firstn: 1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6 , 60% data remap
> using indep :1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5, 25% data remap
>
> am I right?
> if so, what is recommended when a disk fails and the total available
> size of the remaining disks in the machine is not enough (cannot replace the
> failed disk immediately)? Or should I reserve more available space in the EC situation?
>
> On 02/14/2019 02:49,Gregory Farnum
>  wrote:
>
> Your CRUSH rule for EC spools is forcing that behavior with the line
>
> step chooseleaf indep 1 type ctnr
>
> If you want different behavior, you’ll need a different crush rule.
>
> On Tue, Feb 12, 2019 at 5:18 PM hnuzhoulin2  wrote:
>
>> Hi, cephers
>>
>>
>> I am building a ceph EC cluster. When a disk has an error, I out it. But all
>> its PGs remap to the osds in the same host, which I think should remap to
>> other hosts in the same rack.
>> test process is:
>>
>> ceph osd pool create .rgw.buckets.data 8192 8192 erasure ISA-4-2
>> site1_sata_erasure_ruleset 4
>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/1
>> /etc/init.d/ceph stop osd.2
>> ceph osd out 2
>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/2
>> diff /tmp/1 /tmp/2 -y --suppress-common-lines
>>
>> 0 1.0 1.0 118 osd.0   | 0 1.0 1.0 126 osd.0
>> 1 1.0 1.0 123 osd.1   | 1 1.0 1.0 139 osd.1
>> 2 1.0 1.0 122 osd.2   | 2 1.0 0 0 osd.2
>> 3 1.0 1.0 113 osd.3   | 3 1.0 1.0 131 osd.3
>> 4 1.0 1.0 122 osd.4   | 4 1.0 1.0 136 osd.4
>> 5 1.0 1.0 112 osd.5   | 5 1.0 1.0 127 osd.5
>> 6 1.0 1.0 114 osd.6   | 6 1.0 1.0 128 osd.6
>> 7 1.0 1.0 124 osd.7   | 7 1.0 1.0 136 osd.7
>> 8 1.0 1.0 95 osd.8   | 8 1.0 1.0 113 osd.8
>> 9 1.0 1.0 112 osd.9   | 9 1.0 1.0 119 osd.9
>> TOTAL 3073T 197G | TOTAL 3065T 197G
>> MIN/MAX VAR: 0.84/26.56 | MIN/MAX VAR: 0.84/26.52
>>
>>
>> some config info: (detail configs see:
>> https://gist.github.com/hnuzhoulin/575883dbbcb04dff448eea3b9384c125)
>> jewel 10.2.11  filestore+rocksdb
>>
>> ceph osd erasure-code-profile get ISA-4-2
>> k=4
>> m=2
>> plugin=isa
>> ruleset-failure-domain=ctnr
>> ruleset-root=site1-sata
>> technique=reed_sol_van
>>
>> part of ceph.conf is:
>>
>> [global]
>> fsid = 1CAB340D-E551-474F-B21A-399AC0F10900
>> auth cluster required = cephx
>> auth service required = cephx
>> auth client required = cephx
>> pid file = /home/ceph/var/run/$name.pid
>> log file = /home/ceph/log/$cluster-$name.log
>> mon osd nearfull ratio = 0.85
>> mon osd full ratio = 0.95
>> admin socket = /home/ceph/var/run/$cluster-$name.asok
>> osd pool default size = 3
>> osd pool default min size = 1
>> osd objectstore = filestore
>> filestore merge threshold = -10
>>
>> [mon]
>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>> mon data = /home/ceph/var/lib/$type/$cluster-$id
>> mon cluster log file = /home/ceph/log/$cluster.log
>> [osd]
>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>> osd data = /home/ceph/var/lib/$type/$cluster-$id
>> osd journal = /home/ceph/var/lib/$type/$cluster-$id/journal
>> osd journal size = 1
>> osd mkfs type = xfs
>> osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k
>> osd backfill full ratio = 0.92
>> osd failsafe full ratio = 0.95
>> osd failsafe nearfull ratio = 0.85
>> osd max backfills = 1
>> osd crush update on start = false
>> osd op thread timeout = 60
>> filestore split multiple = 8
>> filestore max sync interval = 15
>> filestore min sync interval = 5
>> [osd.0]
>> host = cld-osd1-56
>> addr = X
>> user = ceph
>> devs = /disk/link/osd-0/data
>> osd journal = /disk/link/osd-0/journal
>> …….
>> [osd.503]
>> host = cld-osd42-56
>> addr = 10.108.87.52
>> user = ceph
>> devs = /disk/link/osd-503/data
>> osd journal = /disk/link/osd-503/journal
>>
>>
>> crushmap is below:
>>
>> # begin crush map
>> tuna

Re: [ceph-users] Problems with osd creation in Ubuntu 18.04, ceph 13.2.4-1bionic

2019-02-15 Thread David Turner
I have found that running a zap before all prepare/create commands with
ceph-volume helps things run smoother.  Zap is specifically there to clear
everything on a disk away to make the disk ready to be used as an OSD.
Your wipefs command is still fine, but then I would lvm zap the disk before
continuing.  I would run the commands like [1] this.  I also prefer the
single command lvm create as opposed to lvm prepare and lvm activate.  Try
that out and see if you still run into the problems creating the BlueStore
filesystem.

[1] ceph-volume lvm zap /dev/sdg
ceph-volume lvm prepare --bluestore --data /dev/sdg
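The single-command variant mentioned above would be:

ceph-volume lvm zap /dev/sdg
ceph-volume lvm create --bluestore --data /dev/sdg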

On Thu, Feb 14, 2019 at 10:25 AM Rainer Krienke 
wrote:

> Hi,
>
> I am quite new to ceph and just try to set up a ceph cluster. Initially
> I used ceph-deploy for this but when I tried to create a BlueStore osd
> ceph-deploy fails. Next I tried the direct way on one of the OSD-nodes
> using ceph-volume to create the osd, but this also fails. Below you can
> see what  ceph-volume says.
>
> I ensured that there was no left over lvm VG and LV on the disk sdg
> before I started the osd creation for this disk. The very same error
> happens also on other disks not just for /dev/sdg. All the disk have 4TB
> in size and the linux system is Ubuntu 18.04 and finally ceph is
> installed in version 13.2.4-1bionic from this repo:
> https://download.ceph.com/debian-mimic.
>
> There is a VG and two LV's  on the system for the ubuntu system itself
> that is installed on two separate disks configured as software raid1 and
> lvm on top of the raid. But I cannot imagine that this might do any harm
> to cephs osd creation.
>
> Does anyone have an idea what might be wrong?
>
> Thanks for hints
> Rainer
>
> root@ceph1:~# wipefs -fa /dev/sdg
> root@ceph1:~# ceph-volume lvm prepare --bluestore --data /dev/sdg
> Running command: /usr/bin/ceph-authtool --gen-print-key
> Running command: /usr/bin/ceph --cluster ceph --name
> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
> -i - osd new 14d041d6-0beb-4056-8df2-3920e2febce0
> Running command: /sbin/vgcreate --force --yes
> ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b /dev/sdg
>  stdout: Physical volume "/dev/sdg" successfully created.
>  stdout: Volume group "ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b"
> successfully created
> Running command: /sbin/lvcreate --yes -l 100%FREE -n
> osd-block-14d041d6-0beb-4056-8df2-3920e2febce0
> ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b
>  stdout: Logical volume "osd-block-14d041d6-0beb-4056-8df2-3920e2febce0"
> created.
> Running command: /usr/bin/ceph-authtool --gen-print-key
> Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
> --> Absolute path not found for executable: restorecon
> --> Ensure $PATH environment variable contains common executable locations
> Running command: /bin/chown -h ceph:ceph
>
> /dev/ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b/osd-block-14d041d6-0beb-4056-8df2-3920e2febce0
> Running command: /bin/chown -R ceph:ceph /dev/dm-8
> Running command: /bin/ln -s
>
> /dev/ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b/osd-block-14d041d6-0beb-4056-8df2-3920e2febce0
> /var/lib/ceph/osd/ceph-0/block
> Running command: /usr/bin/ceph --cluster ceph --name
> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
> mon getmap -o /var/lib/ceph/osd/ceph-0/activate.monmap
>  stderr: got monmap epoch 1
> Running command: /usr/bin/ceph-authtool /var/lib/ceph/osd/ceph-0/keyring
> --create-keyring --name osd.0 --add-key
> AQAAY2VcU968HxAAvYWMaJZmriUc4H9bCCp8XQ==
>  stdout: creating /var/lib/ceph/osd/ceph-0/keyring
> added entity osd.0 auth auth(auid = 18446744073709551615
> key=AQAAY2VcU968HxAAvYWMaJZmriUc4H9bCCp8XQ== with 0 caps)
> Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/keyring
> Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/
> Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore
> bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap
> --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid
> 14d041d6-0beb-4056-8df2-3920e2febce0 --setuser ceph --setgroup ceph
>  stderr: 2019-02-14 13:45:54.788 7f3fcecb3240 -1
> bluestore(/var/lib/ceph/osd/ceph-0/) _read_fsid unparsable uuid
>  stderr: /build/ceph-13.2.4/src/os/bluestore/KernelDevice.cc: In
> function 'virtual int KernelDevice::read(uint64_t, uint64_t,
> ceph::bufferlist*, IOContext*, bool)' thread 7f3fcecb3240 time
> 2019-02-14 13:45:54.841130
>  stderr: /build/ceph-13.2.4/src/os/bluestore/KernelDevice.cc: 821:
> FAILED assert((uint64_t)r == len)
>  stderr: ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e)
> mimic (stable)
>  stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int,
> char const*)+0x102) [0x7f3fc60d33e2]
>  stderr: 2: (()+0x26d5a7) [0x7f3fc60d35a7]
>  stderr: 3: (KernelDevice::read(unsigned long, unsigned long,
> ceph::buffer::list*, IOContext*, bool)+0x4a7) [0x561371346817]
>  stderr: 4: (BlueFS::_read(BlueFS::FileReade

Re: [ceph-users] [Ceph-community] Deploy and destroy monitors

2019-02-13 Thread David Turner
Ceph-users is the proper ML to post questions like this.

On Thu, Dec 20, 2018 at 2:30 PM Joao Eduardo Luis  wrote:

> On 12/20/2018 04:55 PM, João Aguiar wrote:
> > I am having an issue with "ceph-ceploy mon”
> >
> > I started by creating a cluster with one monitor with "create-deploy
> new"… "create-initial”...
> > And ended up with ceph,conf like:
> > ...
> > mon_initial_members = node0
> > mon_host = 10.2.2.2
> > ….
> >
> > Later I try to deploy a new monitor (ceph-deploy mon create node1),
> wait for it to get in quorum and then destroy the node0 (ceph-deploy mon
> destroy node0).
>
> Is the new monitor forming a quorum with the existing monitor? If not,
> then you won't have monitors running when you remove node0.
>
> Does ceph-deploy remove the mon being destroyed from the monmap? If not,
> you'll have two monitors in the monmap, and you'll need a majority to
> form quorum; for a 2 monitor deployment that means you'll need 2
> monitors up and running.
>
> > Result: Ceph gets unresponsive.
>
> This is the typical symptom of absence of a quorum.
>
>   -Joao
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] Ceph SSE-KMS integration to use Safenet as Key Manager service

2019-02-13 Thread David Turner
Ceph-users is the correct ML to post questions like this.

On Wed, Jan 2, 2019 at 5:40 PM Rishabh S  wrote:

> Dear Members,
>
> Please let me know if you have any link with examples/detailed steps of
> Ceph-Safenet(KMS) integration.
>
> Thanks & Regards,
> Rishabh
>
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] Error during playbook deployment: TASK [ceph-mon : test if rbd exists]

2019-02-13 Thread David Turner
Ceph-users ML is the proper mailing list for questions like this.

On Sat, Jan 26, 2019 at 12:31 PM Meysam Kamali  wrote:

> Hi Ceph Community,
>
> I am using ansible 2.2 and ceph branch stable-2.2, on centos7, to deploy
> the playbook. But the deployment get hangs in this step "TASK [ceph-mon :
> test if rbd exists]". it gets hangs there and doesnot move.
> I have all the three ceph nodes ceph-admin, ceph-mon, ceph-osd
> I appreciate any help! Here I am providing log:
>
> ---Log --
> TASK [ceph-mon : test if rbd exists]
> ***
> task path: /root/ceph-ansible/roles/ceph-mon/tasks/ceph_keys.yml:60
> Using module file
> /usr/lib/python2.7/site-packages/ansible/modules/core/commands/command.py
>  ESTABLISH SSH CONNECTION FOR USER: None
>  SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o
> KbdInteractiveAuthentication=no -o
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey
> -o PasswordAuthentication=no -o ConnectTimeout=10 -o
> ControlPath=/root/.ansible/cp/%h-%r ceph2mon '/bin/sh -c '"'"'echo ~ &&
> sleep 0'"'"''
>  ESTABLISH SSH CONNECTION FOR USER: None
>  SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o
> KbdInteractiveAuthentication=no -o
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey
> -o PasswordAuthentication=no -o ConnectTimeout=10 -o
> ControlPath=/root/.ansible/cp/%h-%r ceph2mon '/bin/sh -c '"'"'( umask 77 &&
> mkdir -p "` echo
> /root/.ansible/tmp/ansible-tmp-1547740115.56-213823795856896 `" && echo
> ansible-tmp-1547740115.56-213823795856896="` echo
> /root/.ansible/tmp/ansible-tmp-1547740115.56-213823795856896 `" ) && sleep
> 0'"'"''
>  PUT /tmp/tmpG7u1eN TO
> /root/.ansible/tmp/ansible-tmp-1547740115.56-213823795856896/command.py
>  SSH: EXEC sftp -b - -C -o ControlMaster=auto -o
> ControlPersist=60s -o KbdInteractiveAuthentication=no -o
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey
> -o PasswordAuthentication=no -o ConnectTimeout=10 -o
> ControlPath=/root/.ansible/cp/%h-%r '[ceph2mon]'
>  ESTABLISH SSH CONNECTION FOR USER: None
>  SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o
> KbdInteractiveAuthentication=no -o
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey
> -o PasswordAuthentication=no -o ConnectTimeout=10 -o
> ControlPath=/root/.ansible/cp/%h-%r ceph2mon '/bin/sh -c '"'"'chmod u+x
> /root/.ansible/tmp/ansible-tmp-1547740115.56-213823795856896/
> /root/.ansible/tmp/ansible-tmp-1547740115.56-213823795856896/command.py &&
> sleep 0'"'"''
>  ESTABLISH SSH CONNECTION FOR USER: None
>  SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o
> KbdInteractiveAuthentication=no -o
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey
> -o PasswordAuthentication=no -o ConnectTimeout=10 -o
> ControlPath=/root/.ansible/cp/%h-%r -tt ceph2mon '/bin/sh -c '"'"'sudo -H
> -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo
> BECOME-SUCCESS-iefqzergptqzfhqmxouabfjfvdvbadku; /usr/bin/python
> /root/.ansible/tmp/ansible-tmp-1547740115.56-213823795856896/command.py; rm
> -rf "/root/.ansible/tmp/ansible-tmp-1547740115.56-213823795856896/" >
> /dev/null 2>&1'"'"'"'"'"'"'"'"' && sleep 0'"'"''
>
> -
>
>
> Thanks,
> Meysam Kamali
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] Need help related to ceph client authentication

2019-02-13 Thread David Turner
The Ceph-users ML is the correct list to ask questions like this.  Did you
figure out the problems/questions you had?
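Regarding the code example asked about below: with SSE-C the client
generates and keeps the key, sends it over HTTPS with every request, and
RGW encrypts/decrypts server-side.  A minimal sketch using the AWS CLI as
the S3 client (the endpoint, bucket, and file names are placeholders, and
the exact key-passing syntax should be checked against `aws s3 cp help` for
your CLI version; boto3 exposes the same thing via the SSECustomerAlgorithm
and SSECustomerKey parameters):

openssl rand -out sse-c.key 32    # 256-bit key, generated and kept by the client
aws --endpoint-url https://rgw.example.com s3 cp ./file.txt s3://mybucket/file.txt \
    --sse-c AES256 --sse-c-key fileb://sse-c.key
aws --endpoint-url https://rgw.example.com s3 cp s3://mybucket/file.txt ./file.out \
    --sse-c AES256 --sse-c-key fileb://sse-c.key    # the same key must be supplied to read the object back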

On Tue, Dec 4, 2018 at 11:39 PM Rishabh S  wrote:

> Hi Gaurav,
>
> Thank You.
>
> Yes, I am using boto, though I was looking for suggestions on how my ceph
> client should get access and secret keys.
>
> Another thing where I need help is regarding encryption
> http://docs.ceph.com/docs/mimic/radosgw/encryption/#
>
> I am little confused what does these statement means.
>
> The Ceph Object Gateway supports server-side encryption of uploaded
> objects, with 3 options for the management of encryption keys. Server-side
> encryption means that the data is sent over HTTP in its unencrypted form,
> and the Ceph Object Gateway stores that data in the Ceph Storage Cluster in
> encrypted form.
>
> Note
>
>
> Requests for server-side encryption must be sent over a secure HTTPS
> connection to avoid sending secrets in plaintext.
>
> CUSTOMER-PROVIDED KEYS
> 
>
> In this mode, the client passes an encryption key along with each request
> to read or write encrypted data. It is the client’s responsibility to
> manage those keys and remember which key was used to encrypt each object.
>
> My understanding is that when the ceph client uploads a file/object to the
> Ceph cluster, the client request should be HTTPS and will include the
> “customer-provided-key”.
> Then Ceph will use the customer-provided key to encrypt the file/object before
> storing the data in the Ceph cluster.
>
> Please correct me and suggest the best approach to store files/objects in the
> Ceph cluster.
>
> Any code example of the initial handshake to upload a file/object with an
> encryption key would be of great help.
>
> Regards,
> Rishabh
>
>
> On 05-Dec-2018, at 2:15 AM, Gaurav Sitlani 
> wrote:
>
> Hi Rishabh,
> You can refer to the ceph RGW doc and search for boto:
> http://docs.ceph.com/docs/master/install/install-ceph-gateway/?highlight=boto
> You can get a basic python boto script where you can mention your access
> and secret key and connect to your S3 cluster.
> I hope you know how to get your keys right.
>
> Regards,
> Gaurav Sitlani
>
>
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] all vms can not start up when boot all the ceph hosts.

2019-02-13 Thread David Turner
This might not be a Ceph issue at all, depending on whether you're using any
sort of caching.  If you have caching on your disk controllers at all, then
the writes might have made it to the cache but never to the OSD disks, which
would show up as problems on the VM RBDs.  Make sure you have proper BBUs on
your disk controllers, and/or disable any caching enabled on your controllers
or disks; it may be benefiting your write speed while the cluster is healthy,
but it can put you into this state during a catastrophe.

On Tue, Dec 4, 2018 at 10:49 PM linghucongsong 
wrote:

>
> Thanks to all! I might have found the reason.
>
> It looks like it is related to the bug below.
>
> https://bugs.launchpad.net/nova/+bug/1773449
>
>
>
>
> At 2018-12-04 23:42:15, "Ouyang Xu"  wrote:
>
> Hi linghucongsong:
>
> I have got this issue before, you can try to fix it as below:
>
> 1. use *rbd lock ls* to get the lock for the vm
> 2. use *rbd lock rm* to remove that lock for the vm
> 3. start vm again
>
> hope that can help you.
>
> regards,
>
> Ouyang
>
> On 2018/12/4 下午4:48, linghucongsong wrote:
>
> HI all!
>
> I have a ceph test environment using ceph with openstack. There are some VMs
> running on the openstack. It is just a test environment.
>
> My ceph version is 12.2.4. Yesterday I rebooted all the ceph hosts; before
> this I did not shut down the VMs on the openstack.
>
> When all the hosts booted up and the ceph became healthy, I found all the
> VMs could not start up. All the VMs have the
>
> xfs error below. Even using xfs_repair I cannot repair this problem.
>
> It is just a test environment so the data is not important to me. I know
> the ceph version 12.2.4 is not stable
>
> enough, but how can it have such serious problems? Maybe other people care
> about this too. Thanks to all. :)
>
>
>
>
>
> ___
> ceph-users mailing 
> listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to mount one of the cephfs namespace using ceph-fuse?

2019-02-13 Thread David Turner
Note that this format in fstab does require a certain version of util-linux
because of the funky format of the line.  Pretty much it maps all command
line options at the beginning of the line separated with commas.

On Wed, Feb 13, 2019 at 2:10 PM David Turner  wrote:

> I believe the fstab line for ceph-fuse in this case would look something
> like [1] this.  We use a line very similar to that to mount cephfs at a
> specific client_mountpoint that the specific cephx user only has access to.
>
> [1] id=acapp3,client_mds_namespace=fs1   /tmp/ceph   fuse.ceph
>  defaults,noatime,_netdev 0 2
>
> On Tue, Dec 4, 2018 at 3:22 AM Zhenshi Zhou  wrote:
>
>> Hi
>>
>> I can use this mount cephfs manually. But how to edit fstab so that the
>> system will auto-mount cephfs by ceph-fuse?
>>
>> Thanks
>>
>> Yan, Zheng  于2018年11月20日周二 下午8:08写道:
>>
>>> ceph-fuse --client_mds_namespace=xxx
>>> On Tue, Nov 20, 2018 at 7:33 PM ST Wong (ITSC)  wrote:
>>> >
>>> > Hi all,
>>> >
>>> >
>>> >
>>> > We’re using mimic and enabled multiple fs flag. We can do
>>> kernel mount of particular fs (e.g. fs1) with mount option
>>> mds_namespace=fs1.However, this is not working for ceph-fuse:
>>> >
>>> >
>>> >
>>> > #ceph-fuse -n client.acapp3 -o mds_namespace=fs1 /tmp/ceph
>>> >
>>> > 2018-11-20 19:30:35.246 7ff5653edcc0 -1 init, newargv =
>>> 0x5564a21633b0 newargc=9
>>> >
>>> > fuse: unknown option `mds_namespace=fs1'
>>> >
>>> > ceph-fuse[3931]: fuse failed to start
>>> >
>>> > 2018-11-20 19:30:35.264 7ff5653edcc0 -1 fuse_lowlevel_new failed
>>> >
>>> >
>>> >
>>> > Sorry that I can’t find the correct option in ceph-fuse man page or
>>> doc.
>>> >
>>> > Please help.   Thanks a lot.
>>> >
>>> >
>>> >
>>> > Best Rgds
>>> >
>>> > /stwong
>>> >
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to mount one of the cephfs namespace using ceph-fuse?

2019-02-13 Thread David Turner
I believe the fstab line for ceph-fuse in this case would look something
like [1] this.  We use a line very similar to that to mount cephfs at a
specific client_mountpoint that the specific cephx user only has access to.

[1] id=acapp3,client_mds_namespace=fs1   /tmp/ceph   fuse.ceph
 defaults,noatime,_netdev 0 2

On Tue, Dec 4, 2018 at 3:22 AM Zhenshi Zhou  wrote:

> Hi
>
> I can use this mount cephfs manually. But how to edit fstab so that the
> system will auto-mount cephfs by ceph-fuse?
>
> Thanks
>
> Yan, Zheng  于2018年11月20日周二 下午8:08写道:
>
>> ceph-fuse --client_mds_namespace=xxx
>> On Tue, Nov 20, 2018 at 7:33 PM ST Wong (ITSC)  wrote:
>> >
>> > Hi all,
>> >
>> >
>> >
>> > We’re using mimic and enabled multiple fs flag. We can do
>> kernel mount of particular fs (e.g. fs1) with mount option
>> mds_namespace=fs1.However, this is not working for ceph-fuse:
>> >
>> >
>> >
>> > #ceph-fuse -n client.acapp3 -o mds_namespace=fs1 /tmp/ceph
>> >
>> > 2018-11-20 19:30:35.246 7ff5653edcc0 -1 init, newargv =
>> 0x5564a21633b0 newargc=9
>> >
>> > fuse: unknown option `mds_namespace=fs1'
>> >
>> > ceph-fuse[3931]: fuse failed to start
>> >
>> > 2018-11-20 19:30:35.264 7ff5653edcc0 -1 fuse_lowlevel_new failed
>> >
>> >
>> >
>> > Sorry that I can’t find the correct option in ceph-fuse man page or
>> doc.
>> >
>> > Please help.   Thanks a lot.
>> >
>> >
>> >
>> > Best Rgds
>> >
>> > /stwong
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] compacting omap doubles its size

2019-02-13 Thread David Turner
Sorry for the late response on this, but life has been really busy over the
holidays.

We compact our omaps offline with the ceph-kvstore-tool.  Here [1] is a
copy of the script that we use for our clusters.  You might need to modify
things a bit for your environment.  I don't remember which version this
functionality was added to ceph-kvstore-tool, but it exists in 12.2.4.  We
need to do this because our OSDs get marked out when they try to compact
their own omaps online.  We run this script monthly and then ad-hoc as we
find OSDs compacting their own omaps live.


[1] https://gist.github.com/drakonstein/4391c0b268a35b64d4f26a12e5058ba9
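For reference, the core operation is just an offline compaction of each OSD's
omap dir with ceph-kvstore-tool. A minimal hand-run sketch for a single OSD
(the OSD id, omap path, and leveldb backend are assumptions; substitute
rocksdb and your own paths as appropriate, and consider setting noout first):

systemctl stop ceph-osd@12
# compact the omap store in place while the OSD is offline
ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-12/current/omap compact
systemctl start ceph-osd@12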

On Thu, Nov 29, 2018 at 6:15 PM Tomasz Płaza 
wrote:

> Hi,
>
> I have a ceph 12.2.8 cluster on filestore with rather large omap dirs
> (avg size is about 150G). Recently slow requests became a problem, so
> after some digging I decided to convert omap from leveldb to rocksdb.
> Conversion went fine and slow requests rate went down to acceptable
> level. Unfortunately  conversion did not shrink most of omap dirs, so I
> tried online compaction:
>
> Before compaction: 50G/var/lib/ceph/osd/ceph-0/current/omap/
>
> After compaction: 100G/var/lib/ceph/osd/ceph-0/current/omap/
>
> Purge and recreate: 1.5G /var/lib/ceph/osd/ceph-0/current/omap/
>
>
> Before compaction: 135G/var/lib/ceph/osd/ceph-5/current/omap/
>
> After compaction: 260G/var/lib/ceph/osd/ceph-5/current/omap/
>
> Purge and recreate: 2.5G /var/lib/ceph/osd/ceph-5/current/omap/
>
>
> For me compaction which makes omap bigger is quite weird and
> frustrating. Please help.
>
>
> P.S. My cluster suffered from ongoing index reshards (it is disabled
> now) and on many buckets with 4m+ objects I have a lot of old indexes:
>
> 634   bucket1
> 651   bucket2
>
> ...
> 1231 bucket17
> 1363 bucket18
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can not start osd service by systemd

2018-11-19 Thread David Turner
I believe I fixed this issue by running `systemctl enable ceph-osd@n.service`
for all of the OSDs, and then it wasn't a problem in the future.
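For anyone hitting the same thing, a rough sketch of doing that for every OSD
present on a node (the default /var/lib/ceph/osd layout is an assumption):

# enable each OSD unit found on this host so systemd starts it on boot
for dir in /var/lib/ceph/osd/ceph-*; do
    id="${dir##*-}"                      # numeric OSD id from the directory name
    systemctl enable ceph-osd@"$id".service
done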

On Fri, Nov 9, 2018 at 9:30 PM  wrote:

> Hi!
>
> I find a confused question about start/stop ceph cluster by systemd:
>
> - when cluster is on, restart ceph.target can restart all osd service
> - when cluster is down, start ceph.target or start ceph-osd.target can not
> start osd service
>
>
> I have google this issue, seems the workaround is start ceph-osd@n.service
> by hand.
>
> Is it a bug?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate OSD journal to SSD partition

2018-11-19 Thread David Turner
For this the procedure is generally to stop the osd, flush the journal,
update the symlink on the osd to the new journal location, mkjournal, start
osd.  You shouldn't need to do anything in the ceph.conf file.
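As a sketch, using the OSD id and partition label from the question below
(osd.0 and the journal-1 label are assumptions taken from that mail):

systemctl stop ceph-osd@0
ceph-osd -i 0 --flush-journal                 # flush outstanding journal entries to the data disk
rm /var/lib/ceph/osd/ceph-0/journal           # remove the old journal symlink
ln -s /dev/disk/by-partlabel/journal-1 /var/lib/ceph/osd/ceph-0/journal
ceph-osd -i 0 --mkjournal                     # initialize the journal on the new device
systemctl start ceph-osd@0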

On Thu, Nov 8, 2018 at 2:41 AM  wrote:

> Hi all,
>
>
>
> I have been trying to migrate the journal to SSD partition for an while,
> basically I followed the guide here [1],  I have the below configuration
> defined in the ceph.conf
>
>
>
> [osd.0]
>
> osd_journal = /dev/disk/by-partlabel/journal-1
>
>
>
> And then create the journal in this way,
>
> # ceph-osd -i 0 –mkjournal
>
>
>
> After that, I started the osd,  and I saw the service is started
> successfully from the log print out on the console,
>
> 08 14:03:35 ceph1 ceph-osd[5111]: starting osd.0 at :/0 osd_data
> /var/lib/ceph/osd/ceph-0 /dev/disk/by-partlabel/journal-1
>
> 08 14:03:35 ceph1 ceph-osd[5111]: 2018-11-08 14:03:35.618247 7fe8b54b28c0
> -1 osd.0 766 log_to_monitors {default=true}
>
>
>
> But I not sure whether the new journal is effective or not, looks like it
> is still using the old partition (/dev/sdc2) for journal, and new partition
> which is actually “dev/sde1” has no information on the journal,
>
>
>
> # ceph-disk list
>
>
>
> /dev/sdc :
>
> /dev/sdc2 ceph journal, for /dev/sdc1
>
> /dev/sdc1 ceph data, active, cluster ceph, osd.0, journal /dev/sdc2
>
> /dev/sdd :
>
> /dev/sdd2 ceph journal, for /dev/sdd1
>
> /dev/sdd1 ceph data, active, cluster ceph, osd.1, journal /dev/sdd2
>
> /dev/sde :
>
> /dev/sde1 other, 0fc63daf-8483-4772-8e79-3d69d8477de4
>
> /dev/sdf other, unknown
>
>
>
> # ls -l /var/lib/ceph/osd/ceph-0/journal
>
> lrwxrwxrwx 1 ceph ceph 58  21  2018 /var/lib/ceph/osd/ceph-0/journal ->
> /dev/disk/by-partuuid/5b5cd6f6-5de4-44f3-9d33-e8a7f4b59f61
>
>
>
> # ls -l /dev/disk/by-partuuid/5b5cd6f6-5de4-44f3-9d33-e8a7f4b59f61
>
> lrwxrwxrwx 1 root root 10 8 13:59
> /dev/disk/by-partuuid/5b5cd6f6-5de4-44f3-9d33-e8a7f4b59f61 -> ../../sdc2
>
>
>
>
>
> My question is how I know which partition is taking the role of journal?
> Where can I see the new journal partition is linked?
>
>
>
> Any comments is highly appreciated!
>
>
>
>
>
> [1] https://fatmin.com/2015/08/11/ceph-show-osd-to-journal-mapping/
>
>
>
>
>
> Best Regards,
>
> Dave Chen
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic - EC and crush rules - clarification

2018-11-16 Thread David Turner
The difference for 2+2 vs 2x replication isn't in the amount of space being
used or saved, but in the number of OSDs you can safely lose without any
data loss or outages.  2x replication is generally considered very unsafe
for data integrity, but 2+2 would be as resilient as 3x replication while
only using as much space as 2x replication.

On Thu, Nov 1, 2018 at 11:25 PM Wladimir Mutel  wrote:

> David Turner wrote:
> > Yes, when creating an EC profile, it automatically creates a CRUSH rule
> > specific for that EC profile.  You are also correct that 2+1 doesn't
> > really have any resiliency built in.  2+2 would allow 1 node to go down
> > while still having your data accessible.  It will use 2x data to raw as
>
> Is not EC 2+2 the same as 2x replication (i.e. RAID1) ?
> Is not EC benefit and intention to allow equivalent replication
> factors be chosen between >1 and <2 ?
> That's why it is recommended to have k>m parameters. Because when you have m==k, it is equivalent to 2x
> replication, with m==2k - to 3x replication and so on.
> And correspondingly, with m==1 you have equivalent reliability
> of RAID5, with m==2 - that of RAID6, and you start to have more
> "interesting" reliability factors only when you could allow m>2
> and k>m. Overall, your reliability in Ceph is measured as a
> cluster rebuild/performance degradation time in case of
> up-to m OSDs failure, provided that no more than m OSDs
> (or larger failure domains) have failed at once.
> Sure, EC is beneficial only when you have enough failure domains
> (i.e. hosts). My criterion is that you should have more hosts
> than you have individual OSDs within a single host.
> I.e. at least 8 (and better >8) hosts when you have 8 OSDs
> per host.
>
> > opposed to the 1.5x of 2+1, but it gives you resiliency.  The example in
> > your command of 3+2 is not possible with your setup.  May I ask why you
> > want EC on such a small OSD count?  I'm guessing to not use as much
> > storage on your SSDs, but I would just suggest going with replica with
> > such a small cluster.  If you have a larger node/OSD count, then you can
> > start seeing if EC is right for your use case, but if this is production
> > data... I wouldn't risk it.
>
> > When setting the crush rule, it wants the name of it, ssdrule, not 2.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph 12.2.9 release

2018-11-07 Thread David Turner
My big question is that we've had a few of these releases this year that
are bugged and shouldn't be upgraded to... They don't have any release
notes or announcement and the only time this comes out is when users
finally ask about it weeks later.  Why is this not proactively announced to
avoid a problematic release and hopefully prevent people from installing
it?  It would be great if there were actual release notes saying not to
upgrade to this version or something.

On Wed, Nov 7, 2018 at 11:16 AM Ashley Merrick 
wrote:

> I am seeing this on the latest mimic on my test cluster aswel.
>
> Every automatic deep-scrub comes back as inconsistent, but doing another
> manual scrub comes back as fine and clear each time.
>
> Not sure if related or not..
>
> On Wed, 7 Nov 2018 at 11:57 PM, Christoph Adomeit <
> christoph.adom...@gatworks.de> wrote:
>
>> Hello together,
>>
>> we have upgraded to 12.2.9 because it was in the official repos.
>>
>> Right after the update and some scrubs we have issues.
>>
>> This morning after regular scrubs we had around 10% of all pgs inconstent:
>>
>> pgs: 4036 active+clean
>>   380  active+clean+inconsistent
>>
>> After repairung these 380 pgs we again have:
>>
>> 1/93611534 objects unfound (0.000%)
>> 28   active+clean+inconsistent
>> 1active+recovery_wait+degraded
>>
>> Now we stopped repairing because it does not seem to solve the problem
>> and more and more error messages are occuring. So far we did not see
>> corruption but we do not feel well with the cluster.
>>
>> What do you suggest, wait for 12.2.10 ? Roll Back to 12.2.8 ?
>>
>> Is ist dangerous for our Data to leave the cluster running ?
>>
>> I am sure we do not have hardware errors and that these errors came with
>> the update to 12.2.9.
>>
>> Thanks
>>   Christoph
>>
>>
>>
>> On Wed, Nov 07, 2018 at 07:39:59AM -0800, Gregory Farnum wrote:
>> > On Wed, Nov 7, 2018 at 5:58 AM Simon Ironside 
>> > wrote:
>> >
>> > >
>> > >
>> > > On 07/11/2018 10:59, Konstantin Shalygin wrote:
>> > > >> I wonder if there is any release announcement for ceph 12.2.9 that
>> I
>> > > missed.
>> > > >> I just found the new packages on download.ceph.com, is this an
>> official
>> > > >> release?
>> > > >
>> > > > This is because 12.2.9 have a several bugs. You should avoid to use
>> this
>> > > > release and wait for 12.2.10
>> > >
>> > > Argh! What's it doing in the repos then?? I've just upgraded to it!
>> > > What are the bugs? Is there a thread about them?
>> >
>> >
>> > If you’ve already upgraded and have no issues then you won’t have any
>> > trouble going forward — except perhaps on the next upgrade, if you do it
>> > while the cluster is unhealthy.
>> >
>> > I agree that it’s annoying when these issues make it out. We’ve had
>> ongoing
>> > discussions to try and improve the release process so it’s less
>> drawn-out
>> > and to prevent these upgrade issues from making it through testing, but
>> > nobody has resolved it yet. If anybody has experience working with deb
>> > repositories and handling releases, the Ceph upstream could use some
>> > help... ;)
>> > -Greg
>> >
>> >
>> > >
>> > > Simon
>> > > ___
>> > > ceph-users mailing list
>> > > ceph-users@lists.ceph.com
>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > >
>>
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-11-05 Thread David Turner
Correct, it's just that the ceph-kvstore-tool for Luminous doesn't have the
ability to migrate between them.  It exists in Jewel 10.2.11 and in Mimic,
but it doesn't exist in Luminous.  There's no structural difference in the
omap backend so I'm planning to just use a Mimic version of the tool to
update my omap backends.

On Mon, Nov 5, 2018 at 4:26 PM Pavan Rallabhandi <
prallabha...@walmartlabs.com> wrote:

> Not sure I understand that, but starting Luminous, the filestore omap
> backend is rocksdb by default.
>
>
>
> *From: *David Turner 
> *Date: *Monday, November 5, 2018 at 3:25 PM
>
>
> *To: *Pavan Rallabhandi 
> *Cc: *ceph-users 
> *Subject: *EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
>
>
> Digging into the code a little more, that functionality was added in
> 10.2.11 and 13.0.1, but it still isn't anywhere in the 12.x.x Luminous
> version.  That's so bizarre.
>
>
>
> On Sat, Nov 3, 2018 at 11:56 AM Pavan Rallabhandi <
> prallabha...@walmartlabs.com> wrote:
>
> Not exactly, this feature was supported in Jewel starting 10.2.11, ref
> https://github.com/ceph/ceph/pull/18010
>
>
>
> I thought you mentioned you were using Luminous 12.2.4.
>
>
>
> *From: *David Turner 
> *Date: *Friday, November 2, 2018 at 5:21 PM
>
>
> *To: *Pavan Rallabhandi 
> *Cc: *ceph-users 
> *Subject: *EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
>
>
> That makes so much more sense. It seems like RHCS had had this ability
> since Jewel while it was only put into the community version as of Mimic.
> So my version of the tool isn't actually capable of changing the backend
> db. While digging into the code I did find a bug with the creation of the
> rocksdb backend created with ceph-kvstore-tool. It doesn't use the ceph
> defaults or any settings in your config file for the db settings. I'm
> working on testing a modified version that should take those settings into
> account. If the fix does work, the fix will be able to apply to a few other
> tools as well that can be used to set up the omap backend db.
>
>
>
> On Fri, Nov 2, 2018, 4:26 PM Pavan Rallabhandi <
> prallabha...@walmartlabs.com> wrote:
>
> It was Redhat versioned Jewel. But may be more relevantly, we are on
> Ubuntu unlike your case.
>
>
>
> *From: *David Turner 
> *Date: *Friday, November 2, 2018 at 10:24 AM
>
>
> *To: *Pavan Rallabhandi 
> *Cc: *ceph-users 
> *Subject: *EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
>
>
> Pavan, which version of Ceph were you using when you changed your backend
> to rocksdb?
>
>
>
> On Mon, Oct 1, 2018 at 4:24 PM Pavan Rallabhandi <
> prallabha...@walmartlabs.com> wrote:
>
> Yeah, I think this is something to do with the CentOS binaries, sorry that
> I couldn’t be of much help here.
>
> Thanks,
> -Pavan.
>
> From: David Turner 
> Date: Monday, October 1, 2018 at 1:37 PM
> To: Pavan Rallabhandi 
> Cc: ceph-users 
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> I tried modifying filestore_rocksdb_options
> by removing compression=kNoCompression as well as setting it
> to compression=kSnappyCompression.  Leaving it with kNoCompression or
> removing it results in the same segfault in the previous log.  Setting it
> to kSnappyCompression resulted in [1] this being logged and the OSD just
> failing to start instead of segfaulting.  Is there anything else you would
> suggest trying before I purge this OSD from the cluster?  I'm afraid it
> might be something with the CentOS binaries.
>
> [1] 2018-10-01 17:10:37.134930 7f1415dfcd80  0  set rocksdb option
> compression = kSnappyCompression
> 2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument:
> Compression type Snappy is not linked with the binary.
> 2018-10-01 17:10:37.135004 7f1415dfcd80 -1
> filestore(/var/lib/ceph/osd/ceph-1) mount(1723): Error initializing rocksdb
> :
> 2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to
> mount object store
> 2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init
> failed: (1) Operation not permittedESC[0m
>
> On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi  prallabha...@walmartlabs.com> wrote:
> I looked at one of my test clusters running Jewel on Ubuntu 16.04, and
> interestingly I found this(below) in one of the OSD logs, which is
> different from your OSD boot log, where none of the compression algorithms
> seem to be suppor

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-11-05 Thread David Turner
Digging into the code a little more, that functionality was added in
10.2.11 and 13.0.1, but it still isn't anywhere in the 12.x.x Luminous
version.  That's so bizarre.

On Sat, Nov 3, 2018 at 11:56 AM Pavan Rallabhandi <
prallabha...@walmartlabs.com> wrote:

> Not exactly, this feature was supported in Jewel starting 10.2.11, ref
> https://github.com/ceph/ceph/pull/18010
>
>
>
> I thought you mentioned you were using Luminous 12.2.4.
>
>
>
> *From: *David Turner 
> *Date: *Friday, November 2, 2018 at 5:21 PM
>
>
> *To: *Pavan Rallabhandi 
> *Cc: *ceph-users 
> *Subject: *EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
>
>
> That makes so much more sense. It seems like RHCS had had this ability
> since Jewel while it was only put into the community version as of Mimic.
> So my version of the tool isn't actually capable of changing the backend
> db. While digging into the code I did find a bug with the creation of the
> rocksdb backend created with ceph-kvstore-tool. It doesn't use the ceph
> defaults or any settings in your config file for the db settings. I'm
> working on testing a modified version that should take those settings into
> account. If the fix does work, the fix will be able to apply to a few other
> tools as well that can be used to set up the omap backend db.
>
>
>
> On Fri, Nov 2, 2018, 4:26 PM Pavan Rallabhandi <
> prallabha...@walmartlabs.com> wrote:
>
> It was Redhat versioned Jewel. But may be more relevantly, we are on
> Ubuntu unlike your case.
>
>
>
> *From: *David Turner 
> *Date: *Friday, November 2, 2018 at 10:24 AM
>
>
> *To: *Pavan Rallabhandi 
> *Cc: *ceph-users 
> *Subject: *EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
>
>
> Pavan, which version of Ceph were you using when you changed your backend
> to rocksdb?
>
>
>
> On Mon, Oct 1, 2018 at 4:24 PM Pavan Rallabhandi <
> prallabha...@walmartlabs.com> wrote:
>
> Yeah, I think this is something to do with the CentOS binaries, sorry that
> I couldn’t be of much help here.
>
> Thanks,
> -Pavan.
>
> From: David Turner 
> Date: Monday, October 1, 2018 at 1:37 PM
> To: Pavan Rallabhandi 
> Cc: ceph-users 
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> I tried modifying filestore_rocksdb_options
> by removing compression=kNoCompression as well as setting it
> to compression=kSnappyCompression.  Leaving it with kNoCompression or
> removing it results in the same segfault in the previous log.  Setting it
> to kSnappyCompression resulted in [1] this being logged and the OSD just
> failing to start instead of segfaulting.  Is there anything else you would
> suggest trying before I purge this OSD from the cluster?  I'm afraid it
> might be something with the CentOS binaries.
>
> [1] 2018-10-01 17:10:37.134930 7f1415dfcd80  0  set rocksdb option
> compression = kSnappyCompression
> 2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument:
> Compression type Snappy is not linked with the binary.
> 2018-10-01 17:10:37.135004 7f1415dfcd80 -1
> filestore(/var/lib/ceph/osd/ceph-1) mount(1723): Error initializing rocksdb
> :
> 2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to
> mount object store
> 2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init
> failed: (1) Operation not permittedESC[0m
>
> On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi  prallabha...@walmartlabs.com> wrote:
> I looked at one of my test clusters running Jewel on Ubuntu 16.04, and
> interestingly I found this(below) in one of the OSD logs, which is
> different from your OSD boot log, where none of the compression algorithms
> seem to be supported. This hints more at how rocksdb was built on CentOS
> for Ceph.
>
> 2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Compression algorithms
> supported:
> 2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Snappy supported: 1
> 2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Zlib supported: 1
> 2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Bzip supported: 0
> 2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: LZ4 supported: 0
> 2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: ZSTD supported: 0
> 2018-09-29 17:38:38.629115 7fbd318d4b00  4 rocksdb: Fast CRC32 supported: 0
>
> On 9/27/18, 2:56 PM, "Pavan Rallabhandi"  prallabha...@walmartlabs.com> wrote:
>
> I see Filestore symbols on the stack, so the bluestore config doesn’t
> affect. And the top frame of the st

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-11-02 Thread David Turner
That makes so much more sense. It seems like RHCS had had this ability
since Jewel while it was only put into the community version as of Mimic.
So my version of the tool isn't actually capable of changing the backend
db. While digging into the code I did find a bug with the creation of the
rocksdb backend created with ceph-kvstore-tool. It doesn't use the ceph
defaults or any settings in your config file for the db settings. I'm
working on testing a modified version that should take those settings into
account. If the fix does work, the fix will be able to apply to a few other
tools as well that can be used to set up the omap backend db.

On Fri, Nov 2, 2018, 4:26 PM Pavan Rallabhandi 
wrote:

> It was Redhat versioned Jewel. But may be more relevantly, we are on
> Ubuntu unlike your case.
>
>
>
> *From: *David Turner 
> *Date: *Friday, November 2, 2018 at 10:24 AM
>
>
> *To: *Pavan Rallabhandi 
> *Cc: *ceph-users 
> *Subject: *EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
>
>
> Pavan, which version of Ceph were you using when you changed your backend
> to rocksdb?
>
>
>
> On Mon, Oct 1, 2018 at 4:24 PM Pavan Rallabhandi <
> prallabha...@walmartlabs.com> wrote:
>
> Yeah, I think this is something to do with the CentOS binaries, sorry that
> I couldn’t be of much help here.
>
> Thanks,
> -Pavan.
>
> From: David Turner 
> Date: Monday, October 1, 2018 at 1:37 PM
> To: Pavan Rallabhandi 
> Cc: ceph-users 
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> I tried modifying filestore_rocksdb_options
> by removing compression=kNoCompression as well as setting it
> to compression=kSnappyCompression.  Leaving it with kNoCompression or
> removing it results in the same segfault in the previous log.  Setting it
> to kSnappyCompression resulted in [1] this being logged and the OSD just
> failing to start instead of segfaulting.  Is there anything else you would
> suggest trying before I purge this OSD from the cluster?  I'm afraid it
> might be something with the CentOS binaries.
>
> [1] 2018-10-01 17:10:37.134930 7f1415dfcd80  0  set rocksdb option
> compression = kSnappyCompression
> 2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument:
> Compression type Snappy is not linked with the binary.
> 2018-10-01 17:10:37.135004 7f1415dfcd80 -1
> filestore(/var/lib/ceph/osd/ceph-1) mount(1723): Error initializing rocksdb
> :
> 2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to
> mount object store
> 2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init
> failed: (1) Operation not permittedESC[0m
>
> On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi  prallabha...@walmartlabs.com> wrote:
> I looked at one of my test clusters running Jewel on Ubuntu 16.04, and
> interestingly I found this(below) in one of the OSD logs, which is
> different from your OSD boot log, where none of the compression algorithms
> seem to be supported. This hints more at how rocksdb was built on CentOS
> for Ceph.
>
> 2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Compression algorithms
> supported:
> 2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Snappy supported: 1
> 2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Zlib supported: 1
> 2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Bzip supported: 0
> 2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: LZ4 supported: 0
> 2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: ZSTD supported: 0
> 2018-09-29 17:38:38.629115 7fbd318d4b00  4 rocksdb: Fast CRC32 supported: 0
>
> On 9/27/18, 2:56 PM, "Pavan Rallabhandi"  prallabha...@walmartlabs.com> wrote:
>
> I see Filestore symbols on the stack, so the bluestore config doesn’t
> affect. And the top frame of the stack hints at a RocksDB issue, and there
> are a whole lot of these too:
>
> “2018-09-17 19:23:06.480258 7f1f3d2a7700  2 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636]
> Cannot find Properties block from file.”
>
> It really seems to be something with RocksDB on centOS. I still think
> you can try removing “compression=kNoCompression” from the
> filestore_rocksdb_options And/Or check if rocksdb is expecting snappy to be
> enabled.
>
> Thanks,
> -Pavan.
>
> From: David Turner <mailto:drakonst...@gmail.com>
> Date: Thursday, September 27, 2018 at 1:18 PM
> To: Pavan Rallabhandi <mailt

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-11-02 Thread David Turner
Pavan, which version of Ceph were you using when you changed your backend
to rocksdb?

On Mon, Oct 1, 2018 at 4:24 PM Pavan Rallabhandi <
prallabha...@walmartlabs.com> wrote:

> Yeah, I think this is something to do with the CentOS binaries, sorry that
> I couldn’t be of much help here.
>
> Thanks,
> -Pavan.
>
> From: David Turner 
> Date: Monday, October 1, 2018 at 1:37 PM
> To: Pavan Rallabhandi 
> Cc: ceph-users 
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> I tried modifying filestore_rocksdb_options
> by removing compression=kNoCompression as well as setting it
> to compression=kSnappyCompression.  Leaving it with kNoCompression or
> removing it results in the same segfault in the previous log.  Setting it
> to kSnappyCompression resulted in [1] this being logged and the OSD just
> failing to start instead of segfaulting.  Is there anything else you would
> suggest trying before I purge this OSD from the cluster?  I'm afraid it
> might be something with the CentOS binaries.
>
> [1] 2018-10-01 17:10:37.134930 7f1415dfcd80  0  set rocksdb option
> compression = kSnappyCompression
> 2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument:
> Compression type Snappy is not linked with the binary.
> 2018-10-01 17:10:37.135004 7f1415dfcd80 -1
> filestore(/var/lib/ceph/osd/ceph-1) mount(1723): Error initializing rocksdb
> :
> 2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to
> mount object store
> 2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init
> failed: (1) Operation not permittedESC[0m
>
> On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi  prallabha...@walmartlabs.com> wrote:
> I looked at one of my test clusters running Jewel on Ubuntu 16.04, and
> interestingly I found this(below) in one of the OSD logs, which is
> different from your OSD boot log, where none of the compression algorithms
> seem to be supported. This hints more at how rocksdb was built on CentOS
> for Ceph.
>
> 2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Compression algorithms
> supported:
> 2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Snappy supported: 1
> 2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Zlib supported: 1
> 2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Bzip supported: 0
> 2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: LZ4 supported: 0
> 2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: ZSTD supported: 0
> 2018-09-29 17:38:38.629115 7fbd318d4b00  4 rocksdb: Fast CRC32 supported: 0
>
> On 9/27/18, 2:56 PM, "Pavan Rallabhandi"  prallabha...@walmartlabs.com> wrote:
>
> I see Filestore symbols on the stack, so the bluestore config doesn’t
> affect. And the top frame of the stack hints at a RocksDB issue, and there
> are a whole lot of these too:
>
> “2018-09-17 19:23:06.480258 7f1f3d2a7700  2 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636]
> Cannot find Properties block from file.”
>
> It really seems to be something with RocksDB on centOS. I still think
> you can try removing “compression=kNoCompression” from the
> filestore_rocksdb_options And/Or check if rocksdb is expecting snappy to be
> enabled.
>
> Thanks,
> -Pavan.
>
> From: David Turner <mailto:drakonst...@gmail.com>
> Date: Thursday, September 27, 2018 at 1:18 PM
> To: Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com>
> Cc: ceph-users <mailto:ceph-users@lists.ceph.com>
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> I got pulled away from this for a while.  The error in the log is
> "abort: Corruption: Snappy not supported or corrupted Snappy compressed
> block contents" and the OSD has 2 settings set to snappy by default,
> async_compressor_type and bluestore_compression_algorithm.  Do either of
> these settings affect the omap store?
>
> On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi <mailto:mailto:
> prallabha...@walmartlabs.com> wrote:
> Looks like you are running on CentOS, fwiw. We’ve successfully ran the
> conversion commands on Jewel, Ubuntu 16.04.
>
> Have a feel it’s expecting the compression to be enabled, can you try
> removing “compression=kNoCompression” from the filestore_rocksdb_options?
> And/or you might want to check if rocksdb is expecting snappy to be enabled.
>
> From: David Turner <mailto:mailto:drakonst...@gmail.com>

Re: [ceph-users] Mimic - EC and crush rules - clarification

2018-11-01 Thread David Turner
Yes, when creating an EC profile, it automatically creates a CRUSH rule
specific for that EC profile.  You are also correct that 2+1 doesn't really
have any resiliency built in.  2+2 would allow 1 node to go down while
still having your data accessible.  It will use 2x data to raw as opposed
to the 1.5x of 2+1, but it gives you resiliency.  The example in your
command of 3+2 is not possible with your setup.  May I ask why you want EC
on such a small OSD count?  I'm guessing to not use as much storage on your
SSDs, but I would just suggest going with replica with such a small
cluster.  If you have a larger node/OSD count, then you can start seeing if
EC is right for your use case, but if this is production data... I wouldn't
risk it.

When setting the crush rule, it wants the name of it, ssdrule, not 2.
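As a hedged example for a 4-host, 1-OSD-per-host setup, a 2+2 profile plus a
pool using it, and setting a rule by name (the profile name and PG counts are
placeholders; "test" and "ssdrule" are taken from your commands below):

ceph osd erasure-code-profile set ec22ssd k=2 m=2 \
    crush-failure-domain=host crush-device-class=ssd
ceph osd pool create ecpool 64 64 erasure ec22ssd   # a matching crush rule is created automatically
ceph osd pool set test crush_rule ssdrule           # pass the rule name, not the numeric id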

On Thu, Nov 1, 2018 at 1:34 PM Steven Vacaroaia  wrote:

> Hi,
>
> I am trying to create an EC pool on my SSD based OSDs
> and will appreciate if someone clarify / provide advice about the following
>
> - best K + M combination for 4 hosts one OSD per host
>   My understanding is that K+M< OSD but using K=2, M=1 does not provide
> any redundancy
>   ( as soon as 1 OSD is down, you cannot write to the pool)
>   Am I right ?
>
> - assigning crush_rule as per documentation does not seem to work
> If I provide all the crush rule details when I create the EC profile, the
> PGs are being placed on SSD OSDs  AND a crush rule is automatically created
> Is that the right/new way of doing it ?
> EXAMPLE
> ceph osd erasure-code-profile set erasureISA crush-failure-domain=osd k=3
> m=2 crush-root=ssds plugin=isa technique=cauchy crush-device-class=ssd
>
>
>  [root@osd01 ~]#  ceph osd crush rule ls
> replicated_rule
> erasure-code
> ssdrule
> [root@osd01 ~]# ceph osd crush rule dump ssdrule
> {
> "rule_id": 2,
> "rule_name": "ssdrule",
> "ruleset": 2,
> "type": 1,
> "min_size": 1,
> "max_size": 10,
> "steps": [
> {
> "op": "take",
> "item": -4,
> "item_name": "ssds"
> },
> {
> "op": "chooseleaf_firstn",
> "num": 0,
> "type": "host"
> },
> {
> "op": "emit"
> }
> ]
> }
>
> [root@osd01 ~]# ceph osd pool set test crush_rule 2
> Error ENOENT: crush rule 2 does not exist
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Packages for debian in Ceph repo

2018-10-30 Thread David Turner
What version of qemu-img are you using?  I found [1] this when poking
around on my qemu server when checking for rbd support.  This version (note
it's proxmox) has rbd listed as a supported format.

[1]
# qemu-img -V; qemu-img --help|grep rbd
qemu-img version 2.11.2pve-qemu-kvm_2.11.2-1
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
Supported formats: blkdebug blkreplay blkverify bochs cloop dmg file ftp
ftps gluster host_cdrom host_device http https iscsi iser luks nbd null-aio
null-co parallels qcow qcow2 qed quorum raw rbd replication sheepdog
throttle vdi vhdx vmdk vpc vvfat zeroinit
On Tue, Oct 30, 2018 at 12:08 PM Kevin Olbrich  wrote:

> Is it possible to use qemu-img with rbd support on Debian Stretch?
> I am on Luminous and try to connect my image-buildserver to load images
> into a ceph pool.
>
> root@buildserver:~# qemu-img convert -p -O raw /target/test-vm.qcow2
>> rbd:rbd_vms_ssd_01/test_vm
>> qemu-img: Unknown protocol 'rbd'
>
>
> Kevin
>
> Am Mo., 3. Sep. 2018 um 12:07 Uhr schrieb Abhishek Lekshmanan <
> abhis...@suse.com>:
>
>> arad...@tma-0.net writes:
>>
>> > Can anyone confirm if the Ceph repos for Debian/Ubuntu contain packages
>> for
>> > Debian? I'm not seeing any, but maybe I'm missing something...
>> >
>> > I'm seeing ceph-deploy install an older version of ceph on the nodes
>> (from the
>> > Debian repo) and then failing when I run "ceph-deploy osd ..." because
>> ceph-
>> > volume doesn't exist on the nodes.
>> >
>> The newer versions of Ceph (from mimic onwards) requires compiler
>> toolchains supporting c++17 which we unfortunately do not have for
>> stretch/jessie yet.
>>
>> -
>> Abhishek
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancer module not balancing perfectly

2018-10-30 Thread David Turner
From the balancer module's code for v12.2.7 I noticed [1] these lines
which reference [2] these 2 config options for upmap. You might try using
more max iterations or a smaller max deviation to see if you can get a
better balance in your cluster. I would try to start with [3] these
commands/values and see if it improves your balance and/or allows you to
generate a better map.

[1]
https://github.com/ceph/ceph/blob/v12.2.7/src/pybind/mgr/balancer/module.py#L671-L672
[2] upmap_max_iterations (default 10)
upmap_max_deviation (default .01)
[3] ceph config-key set mgr/balancer/upmap_max_iterations 50
ceph config-key set mgr/balancer/upmap_max_deviation .005

On Tue, Oct 30, 2018 at 11:14 AM Steve Taylor 
wrote:

> I have a Luminous 12.2.7 cluster with 2 EC pools, both using k=8 and
> m=2. Each pool lives on 20 dedicated OSD hosts with 18 OSDs each. Each
> pool has 2048 PGs and is distributed across its 360 OSDs with host
> failure domains. The OSDs are identical (4TB) and are weighted with
> default weights (3.73).
>
> Initially, and not surprisingly, the PG distribution was all over the
> place with PG counts per OSD ranging from 40 to 83. I enabled the
> balancer module in upmap mode and let it work its magic, which reduced
> the range of the per-OSD PG counts to 56-61.
>
> While 56-61 is obviously a whole lot better than 40-83, with upmap I
> expected the range to be 56-57. If I run 'ceph balancer optimize
> ' again to attempt to create a new plan I get 'Error EALREADY:
> Unable to find further optimization,or distribution is already
> perfect.' I set the balancer's max_misplaced value to 1 in case that
> was preventing further optimization, but I still get the same error.
>
> I'm sure I'm missing some config option or something that will allow it
> to do better, but thus far I haven't been able to find anything in the
> docs, mailing list archives, or balancer source code that helps. Any
> ideas?
>
>
> Steve Taylor | Senior Software Engineer | StorageCraft Technology
> Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799 <(801)%20871-2799> |
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD node reinstallation

2018-10-30 Thread David Turner
Basically, it's a good idea to back up your /etc/ceph/ folder before
reinstalling the node. Most everything you need for your OSDs will be in there.
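A minimal sketch of what to grab before the reinstall (default paths assumed;
the bootstrap keyring mentioned below lives under /var/lib/ceph):

# save the config and keyrings somewhere off the node before wiping the OS
tar czf /root/ceph-node-backup.tgz /etc/ceph /var/lib/ceph/bootstrap-osd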

On Tue, Oct 30, 2018, 6:01 AM Luiz Gustavo Tonello <
gustavo.tone...@gmail.com> wrote:

> Thank you guys,
>
> It'll save me a bunch of time, because the process to reallocate OSD files
> is not so fast. :-)
>
>
>
> On Tue, Oct 30, 2018 at 6:15 AM Alexandru Cucu  wrote:
>
>> Don't forget about the cephx keyring if you are using cephx ;)
>>
>> Usually sits in:
>> /var/lib/ceph/bootstrap-osd/ceph.keyring
>>
>> ---
>> Alex
>>
>> On Tue, Oct 30, 2018 at 4:48 AM David Turner 
>> wrote:
>> >
>> > Set noout, reinstall the OS without wiping the OSDs (including any
>> journal partitions and maintaining any dmcrypt keys if you have
>> encryption), install ceph, make sure the ceph.conf file is correct,
>> start the OSDs, and unset noout once they're back up and in. All of the data the
>> OSD needs to start is on the OSD itself.
>> >
>> > On Mon, Oct 29, 2018, 6:52 PM Luiz Gustavo Tonello <
>> gustavo.tone...@gmail.com> wrote:
>> >>
>> >> Hi list,
>> >>
>> >> I have a situation that I need to reinstall the O.S. of a single node
>> in my OSD cluster.
>> >> This node has 4 OSDs configured, each one has ~4 TB used.
>> >>
>> >> The way that I'm thinking to proceed is to put OSD down (one each
>> time), stop the OSD, reinstall the O.S., and finally add the OSDs again.
>> >>
>> >> But I want to know if there's a way to do this in a more simple
>> process, maybe put OSD in maintenance (noout), reinstall the O.S. without
>> formatting my Storage volumes, install CEPH again and enable OSDs again.
>> >>
>> >> There's a way like these?
>> >>
>> >> I'm running CEPH Jewel.
>> >>
>> >> Best,
>> >> --
>> >> Luiz Gustavo P Tonello.
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> --
> Luiz Gustavo P Tonello.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD node reinstallation

2018-10-29 Thread David Turner
Set noout, reinstall the OS without wiping the OSDs (including any journal
partitions and maintaining any dmcrypt keys if you have encryption),
install ceph, make sure the ceph.conf file is correct, start the OSDs, and unset
noout once they're back up and in. All of the data the OSD needs to start
is on the OSD itself.
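The cluster-side part of that, as a rough sketch (the OS reinstall itself
happens between the two flag changes; ceph-osd.target assumes a systemd-based
release):

ceph osd set noout            # stop CRUSH from rebalancing while the node is offline
# ... reinstall the OS without touching the OSD data/journal disks,
#     reinstall the ceph packages, restore /etc/ceph ...
systemctl start ceph-osd.target
ceph osd unset noout          # once the OSDs are back up and in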

On Mon, Oct 29, 2018, 6:52 PM Luiz Gustavo Tonello <
gustavo.tone...@gmail.com> wrote:

> Hi list,
>
> I have a situation that I need to reinstall the O.S. of a single node in
> my OSD cluster.
> This node has 4 OSDs configured, each one has ~4 TB used.
>
> The way that I'm thinking to proceed is to put OSD down (one each time),
> stop the OSD, reinstall the O.S., and finally add the OSDs again.
>
> But I want to know if there's a way to do this in a more simple process,
> maybe put OSD in maintenance (noout), reinstall the O.S. without formatting
> my Storage volumes, install CEPH again and enable OSDs again.
>
> There's a way like these?
>
> I'm running CEPH Jewel.
>
> Best,
> --
> Luiz Gustavo P Tonello.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reducing min_size on erasure coded pool may allow recovery ?

2018-10-29 Thread David Turner
min_size should be at least k+1 for EC. There are times to use k for
emergencies like you had. I would suggest setting it back to 3 once you're
back to healthy.

As far as why you needed to reduce min_size, my guess would be that
recovery would have happened as long as k copies were up. Were the PGs
refusing to backfill or had they just not backfilled yet?
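For reference, setting it back once recovery finishes would look something
like this (the pool name is a placeholder; 3 is k+1 for a k=2, m=2 pool):

ceph osd pool set my-ec-pool min_size 3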

On Mon, Oct 29, 2018, 9:24 PM Chad W Seys  wrote:

> Hi all,
>Recently our cluster lost a drive and a node (3 drives) at the same
> time.  Our erasure coded pools are all k2m2, so if all is working
> correctly no data is lost.
>However, there were 4 PGs that stayed "incomplete" until I finally
> took the suggestion in 'ceph health detail' to reduce min_size . (Thanks
> for the hint!)  I'm not sure what it was (likely 3), but setting it to 2
> caused all PGs to become active (though degraded) and the cluster is on
> path to recovering fully.
>
>In replicated pools, would not ceph create replicas without the need
> to reduce min_size?  It seems odd to not recover automatically if
> possible.  Could someone explain what was going on there?
>
>Also, how to decide what min_size should be?
>
> Thanks!
> Chad.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need advise on proper cluster reweighing

2018-10-28 Thread David Turner
Which version of Ceph are you running? Do you have any kernel clients? If
yes, which kernel version? These questions are all leading to see if
you can enable the Luminous/Mimic mgr module balancer with upmap. If you
can, it is hands down the best way to balance your cluster.

On Sat, Oct 27, 2018, 9:14 PM Alex Litvak 
wrote:

> I have a cluster using 2 roots.  I attempted to reweigh osds under the
> "default" root used by pool rbd, cephfs-data, cephfs-meta using Cern
> script: crush-reweight-by-utilization.py.  I ran it first and it showed
> 4 candidates (per script default ), it shows final weight and single
> step movements.
>
>   ./crush-reweight-by-utilization.py --pool=rbd
> osd.36 (1.273109 >= 0.675607) [1.00 -> 0.99]
> osd.0 (1.243042 >= 0.675607) [1.00 -> 0.99]
> osd.2 (1.231539 >= 0.675607) [1.00 -> 0.99]
> osd.19 (1.228613 >= 0.675607) [1.00 -> 0.99]
>
> Script advises on all osds in the pool (36 of them if mentioned, see
> below).  Is it safe to take osd.36 as only one osd and reweigh it first?
> I attempted to do it and each step caused some more pgs stuck in
> active+unmapped mode.  I didn't proceed to the end at the moment, but if
> I do continue with osd.36 should pgs distribute correctly or my
> assumption is wrong?  Should I use some other approach, i.e. reweighing
> all osds in the pool or recalculating the weights completely?
>
> This is my first attempt to re-balance cluster properly so any clues are
> appreciated.
>
> Below are various diagnostics in anticipation of questions.
>
> Thank you in advance
>
> ./crush-reweight-by-utilization.py --pool=rbd --num-osds=36
> osd.36 (1.273079 >= 0.675594) [1.00 -> 0.99]
> osd.0 (1.243019 >= 0.675594) [1.00 -> 0.99]
> osd.2 (1.231513 >= 0.675594) [1.00 -> 0.99]
> osd.19 (1.228569 >= 0.675594) [1.00 -> 0.99]
> osd.16 (1.228071 >= 0.675594) [1.00 -> 0.99]
> osd.46 (1.220588 >= 0.675594) [1.00 -> 0.99]
> osd.23 (1.215887 >= 0.675594) [1.00 -> 0.99]
> osd.7 (1.204189 >= 0.675594) [1.00 -> 0.99]
> osd.10 (1.202385 >= 0.675594) [1.00 -> 0.99]
> osd.40 (1.186002 >= 0.675594) [1.00 -> 0.99]
> osd.43 (1.180218 >= 0.675594) [1.00 -> 0.99]
> osd.21 (1.180050 >= 0.675594) [1.00 -> 0.99]
> osd.15 (1.162953 >= 0.675594) [1.00 -> 0.99]
> osd.1 (1.155985 >= 0.675594) [1.00 -> 0.99]
> osd.44 (1.151496 >= 0.675594) [1.00 -> 0.99]
> osd.39 (1.149947 >= 0.675594) [1.00 -> 0.99]
> osd.22 (1.148013 >= 0.675594) [1.00 -> 0.99]
> osd.8 (1.143455 >= 0.675594) [1.00 -> 0.99]
> osd.37 (1.130054 >= 0.675594) [1.00 -> 0.99]
> osd.18 (1.126777 >= 0.675594) [1.00 -> 0.99]
> osd.17 (1.125752 >= 0.675594) [1.00 -> 0.99]
> osd.9 (1.124679 >= 0.675594) [1.00 -> 0.99]
> osd.42 (1.110069 >= 0.675594) [1.00 -> 0.99]
> osd.4 (1.108986 >= 0.675594) [1.00 -> 0.99]
> osd.45 (1.102144 >= 0.675594) [1.00 -> 0.99]
> osd.12 (1.085402 >= 0.675594) [1.00 -> 0.99]
> osd.38 (1.083698 >= 0.675594) [1.00 -> 0.99]
> osd.5 (1.076138 >= 0.675594) [1.00 -> 0.99]
> osd.11 (1.075955 >= 0.675594) [1.00 -> 0.99]
> osd.13 (1.070176 >= 0.675594) [1.00 -> 0.99]
> osd.20 (1.063759 >= 0.675594) [1.00 -> 0.99]
> osd.14 (1.052357 >= 0.675594) [1.00 -> 0.99]
> osd.41 (1.035255 >= 0.675594) [1.00 -> 0.99]
> osd.3 (1.013664 >= 0.675594) [1.00 -> 0.99]
> osd.47 (1.011428 >= 0.675594) [1.00 -> 0.99]
> osd.6 (1.000170 >= 0.675594) [1.00 -> 0.99]
>
> # ceph osd df tree
> ID  WEIGHT   REWEIGHT SIZE   USEAVAIL  %USE  VAR  TYPE NAME
>
> -10 18.0- 20100G  7127G 12973G 35.46 0.63 root 12g
>
>   -9 18.0- 20100G  7127G 12973G 35.46 0.63 datacenter
> la-12g
>   -5  6.0-  6700G  2375G  4324G 35.45 0.63 host
> oss4-la-12g
>   24  1.0  1.0  1116G   409G   706G 36.71 0.65
> osd.24
>   26  1.0  1.0  1116G   373G   743G 33.43 0.59
> osd.26
>   28  1.0  1.0  1116G   414G   702G 37.10 0.66
> osd.28
>   30  1.0  1.0  1116G   453G   663G 40.60 0.72
> osd.30
>   32  1.0  1.0  1116G   342G   774G 30.65 0.54
> osd.32
>   34  1.0  1.0  1116G   382G   734G 34.23 0.61
> osd.34
>   -6  6.0-  6700G  2375G  4324G 35.45 0.63 host
> oss5-la-12g
>   25  1.0  1.0  1116G   383G   733G 34.32 0.61
> osd.25
>   27  1.0  1.0  1116G   388G   728G 34.75 0.62
> osd.27
>   29  1.0  1.0  1116G   381G   734G 34.19 0.61
> osd.29
>   31  1.0  1.0  1116G   424G   692G 38.00 0.67
> osd.31
>   33  1.0  1.0  1116G   418G   698G 37.46 0.67
> osd.33
>   35  1.0  1.0  1116G   379G   736G 34.02 0.60
> osd.35
>   -7  6.0-  6700G  2376G  4323G 35.47 0.63 host
> oss6-la-12g
>   48  1.0  1.0  1116G   410G   705G 36.79 0.65
> osd.48
>   49  1.0  1.0  1116G 

Re: [ceph-users] Verifying the location of the wal

2018-10-28 Thread David Turner
If you had specified a separate location for the wal it would show up there. If
there is no entry for the wal, then it is using the same setting as the db.
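For example, something like this (osd.0 is a placeholder id):

ceph osd metadata 0 | grep -E 'bluefs_wal|bluefs_db|bluestore_bdev'
# no bluefs_wal_* entries means the WAL shares the DB device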

On Sun, Oct 28, 2018, 9:26 PM Robert Stanford 
wrote:

>
>  Mehmet: it doesn't look like wal is mentioned in the osd metadata.  I see
> bluefs slow, bluestore bdev, and bluefs db mentioned only.
>
> On Sun, Oct 28, 2018 at 1:48 PM  wrote:
>
>> IIRC there is a Command like
>>
>> Ceph osd Metadata
>>
>> Where you should be able to find Information like this
>>
>> Hab
>> - Mehmet
>>
>> Am 21. Oktober 2018 19:39:58 MESZ schrieb Robert Stanford <
>> rstanford8...@gmail.com>:
>>>
>>>
>>>  I did exactly this when creating my osds, and found that my total
>>> utilization is about the same as the sum of the utilization of the pools,
>>> plus (wal size * number osds).  So it looks like my wals are actually
>>> sharing OSDs.  But I'd like to be 100% sure... so I am seeking a way to
>>> find out
>>>
>>> On Sun, Oct 21, 2018 at 11:13 AM Serkan Çoban 
>>> wrote:
>>>
 wal and db device will be same if you use just db path during osd
 creation. i do not know how to verify this with ceph commands.
 On Sun, Oct 21, 2018 at 4:17 PM Robert Stanford <
 rstanford8...@gmail.com> wrote:
 >
 >
 >  Thanks Serkan.  I am using --path instead of --dev (dev won't work
 because I'm using VGs/LVs).  The output shows block and block.db, but
 nothing about wal.db.  How can I learn where my wal lives?
 >
 >
 >
 >
 > On Sun, Oct 21, 2018 at 12:43 AM Serkan Çoban 
 wrote:
 >>
 >> ceph-bluestore-tool can show you the disk labels.
 >> ceph-bluestore-tool show-label --dev /dev/sda1
 >> On Sun, Oct 21, 2018 at 1:29 AM Robert Stanford <
 rstanford8...@gmail.com> wrote:
 >> >
 >> >
 >> >  An email from this list stated that the wal would be created in
 the same place as the db, if the db were specified when running ceph-volume
 lvm create, and the db were specified on that command line.  I followed
 those instructions and like the other person writing to this list today, I
 was surprised to find that my cluster usage was higher than the total of
 pools (higher by an amount the same as all my wal sizes on each node
 combined).  This leads me to think my wal actually is on the data disk and
 not the ssd I specified the db should go to.
 >> >
 >> >  How can I verify which disk the wal is on, from the command
 line?  I've searched the net and not come up with anything.
 >> >
 >> >  Thanks and regards
 >> >  R
 >> >
 >> > ___
 >> > ceph-users mailing list
 >> > ceph-users@lists.ceph.com
 >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate/convert replicated pool to EC?

2018-10-26 Thread David Turner
It is indeed adding a placement target, not removing or replacing the
pool. The get/put wouldn't be a rados or even a ceph command; you would do it
through an S3 client.

On Fri, Oct 26, 2018, 9:38 AM Matthew Vernon  wrote:

> Hi,
>
> On 26/10/2018 12:38, Alexandru Cucu wrote:
>
> > Have a look at this article:>
> https://ceph.com/geen-categorie/ceph-pool-migration/
>
> Thanks; that all looks pretty hairy especially for a large pool (ceph df
> says 1353T / 428,547,935 objects)...
>
> ...so something a bit more controlled/gradual and less
> manual-error-prone would make me happier!
>
> Regards,
>
> Matthew
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW: move bucket from one placement to another

2018-10-25 Thread David Turner
Resharding a bucket won't affect the data in the bucket.  After you change
the placement for a bucket, you could update where the data is by
re-writing all of the data in the bucket.
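One hedged way to do that per object is an S3 copy-in-place, e.g. with awscli
(the bucket, key, and endpoint are placeholders, and you would loop this over
a listing of the bucket; it is worth confirming on a test bucket that RGW
really relocates the data on a self-copy):

aws s3api copy-object \
    --endpoint-url http://rgw.example.com \
    --bucket mybucket --key path/to/object \
    --copy-source mybucket/path/to/object \
    --metadata-directive REPLACE    # force a real rewrite instead of a metadata-only copy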

On Thu, Oct 25, 2018 at 8:48 AM Jacek Suchenia 
wrote:

> Hi
>
> We have a bucket created with LocationConstraint setting, so
> explicit_placement entries are filled in a bucket. Is there a way to move
> it to other placement?
> I was thinking about editing that data and run manual resharding, but I
> don't know if it's a correct way of solving this problem.
>
> Jacek
>
> --
> Jacek Suchenia
> jacek.suche...@gmail.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate/convert replicated pool to EC?

2018-10-25 Thread David Turner
There are no tools to migrate in either direction between EC and Replica.
You can't even migrate an EC pool to a new EC profile.

With RGW you can create a new data pool and new objects will be written to
the new pool. If your objects have a lifecycle, then eventually you'll move
entirely to the new pool over time. Otherwise you can get there by rewriting all of
the objects manually.
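
If it helps, adding a separate placement target that points at an EC data pool
is roughly this (just a sketch; the placement id and pool names are made up,
and the zonegroup/zone are usually both "default" on a single-zone setup):

  radosgw-admin zonegroup placement add --rgw-zonegroup default \
      --placement-id ec-placement
  radosgw-admin zone placement add --rgw-zone default \
      --placement-id ec-placement \
      --data-pool default.rgw.buckets.data.ec \
      --index-pool default.rgw.buckets.index \
      --data-extra-pool default.rgw.buckets.non-ec
  # restart the radosgw daemons (and commit the period if you use realms),
  # then new buckets created against that placement will write to the EC pool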

On Thu, Oct 25, 2018, 12:30 PM Matthew Vernon  wrote:

> Hi,
>
> I thought I'd seen that it was possible to migrate a replicated pool to
> being erasure-coded (but not the converse); but I'm failing to find
> anything that says _how_.
>
> Have I misremembered? Can you migrate a replicated pool to EC? (if so,
> how?)
>
> ...our use case is moving our S3 pool which is quite large, so if we can
> convert in-place that would be ideal...
>
> Thanks,
>
> Matthew
>
>
> --
>  The Wellcome Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE
> .
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread David Turner
I don't have enough disk space on the nvme. The DB would overflow before I
reached 25% utilization in the cluster. The disks are 10TB spinners and
would need a minimum of 100 GB of DB space based on early testing.
The official docs recommend a 400GB DB for a disk of this size. I don't have
enough flash space for that in the 2x nvme disks in those servers.  Hence I
put the WAL on the nvmes and left the DB on the data disk, where the DB would
have spilled over to almost immediately anyway.
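
For reference, putting just the WAL on the NVMe looks roughly like this with
ceph-volume (device names are placeholders):

  ceph-volume lvm create --bluestore --data /dev/sdb --block.wal /dev/nvme0n1p1
  # adding --block.db /dev/nvme0n1p2 as well would put the DB on flash too,
  # which is what I don't have the space for here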

On Mon, Oct 22, 2018, 6:55 PM solarflow99  wrote:

> Why didn't you just install the DB + WAL on the NVMe?  Is this "data disk"
> still an ssd?
>
>
>
> On Mon, Oct 22, 2018 at 3:34 PM David Turner 
> wrote:
>
>> And by the data disk I mean that I didn't specify a location for the DB
>> partition.
>>
>> On Mon, Oct 22, 2018 at 4:06 PM David Turner 
>> wrote:
>>
>>> Track down where it says they point to?  Does it match what you expect?
>>> It does for me.  I have my DB on my data disk and my WAL on a separate NVMe.
>>>
>>> On Mon, Oct 22, 2018 at 3:21 PM Robert Stanford 
>>> wrote:
>>>
>>>>
>>>>  David - is it ensured that wal and db both live where the symlink
>>>> block.db points?  I assumed that was a symlink for the db, but not necessarily
>>>> for the wal, because it can live in a place different than the db.
>>>>
>>>> On Mon, Oct 22, 2018 at 2:18 PM David Turner 
>>>> wrote:
>>>>
>>>>> You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and look
>>>>> at where the symlinks for block and block.wal point to.
>>>>>
>>>>> On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford <
>>>>> rstanford8...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>>  That's what they say, however I did exactly this and my cluster
>>>>>> utilization is higher than the total pool utilization by about the number
>>>>>> of OSDs * wal size.  I want to verify that the wal is on the SSDs too but
>>>>>> I've asked here and no one seems to know a way to verify this.  Do you?
>>>>>>
>>>>>>  Thank you, R
>>>>>>
>>>>>> On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar 
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> If you specify a db on ssd and data on hdd and not explicitly
>>>>>>> specify a
>>>>>>> device for wal, wal will be placed on same ssd partition with db.
>>>>>>> Placing only wal on ssd or creating separate devices for wal and db
>>>>>>> are
>>>>>>> less common setups.
>>>>>>>
>>>>>>> /Maged
>>>>>>>
>>>>>>> On 22/10/18 09:03, Fyodor Ustinov wrote:
>>>>>>> > Hi!
>>>>>>> >
>>>>>>> > For sharing SSD between WAL and DB what should be placed on SSD?
>>>>>>> WAL or DB?
>>>>>>> >
>>>>>>> > - Original Message -
>>>>>>> > From: "Maged Mokhtar" 
>>>>>>> > To: "ceph-users" 
>>>>>>> > Sent: Saturday, 20 October, 2018 20:05:44
>>>>>>> > Subject: Re: [ceph-users] Drive for Wal and Db
>>>>>>> >
>>>>>>> > On 20/10/18 18:57, Robert Stanford wrote:
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > Our OSDs are BlueStore and are on regular hard drives. Each OSD
>>>>>>> has a partition on an SSD for its DB. Wal is on the regular hard drives.
>>>>>>> Should I move the wal to share the SSD with the DB?
>>>>>>> >
>>>>>>> > Regards
>>>>>>> > R
>>>>>>> >
>>>>>>> >
>>>>>>> > ___
>>>>>>> > ceph-users mailing list [ mailto:ceph-users@lists.ceph.com |
>>>>>>> ceph-users@lists.ceph.com ] [
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]
>>>>>>> >
>>>>>>> > you should put wal on the faster device, wal and db could share
>>>>>>> the same ssd partition,
>>>>>>> >
>>>>>>> > Maged
>>>>>>> >
>>>>>>> > ___
>>>>>>> > ceph-users mailing list
>>>>>>> > ceph-users@lists.ceph.com
>>>>>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>> > ___
>>>>>>> > ceph-users mailing list
>>>>>>> > ceph-users@lists.ceph.com
>>>>>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>
>>>>>>> ___
>>>>>>> ceph-users mailing list
>>>>>>> ceph-users@lists.ceph.com
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>
>>>>>> ___
>>>>>> ceph-users mailing list
>>>>>> ceph-users@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread David Turner
No, it's exactly what I told you it was.  "bluestore_bdev_partition_path"
is the data path.  In all of my scenarios my DB and Data are on the same
partition, hence mine are the same.  Your DB and WAL are on a different
partition from your Data... so your DB partition is different... Whatever
your misunderstanding is about where/why your cluster's usage is
higher/different than you think it is, it has nothing to do with where your
DB and WAL partitions are.

There is an overhead just for having a FS on the disk.  In this case that FS
is bluestore.  You can look at [1] this ML thread from a while ago where I
mentioned that a brand new cluster with no data in it, and with the WAL
partitions on separate disks, was using about 1.1GB of data per OSD.

[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025246.html
On Mon, Oct 22, 2018 at 4:51 PM Robert Stanford 
wrote:

>
>  That's very helpful, thanks.  In your first case above your
> bluefs_db_partition_path and bluestore_bdev_partition path are the same.
> Though I have a different data and db drive, mine are different.  Might
> this explain something?  My root concern is that there is more utilization
> on the cluster than what's in the pools, the excess equal to about wal size
> * number of osds...
>
> On Mon, Oct 22, 2018 at 3:35 PM David Turner 
> wrote:
>
>> My DB doesn't have a specific partition anywhere, but there's still a
>> symlink for it to the data partition.  On my home cluster with all DB, WAL,
>> and Data on the same disk without any partitions specified there is a block
>> symlink but no block.wal symlink.
>>
>> For the cluster with a specific WAL partition, but no DB partition, my
>> OSD paths looks like [1] this.  For my cluster with everything on the same
>> disk, my OSD paths look like [2] this.  Unless you have a specific path for
>> "bluefs_wal_partition_path" then it's going to find itself on the same
>> partition as the db.
>>
>> [1] $ ceph osd metadata 5 | grep path
>> "bluefs_db_partition_path": "/dev/dm-29",
>> "bluefs_wal_partition_path": "/dev/dm-41",
>> "bluestore_bdev_partition_path": "/dev/dm-29",
>>
>> [2] $ ceph osd metadata 5 | grep path
>> "bluefs_db_partition_path": "/dev/dm-5",
>> "bluestore_bdev_partition_path": "/dev/dm-5",
>>
>> On Mon, Oct 22, 2018 at 4:21 PM Robert Stanford 
>> wrote:
>>
>>>
>>>  Let me add, I have no block.wal file (which the docs suggest should be
>>> there).
>>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
>>>
>>> On Mon, Oct 22, 2018 at 3:13 PM Robert Stanford 
>>> wrote:
>>>
>>>>
>>>>  We're out of sync, I think.  You have your DB on your data disk so
>>>> your block.db symlink points to that disk, right?  There is however no wal
>>>> symlink?  So how would you verify your WAL actually lived on your NVMe?
>>>>
>>>> On Mon, Oct 22, 2018 at 3:07 PM David Turner 
>>>> wrote:
>>>>
>>>>> And by the data disk I mean that I didn't specify a location for the
>>>>> DB partition.
>>>>>
>>>>> On Mon, Oct 22, 2018 at 4:06 PM David Turner 
>>>>> wrote:
>>>>>
>>>>>> Track down where it says they point to?  Does it match what you
>>>>>> expect?  It does for me.  I have my DB on my data disk and my WAL on a
>>>>>> separate NVMe.
>>>>>>
>>>>>> On Mon, Oct 22, 2018 at 3:21 PM Robert Stanford <
>>>>>> rstanford8...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>  David - is it ensured that wal and db both live where the symlink
>>>>>>> block.db points?  I assumed that was a symlink for the db, but not
>>>>>>> necessarily
>>>>>>> for the wal, because it can live in a place different than the db.
>>>>>>>
>>>>>>> On Mon, Oct 22, 2018 at 2:18 PM David Turner 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and
>>>>>>>> look at where the symlinks for block and block.wal point to.
>>>>>>>>
>>>>>>>> On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford <
>>>>>>>> rstanford8...@gmail.com> wrote:
>>>&g

Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread David Turner
My DB doesn't have a specific partition anywhere, but there's still a
symlink for it to the data partition.  On my home cluster with all DB, WAL,
and Data on the same disk without any partitions specified there is a block
symlink but no block.wal symlink.

For the cluster with a specific WAL partition, but no DB partition, my OSD
paths looks like [1] this.  For my cluster with everything on the same
disk, my OSD paths look like [2] this.  Unless you have a specific path for
"bluefs_wal_partition_path" then it's going to find itself on the same
partition as the db.

[1] $ ceph osd metadata 5 | grep path
"bluefs_db_partition_path": "/dev/dm-29",
"bluefs_wal_partition_path": "/dev/dm-41",
"bluestore_bdev_partition_path": "/dev/dm-29",

[2] $ ceph osd metadata 5 | grep path
"bluefs_db_partition_path": "/dev/dm-5",
"bluestore_bdev_partition_path": "/dev/dm-5",

On Mon, Oct 22, 2018 at 4:21 PM Robert Stanford 
wrote:

>
>  Let me add, I have no block.wal file (which the docs suggest should be
> there).
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
>
> On Mon, Oct 22, 2018 at 3:13 PM Robert Stanford 
> wrote:
>
>>
>>  We're out of sync, I think.  You have your DB on your data disk so your
>> block.db symlink points to that disk, right?  There is however no wal
>> symlink?  So how would you verify your WAL actually lived on your NVMe?
>>
>> On Mon, Oct 22, 2018 at 3:07 PM David Turner 
>> wrote:
>>
>>> And by the data disk I mean that I didn't specify a location for the DB
>>> partition.
>>>
>>> On Mon, Oct 22, 2018 at 4:06 PM David Turner 
>>> wrote:
>>>
>>>> Track down where it says they point to?  Does it match what you
>>>> expect?  It does for me.  I have my DB on my data disk and my WAL on a
>>>> separate NVMe.
>>>>
>>>> On Mon, Oct 22, 2018 at 3:21 PM Robert Stanford <
>>>> rstanford8...@gmail.com> wrote:
>>>>
>>>>>
>>>>>  David - is it ensured that wal and db both live where the symlink
>>>>> block.db points?  I assumed that was a symlink for the db, but not necessarily
>>>>> for the wal, because it can live in a place different than the db.
>>>>>
>>>>> On Mon, Oct 22, 2018 at 2:18 PM David Turner 
>>>>> wrote:
>>>>>
>>>>>> You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and look
>>>>>> at where the symlinks for block and block.wal point to.
>>>>>>
>>>>>> On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford <
>>>>>> rstanford8...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>  That's what they say, however I did exactly this and my cluster
>>>>>>> utilization is higher than the total pool utilization by about the 
>>>>>>> number
>>>>>>> of OSDs * wal size.  I want to verify that the wal is on the SSDs too 
>>>>>>> but
>>>>>>> I've asked here and no one seems to know a way to verify this.  Do you?
>>>>>>>
>>>>>>>  Thank you, R
>>>>>>>
>>>>>>> On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar 
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> If you specify a db on ssd and data on hdd and not explicitly
>>>>>>>> specify a
>>>>>>>> device for wal, wal will be placed on same ssd partition with db.
>>>>>>>> Placing only wal on ssd or creating separate devices for wal and db
>>>>>>>> are
>>>>>>>> less common setups.
>>>>>>>>
>>>>>>>> /Maged
>>>>>>>>
>>>>>>>> On 22/10/18 09:03, Fyodor Ustinov wrote:
>>>>>>>> > Hi!
>>>>>>>> >
>>>>>>>> > For sharing SSD between WAL and DB what should be placed on SSD?
>>>>>>>> WAL or DB?
>>>>>>>> >
>>>>>>>> > - Original Message -
>>>>>>>> > From: "Maged Mokhtar" 
>>>>>>>> > To: "ceph-users" 
>>>>>>>> > Sent: Saturday, 20 October, 2018 20:05:44
>>>>>

Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread David Turner
Track down where it says they point to?  Does it match what you expect?  It
does for me.  I have my DB on my data disk and my WAL on a separate NVMe.

On Mon, Oct 22, 2018 at 3:21 PM Robert Stanford 
wrote:

>
>  David - is it ensured that wal and db both live where the symlink
> block.db points?  I assumed that was a symlink for the db, but not necessarily
> for the wal, because it can live in a place different than the db.
>
> On Mon, Oct 22, 2018 at 2:18 PM David Turner 
> wrote:
>
>> You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and look at
>> where the symlinks for block and block.wal point to.
>>
>> On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford 
>> wrote:
>>
>>>
>>>  That's what they say, however I did exactly this and my cluster
>>> utilization is higher than the total pool utilization by about the number
>>> of OSDs * wal size.  I want to verify that the wal is on the SSDs too but
>>> I've asked here and no one seems to know a way to verify this.  Do you?
>>>
>>>  Thank you, R
>>>
>>> On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar 
>>> wrote:
>>>
>>>>
>>>> If you specify a db on ssd and data on hdd and not explicitly specify a
>>>> device for wal, wal will be placed on same ssd partition with db.
>>>> Placing only wal on ssd or creating separate devices for wal and db are
>>>> less common setups.
>>>>
>>>> /Maged
>>>>
>>>> On 22/10/18 09:03, Fyodor Ustinov wrote:
>>>> > Hi!
>>>> >
>>>> > For sharing SSD between WAL and DB what should be placed on SSD? WAL
>>>> or DB?
>>>> >
>>>> > - Original Message -
>>>> > From: "Maged Mokhtar" 
>>>> > To: "ceph-users" 
>>>> > Sent: Saturday, 20 October, 2018 20:05:44
>>>> > Subject: Re: [ceph-users] Drive for Wal and Db
>>>> >
>>>> > On 20/10/18 18:57, Robert Stanford wrote:
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > Our OSDs are BlueStore and are on regular hard drives. Each OSD has a
>>>> partition on an SSD for its DB. Wal is on the regular hard drives. Should I
>>>> move the wal to share the SSD with the DB?
>>>> >
>>>> > Regards
>>>> > R
>>>> >
>>>> >
>>>> > ___
>>>> > ceph-users mailing list [ mailto:ceph-users@lists.ceph.com |
>>>> ceph-users@lists.ceph.com ] [
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]
>>>> >
>>>> > you should put wal on the faster device, wal and db could share the
>>>> same ssd partition,
>>>> >
>>>> > Maged
>>>> >
>>>> > ___
>>>> > ceph-users mailing list
>>>> > ceph-users@lists.ceph.com
>>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> > ___
>>>> > ceph-users mailing list
>>>> > ceph-users@lists.ceph.com
>>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread David Turner
And by the data disk I mean that I didn't specify a location for the DB
partition.

On Mon, Oct 22, 2018 at 4:06 PM David Turner  wrote:

> Track down where it says they point to?  Does it match what you expect?
> It does for me.  I have my DB on my data disk and my WAL on a separate NVMe.
>
> On Mon, Oct 22, 2018 at 3:21 PM Robert Stanford 
> wrote:
>
>>
>>  David - is it ensured that wal and db both live where the symlink
>> block.db points?  I assumed that was a symlink for the db, but not necessarily
>> for the wal, because it can live in a place different than the db.
>>
>> On Mon, Oct 22, 2018 at 2:18 PM David Turner 
>> wrote:
>>
>>> You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and look at
>>> where the symlinks for block and block.wal point to.
>>>
>>> On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford <
>>> rstanford8...@gmail.com> wrote:
>>>
>>>>
>>>>  That's what they say, however I did exactly this and my cluster
>>>> utilization is higher than the total pool utilization by about the number
>>>> of OSDs * wal size.  I want to verify that the wal is on the SSDs too but
>>>> I've asked here and no one seems to know a way to verify this.  Do you?
>>>>
>>>>  Thank you, R
>>>>
>>>> On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar 
>>>> wrote:
>>>>
>>>>>
>>>>> If you specify a db on ssd and data on hdd and not explicitly specify
>>>>> a
>>>>> device for wal, wal will be placed on same ssd partition with db.
>>>>> Placing only wal on ssd or creating separate devices for wal and db
>>>>> are
>>>>> less common setups.
>>>>>
>>>>> /Maged
>>>>>
>>>>> On 22/10/18 09:03, Fyodor Ustinov wrote:
>>>>> > Hi!
>>>>> >
>>>>> > For sharing SSD between WAL and DB what should be placed on SSD? WAL
>>>>> or DB?
>>>>> >
>>>>> > - Original Message -
>>>>> > From: "Maged Mokhtar" 
>>>>> > To: "ceph-users" 
>>>>> > Sent: Saturday, 20 October, 2018 20:05:44
>>>>> > Subject: Re: [ceph-users] Drive for Wal and Db
>>>>> >
>>>>> > On 20/10/18 18:57, Robert Stanford wrote:
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > Our OSDs are BlueStore and are on regular hard drives. Each OSD has
>>>>> a partition on an SSD for its DB. Wal is on the regular hard drives. 
>>>>> Should
>>>>> I move the wal to share the SSD with the DB?
>>>>> >
>>>>> > Regards
>>>>> > R
>>>>> >
>>>>> >
>>>>> > ___
>>>>> > ceph-users mailing list [ mailto:ceph-users@lists.ceph.com |
>>>>> ceph-users@lists.ceph.com ] [
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]
>>>>> >
>>>>> > you should put wal on the faster device, wal and db could share the
>>>>> same ssd partition,
>>>>> >
>>>>> > Maged
>>>>> >
>>>>> > ___
>>>>> > ceph-users mailing list
>>>>> > ceph-users@lists.ceph.com
>>>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>> > ___
>>>>> > ceph-users mailing list
>>>>> > ceph-users@lists.ceph.com
>>>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>>> ___
>>>>> ceph-users mailing list
>>>>> ceph-users@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread David Turner
You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and look at
where the symlinks for block and block.wal point to.
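
Something like this shows it at a glance (the osd number is just an example):

  ls -l /var/lib/ceph/osd/ceph-0/block*
  # block     -> the data device (and the DB, if you didn't split it out)
  # block.wal -> only present when the WAL was given its own device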

On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford 
wrote:

>
>  That's what they say, however I did exactly this and my cluster
> utilization is higher than the total pool utilization by about the number
> of OSDs * wal size.  I want to verify that the wal is on the SSDs too but
> I've asked here and no one seems to know a way to verify this.  Do you?
>
>  Thank you, R
>
> On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar 
> wrote:
>
>>
>> If you specify a db on ssd and data on hdd and not explicitly specify a
>> device for wal, wal will be placed on same ssd partition with db.
>> Placing only wal on ssd or creating separate devices for wal and db are
>> less common setups.
>>
>> /Maged
>>
>> On 22/10/18 09:03, Fyodor Ustinov wrote:
>> > Hi!
>> >
>> > For sharing SSD between WAL and DB what should be placed on SSD? WAL or
>> DB?
>> >
>> > - Original Message -
>> > From: "Maged Mokhtar" 
>> > To: "ceph-users" 
>> > Sent: Saturday, 20 October, 2018 20:05:44
>> > Subject: Re: [ceph-users] Drive for Wal and Db
>> >
>> > On 20/10/18 18:57, Robert Stanford wrote:
>> >
>> >
>> >
>> >
>> > Our OSDs are BlueStore and are on regular hard drives. Each OSD has a
>> partition on an SSD for its DB. Wal is on the regular hard drives. Should I
>> move the wal to share the SSD with the DB?
>> >
>> > Regards
>> > R
>> >
>> >
>> > ___
>> > ceph-users mailing list [ mailto:ceph-users@lists.ceph.com |
>> ceph-users@lists.ceph.com ] [
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]
>> >
>> > you should put wal on the faster device, wal and db could share the
>> same ssd partition,
>> >
>> > Maged
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df space usage confusion - balancing needed?

2018-10-22 Thread David Turner
I haven't had crush-compat do anything helpful for balancing my clusters.
upmap has been amazing and balanced my clusters far better than anything
else I've ever seen.  I would go so far as to say that upmap can achieve a
perfect balance.

It seems to evenly distribute the PGs for each pool onto all OSDs that pool
is on.  It does that with a maximum difference of 1 PG, depending on how
evenly the number of PGs divides by the number of OSDs you have.  As a
side note, your OSD CRUSH weights should be the default weights for their
size for upmap to be as effective as it can be.
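
For the record, this is roughly how I turn it on (only after making sure every
client really is Luminous or newer):

  ceph osd set-require-min-compat-client luminous
  ceph mgr module enable balancer      # if it isn't enabled already
  ceph balancer mode upmap
  ceph balancer on
  ceph balancer status                 # check the current mode and plans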

On Sat, Oct 20, 2018 at 3:58 PM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Ok, I'll try out the balancer end of the upcoming week then (after we've
> fixed a HW-issue with one of our mons
> and the cooling system).
>
> Until then, any further advice and whether upmap is recommended over
> crush-compat (all clients are Luminous) are welcome ;-).
>
> Cheers,
> Oliver
>
> Am 20.10.18 um 21:26 schrieb Janne Johansson:
> > Ok, can't say "why" then, I'd reweigh them somewhat to even it out,
> > 1.22 -vs- 0.74 in variance is a lot, so either a balancer plugin for
> > the MGRs, a script or just a few manual tweaks might be in order.
> >
> > Den lör 20 okt. 2018 kl 21:02 skrev Oliver Freyermuth
> > :
> >>
> >> All OSDs are of the very same size. One OSD host has slightly more
> disks (33 instead of 31), though.
> >> So also that that can't explain the hefty difference.
> >>
> >> I attach the output of "ceph osd tree" and "ceph osd df".
> >>
> >> The crush rule for the ceph_data pool is:
> >> rule cephfs_data {
> >> id 2
> >> type erasure
> >> min_size 3
> >> max_size 6
> >> step set_chooseleaf_tries 5
> >> step set_choose_tries 100
> >> step take default class hdd
> >> step chooseleaf indep 0 type host
> >> step emit
> >> }
> >> So that only considers the hdd device class. EC is done with k=4 m=2.
> >>
> >> So I don't see any imbalance on the hardware level, but only a somewhat
> uneven distribution of PGs.
> >> Am I missing something, or is this really just a case for the ceph
> balancer plugin?
> >> I'm just a bit astonished this effect is so huge.
> >> Maybe our 4096 PGs for the ceph_data pool are not enough to get an even
> distribution without balancing?
> >> But it yields about 100 PGs per OSD, as you can see...
> >>
> >> --
> >> # ceph osd tree
> >> ID  CLASS WEIGHTTYPE NAME   STATUS REWEIGHT PRI-AFF
> >>  -1   826.26428 root default
> >>  -3 0.43700 host mon001
> >>   0   ssd   0.21799 osd.0   up  1.0 1.0
> >>   1   ssd   0.21799 osd.1   up  1.0 1.0
> >>  -5 0.43700 host mon002
> >>   2   ssd   0.21799 osd.2   up  1.0 1.0
> >>   3   ssd   0.21799 osd.3   up  1.0 1.0
> >> -31 1.81898 host mon003
> >> 230   ssd   0.90999 osd.230 up  1.0 1.0
> >> 231   ssd   0.90999 osd.231 up  1.0 1.0
> >> -10   116.64600 host osd001
> >>   4   hdd   3.64499 osd.4   up  1.0 1.0
> >>   5   hdd   3.64499 osd.5   up  1.0 1.0
> >>   6   hdd   3.64499 osd.6   up  1.0 1.0
> >>   7   hdd   3.64499 osd.7   up  1.0 1.0
> >>   8   hdd   3.64499 osd.8   up  1.0 1.0
> >>   9   hdd   3.64499 osd.9   up  1.0 1.0
> >>  10   hdd   3.64499 osd.10  up  1.0 1.0
> >>  11   hdd   3.64499 osd.11  up  1.0 1.0
> >>  12   hdd   3.64499 osd.12  up  1.0 1.0
> >>  13   hdd   3.64499 osd.13  up  1.0 1.0
> >>  14   hdd   3.64499 osd.14  up  1.0 1.0
> >>  15   hdd   3.64499 osd.15  up  1.0 1.0
> >>  16   hdd   3.64499 osd.16  up  1.0 1.0
> >>  17   hdd   3.64499 osd.17  up  1.0 1.0
> >>  18   hdd   3.64499 osd.18  up  1.0 1.0
> >>  19   hdd   3.64499 osd.19  up  1.0 1.0
> >>  20   hdd   3.64499 osd.20  up  1.0 1.0
> >>  21   hdd   3.64499 osd.21  up  1.0 1.0
> >>  22   hdd   3.64499 osd.22  up  1.0 1.0
> >>  23   hdd   3.64499 osd.23  up  1.0 1.0
> >>  24   hdd   3.64499 osd.24  up  1.0 1.0
> >>  25   hdd   3.64499 osd.25  up  1.0 1.0
> >>  26   hdd   3.64499 osd.26  up  1.0 1.0
> >>  27   hdd   3.64499 osd.27  up  1.0 1.0
> >>  28   hdd   3.64499 osd.28  up  1.0 1.0
> >>  29   hdd   3.64499 osd.29  up  1.0 1.0
> >>  30   hdd   3.64499 osd.30  up  1.0 1.0
> >>  31   hdd   3.64499 osd.31  up  1.0 1.0
> >>  32   hdd   3.64

Re: [ceph-users] bluestore compression enabled but no data compressed

2018-10-19 Thread David Turner
1) I don't really know about the documentation.  You can always put
together a PR for an update to the docs.  I only know what I've tested
trying to get compression working.

2) If you have passive in both places, no compression will happen.  If you
have aggressive globally for the OSDs and none for the pools, you won't
have any compression happening either.  Any pool you set to passive or
aggressive will compress, and vice versa.  If you have the pools
all set to aggressive, then only OSDs with passive or aggressive will
compress.  That is useful if you have mixed flash and spinner disks,
for example when using primary affinity to speed things up.

3) I do not know much about the outputs.

4) The only way to compress previously written data is to rewrite it.
There is no process that will compress existing data.

On Fri, Oct 19, 2018 at 7:21 AM Frank Schilder  wrote:

> Hi David,
>
> sorry for the slow response, we had a hell of a week at work.
>
> OK, so I had compression mode set to aggressive on some pools, but the
> global option was not changed, because I interpreted the documentation as
> "pool settings take precedence". To check your advise, I executed
>
>   ceph tell "osd.*" config set bluestore_compression_mode aggressive
>
> and dumped a new file consisting of null-bytes. Indeed, this time I
> observe compressed objects:
>
> [root@ceph-08 ~]# ceph daemon osd.80 perf dump | grep blue
> "bluefs": {
> "bluestore": {
> "bluestore_allocated": 2967207936,
> "bluestore_stored": 3161981179,
> "bluestore_compressed": 24549408,
> "bluestore_compressed_allocated": 261095424,
> "bluestore_compressed_original": 522190848,
>
> Obvious questions that come to my mind:
>
> 1) I think either the documentation is misleading or the implementation is
> not following documented behaviour. I observe that per pool settings do
> *not* override globals, but the documentation says they will. (From doc:
> "Sets the policy for the inline compression algorithm for underlying
> BlueStore. This setting overrides the global setting of bluestore
> compression mode.") Will this be fixed in the future? Should this be
> reported?
>
> Remark: When I look at "compression_mode" under "
> http://docs.ceph.com/docs/luminous/rados/operations/pools/?highlight=bluestore%20compression#set-pool-values";
> it actually looks like a copy-and-paste error. The doc here talks about
> compression algorithm (see quote above) while the compression mode should
> be explained. Maybe that is worth looking at?
>
> 2) If I set the global to aggressive, do I now have to disable compression
> explicitly on pools where I don't want compression or is the pool default
> still "none"? Right now, I seem to observe that compression is still
> disabled by default.
>
> 3) Do you know what the output means? What is the compression ratio?
> bluestore_compressed/bluestore_compressed_original=0.04 or
> bluestore_compressed_allocated/bluestore_compressed_original=0.5? The
> second ratio does not look too impressive given the file contents.
>
> 4) Is there any way to get uncompressed data compressed as a background
> task like scrub?
>
> If you have the time to look at these questions, this would be great. Most
> importantly right now is that I got it to work.
>
> Thanks for your help,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: ceph-users  on behalf of Frank
> Schilder 
> Sent: 12 October 2018 17:00
> To: David Turner
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] bluestore compression enabled but no data
> compressed
>
> Hi David,
>
> thanks, now I see what you mean. If you are right, that would mean that
> the documentation is wrong. Under "
> http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values";
> is stated that "Sets inline compression algorithm to use for underlying
> BlueStore. This setting overrides the global setting of bluestore
> compression algorithm". In other words, the global setting should be
> irrelevant if compression is enabled on a pool.
>
> Well, I will try how setting both to "aggressive" or "force" works out and
> let you know.
>
> Thanks and have a nice weekend,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: David Turner 
> Sent: 12 October 2018 16:50:31
> To: Frank Schilder
> Cc: ceph-users@lists.

Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-18 Thread David Turner
What are your OSD node stats?  CPU, RAM, quantity and size of OSD disks.
You might need to modify some bluestore settings to speed up the time it
takes to peer, or perhaps you are simply underpowering the number of OSD
disks you're running and your servers and OSD daemons are going as
fast as they can.
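
Something quick like this from one of the OSD hosts would answer most of that
(just examples of what to collect):

  lscpu | egrep 'Model name|^CPU\(s\)'
  free -h
  ceph osd df tree     # OSD sizes, utilization, and how many sit on each host
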
On Sat, Oct 13, 2018 at 4:08 PM Stefan Priebe - Profihost AG <
s.pri...@profihost.ag> wrote:

> and a 3rd one:
>
> health: HEALTH_WARN
> 1 MDSs report slow metadata IOs
> 1 MDSs report slow requests
>
> 2018-10-13 21:44:08.150722 mds.cloud1-1473 [WRN] 7 slow requests, 1
> included below; oldest blocked for > 199.922552 secs
> 2018-10-13 21:44:08.150725 mds.cloud1-1473 [WRN] slow request 34.829662
> seconds old, received at 2018-10-13 21:43:33.321031:
> client_request(client.216121228:929114 lookup #0x1/.active.lock
> 2018-10-13 21:43:33.321594 caller_uid=0, caller_gid=0{}) currently
> failed to rdlock, waiting
>
> The relevant OSDs are bluestore again running at 100% I/O:
>
> iostat shows:
> sdi  77,00 0,00  580,00   97,00 511032,00   972,00
> 1512,57   14,88   22,05   24,57   6,97   1,48 100,00
>
> so it reads with 500MB/s which completely saturates the osd. And it does
> for > 10 minutes.
>
> Greets,
> Stefan
>
> Am 13.10.2018 um 21:29 schrieb Stefan Priebe - Profihost AG:
> >
> > ods.19 is a bluestore osd on a healthy 2TB SSD.
> >
> > Log of osd.19 is here:
> > https://pastebin.com/raw/6DWwhS0A
> >
> > Am 13.10.2018 um 21:20 schrieb Stefan Priebe - Profihost AG:
> >> Hi David,
> >>
> >> i think this should be the problem - form a new log from today:
> >>
> >> 2018-10-13 20:57:20.367326 mon.a [WRN] Health check update: 4 osds down
> >> (OSD_DOWN)
> >> ...
> >> 2018-10-13 20:57:41.268674 mon.a [WRN] Health check update: Reduced data
> >> availability: 3 pgs peering (PG_AVAILABILITY)
> >> ...
> >> 2018-10-13 20:58:08.684451 mon.a [WRN] Health check failed: 1 osds down
> >> (OSD_DOWN)
> >> ...
> >> 2018-10-13 20:58:22.841210 mon.a [WRN] Health check failed: Reduced data
> >> availability: 8 pgs inactive (PG_AVAILABILITY)
> >> 
> >> 2018-10-13 20:58:47.570017 mon.a [WRN] Health check update: Reduced data
> >> availability: 5 pgs inactive (PG_AVAILABILITY)
> >> ...
> >> 2018-10-13 20:58:49.142108 osd.19 [WRN] Monitor daemon marked osd.19
> >> down, but it is still running
> >> 2018-10-13 20:58:53.750164 mon.a [WRN] Health check update: Reduced data
> >> availability: 3 pgs inactive (PG_AVAILABILITY)
> >> ...
> >>
> >> so there is a timeframe of > 90s whee PGs are inactive and unavail -
> >> this would at least explain stalled I/O to me?
> >>
> >> Greets,
> >> Stefan
> >>
> >>
> >> Am 12.10.2018 um 15:59 schrieb David Turner:
> >>> The PGs per OSD does not change unless the OSDs are marked out.  You
> >>> have noout set, so that doesn't change at all during this test.  All of
> >>> your PGs peered quickly at the beginning and then were
> active+undersized
> >>> the rest of the time, you never had any blocked requests, and you
> always
> >>> had 100MB/s+ client IO.  I didn't see anything wrong with your cluster
> >>> to indicate that your clients had any problems whatsoever accessing
> data.
> >>>
> >>> Can you confirm that you saw the same problems while you were running
> >>> those commands?  The next thing would seem that possibly a client isn't
> >>> getting an updated OSD map to indicate that the host and its OSDs are
> >>> down and it's stuck trying to communicate with host7.  That would
> >>> indicate a potential problem with the client being unable to
> communicate
> >>> with the Mons maybe?  Have you completely ruled out any network
> problems
> >>> between all nodes and all of the IPs in the cluster.  What does your
> >>> client log show during these times?
> >>>
> >>> On Fri, Oct 12, 2018 at 8:35 AM Nils Fahldieck - Profihost AG
> >>> mailto:n.fahldi...@profihost.ag>> wrote:
> >>>
> >>> Hi, in our `ceph.conf` we have:
> >>>
> >>>   mon_max_pg_per_osd = 300
> >>>
> >>> While the host is offline (9 OSDs down):
> >>>
> >>>   4352 PGs * 3 / 62 OSDs ~ 210 PGs per OSD
> >>>
> >>> If all OSDs are online:
> >>>
> >&

Re: [ceph-users] ceph pg/pgp number calculation

2018-10-18 Thread David Turner
Not all pools need the same number of PGs. When you get to that many pools,
you want to start calculating how much data each pool will have. If one of
your pools will have 80% of your data in it, it should have 80% of your
PGs. The metadata pools for rgw likely won't need more than 8 or so PGs
each. If your rgw data pool is only going to hold a little scratch data,
then it won't need very many PGs either.
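
As a made-up example (the numbers are hypothetical, not from your cluster):
with 60 OSDs, 3 replicas and a target of ~100 PGs per OSD you have roughly
60 * 100 / 3 = 2000 PGs to hand out in total.  If the rgw data pool will hold
~80% of the data, give it ~1600 PGs rounded to a power of two (1024 or 2048),
and leave the small rgw metadata/control/log pools at 8 PGs each.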

On Tue, Oct 16, 2018, 3:35 AM Zhenshi Zhou  wrote:

> Hi,
>
> I have a cluster serving rbd and cephfs storage for a period of
> time. I added rgw in the cluster yesterday and wanted it to server
> object storage. Everything seems good.
>
> What I'm confused is how to calculate the pg/pgp number. As we
> all know, the formula of calculating pgs is:
>
> Total PGs = ((Total_number_of_OSD * 100) / max_replication_count) /
> pool_count
>
> Before I created rgw, the cluster had 3 pools(rbd, cephfs_data,
> cephfs_meta).
> But now it has 8 pools, which object service may use, including
> '.rgw.root',
> 'default.rgw.control', 'default.rgw.meta', 'default.rgw.log' and
> 'defualt.rgw.buckets.index'.
>
> Should I calculate pg number again using new pool number as 8, or should I
> continue to use the old pg number?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD for MON/MGR/MDS

2018-10-15 Thread David Turner
Mgr and MDS do not use physical space on a disk. Mons do use the disk and
benefit from SSDs, but they write a lot of stuff all the time. Depending on
why the SSDs aren't suitable for OSDs, they might not be suitable for mons
either.
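
If you want a feel for the write load, the mon store lives under
/var/lib/ceph/mon/ (the exact path may differ on your install):

  du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db

It stays fairly small on a healthy cluster, but it is written to constantly and
can grow a lot during recovery, so low-endurance SSDs can wear out there too.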

On Mon, Oct 15, 2018, 7:16 AM ST Wong (ITSC)  wrote:

> Hi all,
>
>
>
> We’ve got some servers with some small size SSD but no hard disks other
> than system disks.  While they’re not suitable for OSD, will the SSD be
> useful for running MON/MGR/MDS?
>
>
>
> Thanks a lot.
>
> Regards,
>
> /st wong
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore compression enabled but no data compressed

2018-10-12 Thread David Turner
If you go down just a little farther you'll see the settings that you put
into your ceph.conf under the osd section (although I'd probably do
global).  That's where the OSDs get the settings from.  As a note, once
these are set, future writes will be compressed (if they match the
compression settings you can see there: minimum ratios, blob
sizes, etc.).  To compress current data, you need to re-write it.
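
As a concrete sketch of that combination (the pool name is just a placeholder):

  # ceph.conf, [osd] or [global] section
  bluestore_compression_mode = aggressive
  bluestore_compression_algorithm = snappy

  # per pool
  ceph osd pool set mypool compression_mode aggressive
  ceph osd pool set mypool compression_algorithm snappy

  # push the OSD setting to running daemons without a restart
  ceph tell 'osd.*' injectargs '--bluestore_compression_mode aggressive'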

On Fri, Oct 12, 2018 at 10:41 AM Frank Schilder  wrote:

> Hi David,
>
> thanks for your quick answer. When I look at both references, I see
> exactly the same commands:
>
> ceph osd pool set {pool-name} {key} {value}
>
> where on one page only keys specific for compression are described. This
> is the command I found and used. However, I can't see any compression
> happening. If you know about something else than "ceph osd pool set" -
> commands, please let me know.
>
> Best regards,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: David Turner 
> Sent: 12 October 2018 15:47:20
> To: Frank Schilder
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] bluestore compression enabled but no data
> compressed
>
> It's all of the settings that you found in your first email when you
> dumped the configurations and such.
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#inline-compression
>
> On Fri, Oct 12, 2018 at 7:36 AM Frank Schilder  fr...@dtu.dk>> wrote:
> Hi David,
>
> thanks for your answer. I did enable compression on the pools as described
> in the link you sent below (ceph osd pool set sr-fs-data-test
> compression_mode aggressive, I also tried force to no avail). However, I
> could not find anything on enabling compression per OSD. Could you possibly
> provide a source or sample commands?
>
> Thanks and best regards,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: David Turner mailto:drakonst...@gmail.com>>
> Sent: 09 October 2018 17:42
> To: Frank Schilder
> Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] bluestore compression enabled but no data
> compressed
>
> When I've tested compression before there are 2 places you need to
> configure compression.  On the OSDs in the configuration settings that you
> mentioned, but also on the [1] pools themselves.  If you have the
> compression mode on the pools set to none, then it doesn't matter what the
> OSDs configuration is and vice versa unless you are using the setting of
> force.  If you want to default compress everything, set pools to passive
> and osds to aggressive.  If you want to only compress specific pools, set
> the osds to passive and the specific pools to aggressive.  Good luck.
>
>
> [1]
> http://docs.ceph.com/docs/mimic/rados/operations/pools/#set-pool-values
>
> On Tue, Sep 18, 2018 at 7:11 AM Frank Schilder  fr...@dtu.dk><mailto:fr...@dtu.dk<mailto:fr...@dtu.dk>>> wrote:
> I seem to have a problem getting bluestore compression to do anything. I
> followed the documentation and enabled bluestore compression on various
> pools by executing "ceph osd pool set  compression_mode
> aggressive". Unfortunately, it seems like no data is compressed at all. As
> an example, below is some diagnostic output for a data pool used by a
> cephfs:
>
> [root@ceph-01 ~]# ceph --version
> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
> (stable)
>
> All defaults are OK:
>
> [root@ceph-01 ~]# ceph --show-config | grep compression
> [...]
> bluestore_compression_algorithm = snappy
> bluestore_compression_max_blob_size = 0
> bluestore_compression_max_blob_size_hdd = 524288
> bluestore_compression_max_blob_size_ssd = 65536
> bluestore_compression_min_blob_size = 0
> bluestore_compression_min_blob_size_hdd = 131072
> bluestore_compression_min_blob_size_ssd = 8192
> bluestore_compression_mode = none
> bluestore_compression_required_ratio = 0.875000
> [...]
>
> Compression is reported as enabled:
>
> [root@ceph-01 ~]# ceph osd pool ls detail
> [...]
> pool 24 'sr-fs-data-test' erasure size 8 min_size 7 crush_rule 10
> object_hash rjenkins pg_num 50 pgp_num 50 last_change 7726 flags
> hashpspool,ec_overwrites stripe_width 24576 compression_algorithm snappy
> compression_mode aggressive application cephfs
> [...]
>
> [root@ceph-01 ~]# ceph osd pool get sr-fs-data-test compression_mode
> compression_mode: aggressive
> [root@ceph-01 ~]# ceph osd pool get sr-fs-d

Re: [ceph-users] Anyone tested Samsung 860 DCT SSDs?

2018-10-12 Thread David Turner
What do you want to use these for?  "5 Year or 0.2 DWPD" is the durability
of this drive which is absolutely awful for most every use in Ceph.
Possibly if you're using these for data disks (not DB or WAL) and you plan
to have more durable media to host the DB+WAL on... this could work.  Or
if you're just doing archival storage... but then you should be using much
cheaper spinners.  Back in the days of Filestore and SSD Journals I had
some disks that had 0.3 DWPD and I had to replace all of the disks in under
a year because they ran out of writes.
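
To put 0.2 DWPD into numbers (assuming, say, the 960GB model): 0.2 * 960GB is
about 192GB of writes per day, or roughly 350TB of total endurance over the 5
year warranty.  A DB/WAL or journal workload will chew through that very
quickly, which is how those 0.3 DWPD disks of mine died in under a year.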

On Fri, Oct 12, 2018 at 9:55 AM Kenneth Van Alstyne <
kvanalst...@knightpoint.com> wrote:

> Cephers:
> As the subject suggests, has anyone tested Samsung 860 DCT SSDs?
> They are really inexpensive and we are considering buying some to test.
>
> Thanks,
>
> --
> Kenneth Van Alstyne
> Systems Architect
> Knight Point Systems, LLC
> Service-Disabled Veteran-Owned Business
> 1775 Wiehle Avenue Suite 101 | Reston, VA 20190
> 
> c: 228-547-8045 <(228)%20547-8045> f: 571-266-3106 <(571)%20266-3106>
> www.knightpoint.com
> DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
> GSA Schedule 70 SDVOSB: GS-35F-0646S
> GSA MOBIS Schedule: GS-10F-0404Y
> ISO 2 / ISO 27001 / CMMI Level 3
>
> Notice: This e-mail message, including any attachments, is for the sole
> use of the intended recipient(s) and may contain confidential and
> privileged information. Any unauthorized review, copy, use, disclosure, or
> distribution is STRICTLY prohibited. If you are not the intended recipient,
> please contact the sender by reply e-mail and destroy all copies of the
> original message.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-12 Thread David Turner
The PGs per OSD does not change unless the OSDs are marked out.  You have
noout set, so that doesn't change at all during this test.  All of your PGs
peered quickly at the beginning and then were active+undersized the rest of
the time, you never had any blocked requests, and you always had 100MB/s+
client IO.  I didn't see anything wrong with your cluster to indicate that
your clients had any problems whatsoever accessing data.

Can you confirm that you saw the same problems while you were running those
commands?  The next guess would be that a client isn't getting
an updated OSD map to indicate that the host and its OSDs are down, and it's
stuck trying to communicate with host7.  That would point to a potential
problem with the client being unable to communicate with the mons.
Have you completely ruled out any network problems between all nodes and
all of the IPs in the cluster?  What does your client log show during these
times?
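
If the client has an admin socket configured (the path below is just an
example; it depends on your admin_socket setting), something like this would
show whether it is sitting on stuck ops aimed at a down OSD:

  ceph daemon /var/run/ceph/ceph-client.admin.12345.asok objecter_requests
  ceph osd stat     # what the cluster itself currently reports (osdmap epoch)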

On Fri, Oct 12, 2018 at 8:35 AM Nils Fahldieck - Profihost AG <
n.fahldi...@profihost.ag> wrote:

> Hi, in our `ceph.conf` we have:
>
>   mon_max_pg_per_osd = 300
>
> While the host is offline (9 OSDs down):
>
>   4352 PGs * 3 / 62 OSDs ~ 210 PGs per OSD
>
> If all OSDs are online:
>
>   4352 PGs * 3 / 71 OSDs ~ 183 PGs per OSD
>
> ... so this doesn't seem to be the issue.
>
> If I understood you right, that's what you've meant. If I got you wrong,
> would you mind to point to one of those threads you mentioned?
>
> Thanks :)
>
> Am 12.10.2018 um 14:03 schrieb Burkhard Linke:
> > Hi,
> >
> >
> > On 10/12/2018 01:55 PM, Nils Fahldieck - Profihost AG wrote:
> >> I rebooted a Ceph host and logged `ceph status` & `ceph health detail`
> >> every 5 seconds. During this I encountered 'PG_AVAILABILITY Reduced data
> >> availability: pgs peering'. At the same time some VMs hung as described
> >> before.
> >
> > Just a wild guess... you have 71 OSDs and about 4500 PG with size=3.
> > 13500 PG instance overall, resulting in ~190 PGs per OSD under normal
> > circumstances.
> >
> > If one host is down and the PGs have to re-peer, you might reach the
> > limit of 200 PG/OSDs on some of the OSDs, resulting in stuck peering.
> >
> > You can try to raise this limit. There are several threads on the
> > mailing list about this.
> >
> > Regards,
> > Burkhard
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore compression enabled but no data compressed

2018-10-12 Thread David Turner
It's all of the settings that you found in your first email when you dumped
the configurations and such.
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#inline-compression

On Fri, Oct 12, 2018 at 7:36 AM Frank Schilder  wrote:

> Hi David,
>
> thanks for your answer. I did enable compression on the pools as described
> in the link you sent below (ceph osd pool set sr-fs-data-test
> compression_mode aggressive, I also tried force to no avail). However, I
> could not find anything on enabling compression per OSD. Could you possibly
> provide a source or sample commands?
>
> Thanks and best regards,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: David Turner 
> Sent: 09 October 2018 17:42
> To: Frank Schilder
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] bluestore compression enabled but no data
> compressed
>
> When I've tested compression before there are 2 places you need to
> configure compression.  On the OSDs in the configuration settings that you
> mentioned, but also on the [1] pools themselves.  If you have the
> compression mode on the pools set to none, then it doesn't matter what the
> OSDs configuration is and vice versa unless you are using the setting of
> force.  If you want to default compress everything, set pools to passive
> and osds to aggressive.  If you want to only compress specific pools, set
> the osds to passive and the specific pools to aggressive.  Good luck.
>
>
> [1]
> http://docs.ceph.com/docs/mimic/rados/operations/pools/#set-pool-values
>
> On Tue, Sep 18, 2018 at 7:11 AM Frank Schilder  fr...@dtu.dk>> wrote:
> I seem to have a problem getting bluestore compression to do anything. I
> followed the documentation and enabled bluestore compression on various
> pools by executing "ceph osd pool set  compression_mode
> aggressive". Unfortunately, it seems like no data is compressed at all. As
> an example, below is some diagnostic output for a data pool used by a
> cephfs:
>
> [root@ceph-01 ~]# ceph --version
> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
> (stable)
>
> All defaults are OK:
>
> [root@ceph-01 ~]# ceph --show-config | grep compression
> [...]
> bluestore_compression_algorithm = snappy
> bluestore_compression_max_blob_size = 0
> bluestore_compression_max_blob_size_hdd = 524288
> bluestore_compression_max_blob_size_ssd = 65536
> bluestore_compression_min_blob_size = 0
> bluestore_compression_min_blob_size_hdd = 131072
> bluestore_compression_min_blob_size_ssd = 8192
> bluestore_compression_mode = none
> bluestore_compression_required_ratio = 0.875000
> [...]
>
> Compression is reported as enabled:
>
> [root@ceph-01 ~]# ceph osd pool ls detail
> [...]
> pool 24 'sr-fs-data-test' erasure size 8 min_size 7 crush_rule 10
> object_hash rjenkins pg_num 50 pgp_num 50 last_change 7726 flags
> hashpspool,ec_overwrites stripe_width 24576 compression_algorithm snappy
> compression_mode aggressive application cephfs
> [...]
>
> [root@ceph-01 ~]# ceph osd pool get sr-fs-data-test compression_mode
> compression_mode: aggressive
> [root@ceph-01 ~]# ceph osd pool get sr-fs-data-test compression_algorithm
> compression_algorithm: snappy
>
> We dumped a 4Gib file with dd from /dev/zero. Should be easy to compress
> with excellent ratio. Search for a PG:
>
> [root@ceph-01 ~]# ceph pg ls-by-pool sr-fs-data-test
> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES
>  LOG DISK_LOG STATESTATE_STAMPVERSION  REPORTED UP
>  UP_PRIMARY ACTING   ACTING_PRIMARY
> LAST_SCRUB SCRUB_STAMPLAST_DEEP_SCRUB DEEP_SCRUB_STAMP
> 24.0 15  00 0   0  62914560
> 77   77 active+clean 2018-09-14 01:07:14.593007  7698'77 7735:142
> [53,47,36,30,14,55,57,5] 53 [53,47,36,30,14,55,57,5]
>  537698'77 2018-09-14 01:07:14.592966 0'0 2018-09-11
> 08:06:29.309010
>
> There is about 250MB data on the primary OSD, but noting seems to be
> compressed:
>
> [root@ceph-07 ~]# ceph daemon osd.53 perf dump | grep blue
> [...]
> "bluestore_allocated": 313917440,
> "bluestore_stored": 264362803,
> "bluestore_compressed": 0,
> "bluestore_compressed_allocated": 0,
> "bluestore_compressed_original": 0,
> [...]
>
> Just to make sure, I checked one of the objects' contents:
>
> [root@ceph-01 ~]# rados ls -p sr-fs-data-test
&g

Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-11 Thread David Turner
ad~1,d83b0~2,d83b4~2,d83b8~1,d83ba~a,d83c5~1,d83c7~1,d83ca~1,d83cc~1,d83ce~1,d83d0~1,d83d2~6,d83d9~3,d83df~1,d83e1~2,d83e5~1,d83e8~1,d83eb~4,d83f0~1,d83f2~1,d83f4~3,d83f8~3,d83fd~2,d8402~1,d8405~1,d8407~1,d840a~2,d840f~1,d8411~1,d8413~3,d8417~3,d841c~4,d8422~4,d8428~2,d842b~1,d842e~1,d8430~1,d8432~5,d843a~1,d843c~3,d8440~5,d8447~1,d844a~1,d844d~1,d844f~1,d8452~1,d8455~1,d8457~1,d8459~2,d845d~2,d8460~1,d8462~3,d8467~1,d8469~1,d846b~2,d846e~2,d8471~4,d8476~6,d847d~3,d8482~1,d8484~1,d8486~2,d8489~2,d848c~1,d848e~1,d8491~4,d8499~1,d849c~3,d84a0~1,d84a2~1,d84a4~3,d84aa~2,d84ad~2,d84b1~4,d84b6~1,d84b8~1,d84ba~1,d84bc~1,d84be~1,d84c0~5,d84c7~4,d84ce~1,d84d0~1,d84d2~2,d84d6~2,d84db~1,d84dd~2,d84e2~2,d84e6~1,d84e9~1,d84eb~4,d84f0~4]
> pool 6 'cephfs_cephstor1_data' replicated size 3 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 128 pgp_num 128 last_change 1214952 flags
> hashpspool stripe_width 0 application cephfs
> pool 7 'cephfs_cephstor1_metadata' replicated size 3 min_size 1
> crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change
> 1214952 flags hashpspool stripe_width 0 application cephfs
>
> Am 11.10.2018 um 20:47 schrieb David Turner:
> > My first guess is to ask what your crush rules are.  `ceph osd crush
> > rule dump` along with `ceph osd pool ls detail` would be helpful.  Also
> > if you have a `ceph status` output from a time where the VM RBDs aren't
> > working might explain something.
> >
> > On Thu, Oct 11, 2018 at 1:12 PM Nils Fahldieck - Profihost AG
> > mailto:n.fahldi...@profihost.ag>> wrote:
> >
> > Hi everyone,
> >
> > since some time we experience service outages in our Ceph cluster
> > whenever there is any change to the HEALTH status. E. g. swapping
> > storage devices, adding storage devices, rebooting Ceph hosts, during
> > backfills ect.
> >
> > Just now I had a recent situation, where several VMs hung after I
> > rebooted one Ceph host. We have 3 replications for each PG, 3 mon, 3
> > mgr, 3 mds and 71 osds spread over 9 hosts.
> >
> > We use Ceph as a storage backend for our Proxmox VE (PVE)
> environment.
> > The outages are in the form of blocked virtual file systems of those
> > virtual machines running in our PVE cluster.
> >
> > It feels similar to stuck and inactive PGs to me. Honestly though I'm
> > not really sure on how to debug this problem or which log files to
> > examine.
> >
> > OS: Debian 9
> > Kernel: 4.12 based upon SLE15-SP1
> >
> > # ceph version
> > ceph version 12.2.8-133-gded2f6836f
> > (ded2f6836f6331a58f5c817fca7bfcd6c58795aa) luminous (stable)
> >
> > Can someone guide me? I'm more than happy to provide more information
> > as needed.
> >
> > Thanks in advance
> > Nils
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-11 Thread David Turner
My first guess is to ask what your crush rules are.  `ceph osd crush rule
dump` along with `ceph osd pool ls detail` would be helpful.  Also if you
have a `ceph status` output from a time where the VM RBDs aren't working
might explain something.

On Thu, Oct 11, 2018 at 1:12 PM Nils Fahldieck - Profihost AG <
n.fahldi...@profihost.ag> wrote:

> Hi everyone,
>
> since some time we experience service outages in our Ceph cluster
> whenever there is any change to the HEALTH status. E. g. swapping
> storage devices, adding storage devices, rebooting Ceph hosts, during
> backfills ect.
>
> Just now I had a recent situation, where several VMs hung after I
> rebooted one Ceph host. We have 3 replications for each PG, 3 mon, 3
> mgr, 3 mds and 71 osds spread over 9 hosts.
>
> We use Ceph as a storage backend for our Proxmox VE (PVE) environment.
> The outages are in the form of blocked virtual file systems of those
> virtual machines running in our PVE cluster.
>
> It feels similar to stuck and inactive PGs to me. Honestly though I'm
> not really sure on how to debug this problem or which log files to examine.
>
> OS: Debian 9
> Kernel: 4.12 based upon SLE15-SP1
>
> # ceph version
> ceph version 12.2.8-133-gded2f6836f
> (ded2f6836f6331a58f5c817fca7bfcd6c58795aa) luminous (stable)
>
> Can someone guide me? I'm more than happy to provide more information
> as needed.
>
> Thanks in advance
> Nils
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent PG, repair doesn't work

2018-10-11 Thread David Turner
Part of a repair is queuing a deep scrub. As soon as the repair part
is over, the deep scrub continues until it is done.

On Thu, Oct 11, 2018, 12:26 PM Brett Chancellor 
wrote:

> Does the "repair" function use the same rules as a deep scrub? I couldn't
> get one to kick off, until I temporarily increased the max_scrubs and
> lowered the scrub_min_interval on all 3 OSDs for that placement group. This
> ended up fixing the issue, so I'll leave this here in case somebody else
> runs into it.
>
> sudo ceph tell 'osd.208' injectargs '--osd_max_scrubs 3'
> sudo ceph tell 'osd.120' injectargs '--osd_max_scrubs 3'
> sudo ceph tell 'osd.235' injectargs '--osd_max_scrubs 3'
> sudo ceph tell 'osd.208' injectargs '--osd_scrub_min_interval 1.0'
> sudo ceph tell 'osd.120' injectargs '--osd_scrub_min_interval 1.0'
> sudo ceph tell 'osd.235' injectargs '--osd_scrub_min_interval 1.0'
> sudo ceph pg repair 75.302
>
> -Brett
>
>
> On Thu, Oct 11, 2018 at 8:42 AM Maks Kowalik 
> wrote:
>
>> Imho moving was not the best idea (a copying attempt would have told if
>> the read error was the case here).
>> Scrubs might don't want to start if there are many other scrubs ongoing.
>>
>> czw., 11 paź 2018 o 14:27 Brett Chancellor 
>> napisał(a):
>>
>>> I moved the file. But the cluster won't actually start any scrub/repair
>>> I manually initiate.
>>>
>>> On Thu, Oct 11, 2018, 7:51 AM Maks Kowalik 
>>> wrote:
>>>
 Based on the log output it looks like you're having a damaged file on
 OSD 235 where the shard is stored.
 To ensure if that's the case you should find the file (using
 81d5654895863d as a part of its name) and try to copy it to another
 directory.
 If you get the I/O error while copying, the next steps would be to
 delete the file, run the scrub on 75.302 and take a deep look at the
 OSD.235 for any other errors.

 Kind regards,
 Maks

>>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] https://ceph-storage.slack.com

2018-10-11 Thread David Turner
I have 4 other slack servers that I'm in for work and personal hobbies.
It's just easier for me to maintain one more slack server than have a
separate application for IRC.

On Thu, Oct 11, 2018, 11:02 AM John Spray  wrote:

> On Thu, Oct 11, 2018 at 8:44 AM Marc Roos 
> wrote:
> >
> >
> > Why slack anyway?
>
> Just because some people like using it.  Don't worry, IRC is still the
> primary channel and lots of people don't use slack.  I'm not on slack,
> for example, which is either a good or bad thing depending on your
> perspective :-D
>
> John
>
> >
> >
> >
> >
> > -Original Message-
> > From: Konstantin Shalygin [mailto:k0...@k0ste.ru]
> > Sent: donderdag 11 oktober 2018 5:11
> > To: ceph-users@lists.ceph.com
> > Subject: *SPAM* Re: [ceph-users] https://ceph-storage.slack.com
> >
> > > why would a ceph slack be invite only?
> >
> > Because this is not Telegram.
> >
> >
> >
> > k
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-10 Thread David Turner
Not a resolution, but an idea that you've probably thought of.  Disabling
logging on any affected OSDs (possibly just all of them) seems like a
needed step to be able to keep working with this cluster to finish the
upgrade and get it healthier.
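
The surest way to do that is to point the log at /dev/null in ceph.conf and
restart the daemon, e.g. (a sketch; the osd id is a placeholder, repeat it
for each affected OSD):

[osd.0]
log_file = /dev/null

Injecting the same option at runtime with ceph tell may also work depending
on the version, but the ceph.conf route is the safe bet.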

On Wed, Oct 10, 2018 at 6:37 PM Wido den Hollander  wrote:

>
>
> On 10/11/2018 12:08 AM, Wido den Hollander wrote:
> > Hi,
> >
> > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8 I'm
> > seeing OSDs writing heavily to their logfiles spitting out these lines:
> >
> >
> > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd2078000~34000
> > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd22cc000~24000
> > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd230~2
> > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd2324000~24000
> > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd26c~24000
> > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0
> > dump  0x15cd2704000~3
> >
> > It goes so fast that the OS-disk in this case can't keep up and become
> > 100% util.
> >
> > This causes the OSD to slow down and cause slow requests and starts to
> flap.
> >
> > It seems that this is *only* happening on OSDs which are the fullest
> > (~85%) on this cluster and they have about ~400 PGs each (Yes, I know,
> > that's high).
> >
>
> After some searching I stumbled upon this Bugzilla report:
> https://bugzilla.redhat.com/show_bug.cgi?id=1600138
>
> That seems to be the same issue, although I'm not 100% sure.
>
> Wido
>
> > Looking at StupidAllocator.cc I see this piece of code:
> >
> > void StupidAllocator::dump()
> > {
> >   std::lock_guard l(lock);
> >   for (unsigned bin = 0; bin < free.size(); ++bin) {
> > ldout(cct, 0) << __func__ << " free bin " << bin << ": "
> >   << free[bin].num_intervals() << " extents" << dendl;
> > for (auto p = free[bin].begin();
> >  p != free[bin].end();
> >  ++p) {
> >   ldout(cct, 0) << __func__ << "  0x" << std::hex << p.get_start()
> > << "~"
> > << p.get_len() << std::dec << dendl;
> > }
> >   }
> > }
> >
> > I'm just wondering why it would spit out these lines and what's causing
> it.
> >
> > Has anybody seen this before?
> >
> > Wido
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] https://ceph-storage.slack.com

2018-10-10 Thread David Turner
I would like an invite too.  drakonst...@gmail.com

On Wed, Sep 19, 2018 at 1:02 PM Gregory Farnum  wrote:

> Done. :)
>
> On Tue, Sep 18, 2018 at 12:15 PM Alfredo Daniel Rezinovsky <
> alfredo.rezinov...@ingenieria.uncuyo.edu.ar> wrote:
>
>> Can anyone add me to this slack?
>>
>> with my email alfrenov...@gmail.com
>>
>> Thanks.
>>
>> --
>> Alfredo Daniel Rezinovsky
>> Director de Tecnologías de Información y Comunicaciones
>> Facultad de Ingeniería - Universidad Nacional de Cuyo
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN 2 osd(s) have {NOUP, NODOWN, NOIN, NOOUT} flags set

2018-10-10 Thread David Turner
There is a newer [1] feature to be able to set flags per OSD instead of
cluster wide.  This way you can prevent a problem host from marking its
OSDs down while the rest of the cluster is capable of doing so.  [2] These
commands ought to clear up your status.

[1]
http://docs.ceph.com/docs/master/rados/operations/health-checks/#osd-flags

[2] ceph osd rm-noin 3
ceph osd rm-noin 5
ceph osd rm-noin 10
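
If the noout flag is also set on those OSDs, the analogous commands should
clear it as well (same ids as above):

ceph osd rm-noout 3
ceph osd rm-noout 5
ceph osd rm-noout 10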

On Tue, Oct 9, 2018 at 1:49 PM Rafael Montes  wrote:

> Hello everyone,
>
>
> I am getting warning messages regarding 3 OSDs with noin and noout flags
> set. The OSDs are in the up state. I have run ceph osd unset noin on the
> cluster and it does not seem to clear the flags. I have attached status
> files for the cluster.
>
>
> The cluster is running  deepsea-0.8.6-2.21.1.noarch and
> ceph-12.2.8+git.1536505967.080f2248ff-2.15.1.x86_64.
>
>
>
> Has anybody run into this issue and if so how was it resolved?
>
>
> Thanks
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does anyone use interactive CLI mode?

2018-10-10 Thread David Turner
I know that it existed, but I've never bothered using it.  In applications
like Python, where you can get a different result by interacting with it
line by line and building up an environment, an interactive mode is very
helpful.  Ceph, however, doesn't keep any such session state that would make
this more useful than the traditional CLI.
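
For anyone who hasn't tried it, the mode in question looks roughly like this
(from memory, so treat it as a sketch; you exit with Ctrl-D):

$ ceph
ceph> status
ceph> osd tree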

On Wed, Oct 10, 2018 at 10:20 AM John Spray  wrote:

> Hi all,
>
> Since time immemorial, the Ceph CLI has had a mode where when run with
> no arguments, you just get an interactive prompt that lets you run
> commands without "ceph" at the start.
>
> I recently discovered that we actually broke this in Mimic[1], and it
> seems that nobody noticed!
>
> So the question is: does anyone actually use this feature?  It's not
> particularly expensive to maintain, but it might be nice to have one
> less path through the code if this is entirely unused.
>
> Cheers,
> John
>
> 1. https://github.com/ceph/ceph/pull/24521
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't remove DeleteMarkers in rgw bucket

2018-10-09 Thread David Turner
I would suggest trying to delete the bucket using radosgw-admin.  If you
can't get that to work, then I would go towards deleting the actual RADOS
objects.  There are a few threads on the ML that talk about manually
deleting a bucket.
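
Something like this is the usual starting point (a sketch; the bucket name
is a placeholder, and --bypass-gc assumes a reasonably recent radosgw-admin):

radosgw-admin bucket rm --bucket=<bucket-name> --purge-objects --bypass-gc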

On Thu, Sep 20, 2018 at 2:04 PM Sean Purdy  wrote:

> Hi,
>
>
> We have a bucket that we are trying to empty.  Versioning and lifecycle
> was enabled.  We deleted all the objects in the bucket.  But this left a
> whole bunch of Delete Markers.
>
> aws s3api delete-object --bucket B --key K --version-id V is not deleting
> the delete markers.
>
> Any ideas?  We want to delete the bucket so we can reuse the bucket name.
> Alternatively, is there a way to delete a bucket that still contains delete
> markers?
>
>
> $ aws --profile=owner s3api list-object-versions --bucket bucket --prefix
> 0/0/00fff6df-863d-48b5-9089-cc6e7c5997e7
>
> {
>   "DeleteMarkers": [
> {
>   "Owner": {
> "DisplayName": "bucket owner",
> "ID": "owner"
>   },
>   "IsLatest": true,
>   "VersionId": "ZB8ty9c3hxjxV5izmIKM1QwDR6fwnsd",
>   "Key": "0/0/00fff6df-863d-48b5-9089-cc6e7c5997e7",
>   "LastModified": "2018-09-17T16:19:58.187Z"
> }
>   ]
> }
>
> $ aws --profile=owner s3api delete-object --bucket bucket --key
> 0/0/00fff6df-863d-48b5-9089-cc6e7c5997e7 --version-id
> ZB8ty9c3hxjxV5izmIKM1QwDR6fwnsd
>
> returns 0 but the delete marker remains.
>
>
> This bucket was created in 12.2.2, current version of ceph is 12.2.7 via
> 12.2.5
>
>
> Thanks,
>
> Sean
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore compression enabled but no data compressed

2018-10-09 Thread David Turner
When I've tested compression before, there are 2 places you need to
configure it: on the OSDs, in the configuration settings that you mentioned,
but also on the [1] pools themselves.  If the compression mode on the pools
is set to none, then it doesn't matter what the OSD configuration is, and
vice versa, unless you are using the force setting.  If you want to compress
everything by default, set the pools to passive and the OSDs to aggressive.
If you want to compress only specific pools, set the OSDs to passive and the
specific pools to aggressive.  Good luck.


[1] http://docs.ceph.com/docs/mimic/rados/operations/pools/#set-pool-values
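
For example, to compress only your test pool while leaving the OSD default
passive, one combination would be (a sketch; adjust the pool name and
algorithm as needed):

# ceph.conf, [osd] section
bluestore_compression_mode = passive

# per pool
ceph osd pool set sr-fs-data-test compression_mode aggressive
ceph osd pool set sr-fs-data-test compression_algorithm snappy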

On Tue, Sep 18, 2018 at 7:11 AM Frank Schilder  wrote:

> I seem to have a problem getting bluestore compression to do anything. I
> followed the documentation and enabled bluestore compression on various
> pools by executing "ceph osd pool set  compression_mode
> aggressive". Unfortunately, it seems like no data is compressed at all. As
> an example, below is some diagnostic output for a data pool used by a
> cephfs:
>
> [root@ceph-01 ~]# ceph --version
> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
> (stable)
>
> All defaults are OK:
>
> [root@ceph-01 ~]# ceph --show-config | grep compression
> [...]
> bluestore_compression_algorithm = snappy
> bluestore_compression_max_blob_size = 0
> bluestore_compression_max_blob_size_hdd = 524288
> bluestore_compression_max_blob_size_ssd = 65536
> bluestore_compression_min_blob_size = 0
> bluestore_compression_min_blob_size_hdd = 131072
> bluestore_compression_min_blob_size_ssd = 8192
> bluestore_compression_mode = none
> bluestore_compression_required_ratio = 0.875000
> [...]
>
> Compression is reported as enabled:
>
> [root@ceph-01 ~]# ceph osd pool ls detail
> [...]
> pool 24 'sr-fs-data-test' erasure size 8 min_size 7 crush_rule 10
> object_hash rjenkins pg_num 50 pgp_num 50 last_change 7726 flags
> hashpspool,ec_overwrites stripe_width 24576 compression_algorithm snappy
> compression_mode aggressive application cephfs
> [...]
>
> [root@ceph-01 ~]# ceph osd pool get sr-fs-data-test compression_mode
> compression_mode: aggressive
> [root@ceph-01 ~]# ceph osd pool get sr-fs-data-test compression_algorithm
> compression_algorithm: snappy
>
> We dumped a 4Gib file with dd from /dev/zero. Should be easy to compress
> with excellent ratio. Search for a PG:
>
> [root@ceph-01 ~]# ceph pg ls-by-pool sr-fs-data-test
> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES
>  LOG DISK_LOG STATESTATE_STAMPVERSION  REPORTED UP
>  UP_PRIMARY ACTING   ACTING_PRIMARY
> LAST_SCRUB SCRUB_STAMPLAST_DEEP_SCRUB DEEP_SCRUB_STAMP
>
> 24.0 15  00 0   0  62914560
> 77   77 active+clean 2018-09-14 01:07:14.593007  7698'77 7735:142
> [53,47,36,30,14,55,57,5] 53 [53,47,36,30,14,55,57,5]
>  537698'77 2018-09-14 01:07:14.592966 0'0 2018-09-11
> 08:06:29.309010
>
> There is about 250MB of data on the primary OSD, but nothing seems to be
> compressed:
>
> [root@ceph-07 ~]# ceph daemon osd.53 perf dump | grep blue
> [...]
> "bluestore_allocated": 313917440,
> "bluestore_stored": 264362803,
> "bluestore_compressed": 0,
> "bluestore_compressed_allocated": 0,
> "bluestore_compressed_original": 0,
> [...]
>
> Just to make sure, I checked one of the objects' contents:
>
> [root@ceph-01 ~]# rados ls -p sr-fs-data-test
> 104.039c
> [...]
> 104.039f
>
> It is 4M chunks ...
> [root@ceph-01 ~]# rados -p sr-fs-data-test stat 104.039f
> sr-fs-data-test/104.039f mtime 2018-09-11 14:39:38.00,
> size 4194304
>
> ... with all zeros:
>
> [root@ceph-01 ~]# rados -p sr-fs-data-test get 104.039f obj
>
> [root@ceph-01 ~]# hexdump -C obj
>   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
> ||
> *
> 0040
>
> All as it should be, except for compression. Am I overlooking something?
>
> Best regards,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error-code 2002/API 405 S3 REST API. Creating a new bucket

2018-10-09 Thread David Turner
Can you outline the process you're using to access the REST API?  It's hard
to troubleshoot this without knowing how you were trying to do this.

On Mon, Sep 17, 2018 at 7:09 PM Michael Schäfer 
wrote:

> Hi,
>
> We have a problem with the radosgw using the S3 REST API.
> Trying to create a new bucket does not work.
> We got a 405 at the API level and the log indicates a 2002 error.
> Does anybody know what this error code means? The radosgw log is
> attached.
>
> Bests,
> Michael
>
> 2018-09-17 11:58:03.388 7f65250c2700  1 == starting new request
> req=0x7f65250b9830 =
> 2018-09-17 11:58:03.388 7f65250c2700  2 req 20:0.20::GET
> /egobackup::initializing for trans_id =
> tx00014-005b9f88bb-d393-default
> 2018-09-17 11:58:03.388 7f65250c2700 10 rgw api priority: s3=5 s3website=4
> 2018-09-17 11:58:03.388 7f65250c2700 10 host=85.214.24.54
> 2018-09-17 11:58:03.388 7f65250c2700 20 subdomain= domain=
> in_hosted_domain=0 in_hosted_domain_s3website=0
> 2018-09-17 11:58:03.388 7f65250c2700 20 final domain/bucket subdomain=
> domain= in_hosted_domain=0 in_hosted_domain_s3website=0 s->info.domain=
> s->info.request_
> uri=/egobackup
> 2018-09-17 11:58:03.388 7f65250c2700 20 get_handler
> handler=25RGWHandler_REST_Bucket_S3
> 2018-09-17 11:58:03.388 7f65250c2700 10 handler=25RGWHandler_REST_Bucket_S3
> 2018-09-17 11:58:03.388 7f65250c2700  2 req 20:0.81:s3:GET
> /egobackup::getting op 0
> 2018-09-17 11:58:03.388 7f65250c2700 10
> op=32RGWGetBucketLocation_ObjStore_S3
> 2018-09-17 11:58:03.388 7f65250c2700  2 req 20:0.86:s3:GET
> /egobackup:get_bucket_location:verifying requester
> 2018-09-17 11:58:03.388 7f65250c2700 20
> rgw::auth::StrategyRegistry::s3_main_strategy_t: trying
> rgw::auth::s3::AWSAuthStrategy
> 2018-09-17 11:58:03.388 7f65250c2700 20 rgw::auth::s3::AWSAuthStrategy:
> trying rgw::auth::s3::S3AnonymousEngine
> 2018-09-17 11:58:03.388 7f65250c2700 20 rgw::auth::s3::S3AnonymousEngine
> denied with reason=-1
> 2018-09-17 11:58:03.388 7f65250c2700 20 rgw::auth::s3::AWSAuthStrategy:
> trying rgw::auth::s3::LocalEngine
> 2018-09-17 11:58:03.388 7f65250c2700 10 get_canon_resource():
> dest=/egobackup?location
> 2018-09-17 11:58:03.388 7f65250c2700 10 string_to_sign:
> GET
> 1B2M2Y8AsgTpgAmY7PhCfg==
>
> Mon, 17 Sep 2018 10:58:03 GMT
> /egobackup?location
> 2018-09-17 11:58:03.388 7f65250c2700 15 string_to_sign=GET
> 1B2M2Y8AsgTpgAmY7PhCfg==
>
> Mon, 17 Sep 2018 10:58:03 GMT
> /egobackup?location
> 2018-09-17 11:58:03.388 7f65250c2700 15 server
> signature=fbEd2DlKyKC8JOXTgMZSXV68ngc=
> 2018-09-17 11:58:03.388 7f65250c2700 15 client
> signature=fbEd2DlKyKC8JOXTgMZSXV68ngc=
> 2018-09-17 11:58:03.388 7f65250c2700 15 compare=0
> 2018-09-17 11:58:03.388 7f65250c2700 20 rgw::auth::s3::LocalEngine granted
> access
> 2018-09-17 11:58:03.388 7f65250c2700 20 rgw::auth::s3::AWSAuthStrategy
> granted access
> 2018-09-17 11:58:03.388 7f65250c2700  2 req 20:0.000226:s3:GET
> /egobackup:get_bucket_location:normalizing buckets and tenants
> 2018-09-17 11:58:03.388 7f65250c2700 10 s->object=
> s->bucket=egobackup
> 2018-09-17 11:58:03.388 7f65250c2700  2 req 20:0.000235:s3:GET
> /egobackup:get_bucket_location:init permissions
> 2018-09-17 11:58:03.388 7f65250c2700 20 get_system_obj_state:
> rctx=0x7f65250b7a30 obj=default.rgw.meta:root:egobackup
> state=0x55b1bc2e1220 s->prefetch_data=0
> 2018-09-17 11:58:03.388 7f65250c2700 10 cache get:
> name=default.rgw.meta+root+egobackup : miss
> 2018-09-17 11:58:03.388 7f65250c2700 10 cache put:
> name=default.rgw.meta+root+egobackup info.flags=0x0
> 2018-09-17 11:58:03.388 7f65250c2700 10 adding
> default.rgw.meta+root+egobackup to cache LRU end
> 2018-09-17 11:58:03.388 7f65250c2700 10 init_permissions on egobackup[]
> failed, ret=-2002
> 2018-09-17 11:58:03.388 7f65250c2700 20 op->ERRORHANDLER: err_no=-2002
> new_err_no=-2002
> 2018-09-17 11:58:03.388 7f65250c2700 30 AccountingFilter::send_status:
> e=0, sent=24, total=0
> 2018-09-17 11:58:03.388 7f65250c2700 30 AccountingFilter::send_header:
> e=0, sent=0, total=0
> 2018-09-17 11:58:03.388 7f65250c2700 30
> AccountingFilter::send_content_length: e=0, sent=21, total=0
> 2018-09-17 11:58:03.388 7f65250c2700 30 AccountingFilter::send_header:
> e=0, sent=0, total=0
> 2018-09-17 11:58:03.388 7f65250c2700 30 AccountingFilter::send_header:
> e=0, sent=0, total=0
> 2018-09-17 11:58:03.388 7f65250c2700 30 AccountingFilter::complete_header:
> e=0, sent=159, total=0
> 2018-09-17 11:58:03.388 7f65250c2700 30 AccountingFilter::set_account: e=1
> 2018-09-17 11:58:03.388 7f65250c2700 30 AccountingFilter::send_body: e=1,
> sent=219, total=0
> 2018-09-17 11:58:03.388 7f65250c2700 30
> AccountingFilter::complete_request: e=1, sent=0, total=219
> 2018-09-17 11:58:03.388 7f65250c2700  2 req 20:0.001272:s3:GET
> /egobackup:get_bucket_location:op status=0
> 2018-09-17 11:58:03.388 7f65250c2700  2 req 20:0.001276:s3:GET
> /egobackup:get_bucket_location:http status=404
> 2018-09-17 11:58:03.388 7f65250c2700  1 =

Re: [ceph-users] radosgw bucket stats vs s3cmd du

2018-10-09 Thread David Turner
Have you looked at your garbage collection?  I would guess that your GC is
behind and that radosgw-admin is accounting for that space, knowing that it
hasn't been freed up yet, while s3cmd doesn't see it since it no longer
shows in the listing.
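
A quick way to check and, if needed, kick it along (--include-all lists GC
entries that aren't due yet):

radosgw-admin gc list --include-all | head
radosgw-admin gc process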

On Tue, Sep 18, 2018 at 4:45 AM Luis Periquito  wrote:

> Hi all,
>
> I have a couple of very big s3 buckets that store temporary data. We
> keep writing to the buckets some files which are then read and
> deleted. They serve as a temporary storage.
>
> We're writing (and deleting) circa 1TB of data daily in each of those
> buckets, and their size has been mostly stable over time.
>
> The issue has arisen that radosgw-admin bucket stats says one bucket
> is 10T and the other is 4T; but s3cmd du (and I did a sync which
> agrees) says 3.5T and 2.3T respectively.
>
> The bigger bucket suffered from the orphaned objects bug
> (http://tracker.ceph.com/issues/18331). The smaller one was created on
> 10.2.3 so it may also have suffered from the same bug.
>
> Any ideas what could be at play here? How can we reduce actual usage?
>
> trimming part of the radosgw-admin bucket stats output
> "usage": {
> "rgw.none": {
> "size": 0,
> "size_actual": 0,
> "size_utilized": 0,
> "size_kb": 0,
> "size_kb_actual": 0,
> "size_kb_utilized": 0,
> "num_objects": 18446744073709551572
> },
> "rgw.main": {
> "size": 10870197197183,
> "size_actual": 10873866362880,
> "size_utilized": 18446743601253967400,
> "size_kb": 10615426951,
> "size_kb_actual": 10619010120,
> "size_kb_utilized": 18014398048099578,
> "num_objects": 1702444
> },
> "rgw.multimeta": {
> "size": 0,
> "size_actual": 0,
> "size_utilized": 0,
> "size_kb": 0,
> "size_kb_actual": 0,
> "size_kb_utilized": 0,
> "num_objects": 406462
> }
> },
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-10-01 Thread David Turner
I tried modifying filestore_rocksdb_options by removing
compression=kNoCompression
as well as setting it to compression=kSnappyCompression.  Leaving it with
kNoCompression or removing it results in the same segfault in the previous
log.  Setting it to kSnappyCompression resulted in [1] this being logged
and the OSD just failing to start instead of segfaulting.  Is there
anything else you would suggest trying before I purge this OSD from the
cluster?  I'm afraid it might be something with the CentOS binaries.

[1] 2018-10-01 17:10:37.134930 7f1415dfcd80  0  set rocksdb option
compression = kSnappyCompression
2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument:
Compression type Snappy is not linked with the binary.
2018-10-01 17:10:37.135004 7f1415dfcd80 -1
filestore(/var/lib/ceph/osd/ceph-1) mount(1723): Error initializing rocksdb
:
2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to
mount object store
2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init
failed: (1) Operation not permittedESC[0m
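
For reference, rocksdb logs what the build supports when the OSD starts
(Pavan's log below shows the same lines on Ubuntu), so something like this
should show whether Snappy is linked in the CentOS packages (log path is
illustrative):

grep -iE 'compression algorithms|snappy|zlib|lz4|zstd' /var/log/ceph/ceph-osd.1.log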

On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi <
prallabha...@walmartlabs.com> wrote:

> I looked at one of my test clusters running Jewel on Ubuntu 16.04, and
> interestingly I found this(below) in one of the OSD logs, which is
> different from your OSD boot log, where none of the compression algorithms
> seem to be supported. This hints more at how rocksdb was built on CentOS
> for Ceph.
>
> 2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Compression algorithms
> supported:
> 2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Snappy supported: 1
> 2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Zlib supported: 1
> 2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Bzip supported: 0
> 2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: LZ4 supported: 0
> 2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: ZSTD supported: 0
> 2018-09-29 17:38:38.629115 7fbd318d4b00  4 rocksdb: Fast CRC32 supported: 0
>
> On 9/27/18, 2:56 PM, "Pavan Rallabhandi" 
> wrote:
>
> I see Filestore symbols on the stack, so the bluestore config doesn’t
> affect. And the top frame of the stack hints at a RocksDB issue, and there
> are a whole lot of these too:
>
> “2018-09-17 19:23:06.480258 7f1f3d2a7700  2 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636]
> Cannot find Properties block from file.”
>
> It really seems to be something with RocksDB on centOS. I still think
> you can try removing “compression=kNoCompression” from the
> filestore_rocksdb_options And/Or check if rocksdb is expecting snappy to be
> enabled.
>
> Thanks,
> -Pavan.
>
> From: David Turner 
> Date: Thursday, September 27, 2018 at 1:18 PM
> To: Pavan Rallabhandi 
> Cc: ceph-users 
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> I got pulled away from this for a while.  The error in the log is
> "abort: Corruption: Snappy not supported or corrupted Snappy compressed
> block contents" and the OSD has 2 settings set to snappy by default,
> async_compressor_type and bluestore_compression_algorithm.  Do either of
> these settings affect the omap store?
>
> On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi  prallabha...@walmartlabs.com> wrote:
> Looks like you are running on CentOS, fwiw. We’ve successfully ran the
> conversion commands on Jewel, Ubuntu 16.04.
>
> Have a feel it’s expecting the compression to be enabled, can you try
> removing “compression=kNoCompression” from the filestore_rocksdb_options?
> And/or you might want to check if rocksdb is expecting snappy to be enabled.
>
> From: David Turner <mailto:drakonst...@gmail.com>
> Date: Tuesday, September 18, 2018 at 6:01 PM
> To: Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com>
> Cc: ceph-users <mailto:ceph-users@lists.ceph.com>
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> Here's the [1] full log from the time the OSD was started to the end
> of the crash dump.  These logs are so hard to parse.  Is there anything
> useful in them?
>
> I did confirm that all perms were set correctly and that the
> superblock was changed to rocksdb before the first time I attempted to
> start the OSD with it's new DB.  This is on a fully Luminous cluster with
> [2] the defaults you mentioned.
>
> [1]
> https://gist.github.com/drakonstein/fa3ac0ad9b2ec1

Re: [ceph-users] mount cephfs from a public network ip of mds

2018-09-30 Thread David Turner
I doubt you have a use case that requires you to have a different public
and private network. Just use 1 network on the 10Gb nics. There have been
plenty of mailing list threads in the last year along with testing and
production experience that indicate that having the networks separated is
not needed for a vast majority of Ceph deployments. It generally just adds
complexity for no noticeable gains.
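
In that case the network part of ceph.conf collapses to something like this
(a sketch reusing the 10Gb subnet from your mail; note there is no
cluster_network line at all):

[global]
public_network = 10.32.67.0/24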

On Sun, Sep 30, 2018, 10:11 PM Joshua Chen 
wrote:

> Hello Paul,
>   Thanks for your reply.
>   Now my clients will be from 140.109 (LAN, the real ip network 1Gb/s) and
> from 10.32 (SAN, a closed 10Gb network). Could I make this public_network
> to be 0.0.0.0? so mon daemon listens on both 1Gb and 10Gb network?
>   Or could I have
> public_network = 140.109.169.0/24, 10.32.67.0/24
> cluster_network = 10.32.67.0/24
>
> does ceph allow 2 (multiple) public_network?
>
>   And I don't want to limit the client read/write speed to be 1Gb/s
> nics unless they don't have 10Gb nic installed. To guarantee clients
> read/write to osd (when they know the details of the location) they should
> be using the fastest nic (10Gb) when available. But other clients with only
> 1Gb nic will go through 140.109.0.0 (1Gb LAN) to ask mon or to read/write
> to osds. This is why my osds also have 1Gb and 10Gb nics with 140.109.0.0
> and 10.32.0.0 networking respectively.
>
> Cheers
> Joshua
>
> On Sun, Sep 30, 2018 at 12:09 PM David Turner 
> wrote:
>
>> The cluster/private network is only used by the OSDs. Nothing else in
>> ceph or its clients communicates using it. Everything other than osd to osd
>> communication uses the public network. That includes the MONs, MDSs,
>> clients, and anything other than an osd talking to an osd. Nothing else
>> other than osd to osd traffic can communicate on the private/cluster
>> network.
>>
>> On Sat, Sep 29, 2018, 6:43 AM Paul Emmerich 
>> wrote:
>>
>>> All Ceph clients will always first connect to the mons. Mons provide
>>> further information on the cluster such as the IPs of MDS and OSDs.
>>>
>>> This means you need to provide the mon IPs to the mount command, not
>>> the MDS IPs. Your first command works by coincidence since
>>> you seem to run the mons and MDS' on the same server.
>>>
>>>
>>> Paul
>>> Am Sa., 29. Sep. 2018 um 12:07 Uhr schrieb Joshua Chen
>>> :
>>> >
>>> > Hello all,
>>> >   I am testing the cephFS cluster so that clients could mount -t ceph.
>>> >
>>> >   the cluster has 6 nodes, 3 mons (also mds), and 3 osds.
>>> >   All these 6 nodes has 2 nic, one 1Gb nic with real ip (140.109.0.0)
>>> and 1 10Gb nic with virtual ip (10.32.0.0)
>>> >
>>> > 140.109. Nic1 1G<-MDS1->Nic2 10G 10.32.
>>> > 140.109. Nic1 1G<-MDS2->Nic2 10G 10.32.
>>> > 140.109. Nic1 1G<-MDS3->Nic2 10G 10.32.
>>> > 140.109. Nic1 1G<-OSD1->Nic2 10G 10.32.
>>> > 140.109. Nic1 1G<-OSD2->Nic2 10G 10.32.
>>> > 140.109. Nic1 1G<-OSD3->Nic2 10G 10.32.
>>> >
>>> >
>>> >
>>> > and I have the following questions:
>>> >
>>> > 1, can I have both public (140.109.0.0) and cluster (10.32.0.0)
>>> clients all be able to mount this cephfs resource
>>> >
>>> > I want to do
>>> >
>>> > (in a 140.109 network client)
>>> > mount -t ceph mds1(140.109.169.48):/ /mnt/cephfs -o user=,secret=
>>> >
>>> > and also in a 10.32.0.0 network client)
>>> > mount -t ceph mds1(10.32.67.48):/
>>> > /mnt/cephfs -o user=,secret=
>>> >
>>> >
>>> >
>>> >
>>> > Currently, only this 10.32.0.0 clients can mount it. that of public
>>> network (140.109) can not. How can I enable this?
>>> >
>>> > here attached is my ceph.conf
>>> >
>>> > Thanks in advance
>>> >
>>> > Cheers
>>> > Joshua
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> --
>>> Paul Emmerich
>>>
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>
>>> croit GmbH
>>> Freseniusstr. 31h
>>> 81247 München
>>> www.croit.io
>>> Tel: +49 89 1896585 90
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mount cephfs from a public network ip of mds

2018-09-29 Thread David Turner
The cluster/private network is only used by the OSDs. Nothing else in ceph
or its clients communicates using it. Everything other than osd to osd
communication uses the public network. That includes the MONs, MDSs,
clients, and anything other than an osd talking to an osd. Nothing else
other than osd to osd traffic can communicate on the private/cluster
network.

On Sat, Sep 29, 2018, 6:43 AM Paul Emmerich  wrote:

> All Ceph clients will always first connect to the mons. Mons provide
> further information on the cluster such as the IPs of MDS and OSDs.
>
> This means you need to provide the mon IPs to the mount command, not
> the MDS IPs. Your first command works by coincidence since
> you seem to run the mons and MDS' on the same server.
>
>
> Paul
> Am Sa., 29. Sep. 2018 um 12:07 Uhr schrieb Joshua Chen
> :
> >
> > Hello all,
> >   I am testing the cephFS cluster so that clients could mount -t ceph.
> >
> >   the cluster has 6 nodes, 3 mons (also mds), and 3 osds.
> >   All these 6 nodes has 2 nic, one 1Gb nic with real ip (140.109.0.0)
> and 1 10Gb nic with virtual ip (10.32.0.0)
> >
> > 140.109. Nic1 1G<-MDS1->Nic2 10G 10.32.
> > 140.109. Nic1 1G<-MDS2->Nic2 10G 10.32.
> > 140.109. Nic1 1G<-MDS3->Nic2 10G 10.32.
> > 140.109. Nic1 1G<-OSD1->Nic2 10G 10.32.
> > 140.109. Nic1 1G<-OSD2->Nic2 10G 10.32.
> > 140.109. Nic1 1G<-OSD3->Nic2 10G 10.32.
> >
> >
> >
> > and I have the following questions:
> >
> > 1, can I have both public (140.109.0.0) and cluster (10.32.0.0) clients
> all be able to mount this cephfs resource
> >
> > I want to do
> >
> > (in a 140.109 network client)
> > mount -t ceph mds1(140.109.169.48):/ /mnt/cephfs -o user=,secret=
> >
> > and also in a 10.32.0.0 network client)
> > mount -t ceph mds1(10.32.67.48):/
> > /mnt/cephfs -o user=,secret=
> >
> >
> >
> >
> > Currently, only this 10.32.0.0 clients can mount it. that of public
> network (140.109) can not. How can I enable this?
> >
> > here attached is my ceph.conf
> >
> > Thanks in advance
> >
> > Cheers
> > Joshua
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-09-27 Thread David Turner
I got pulled away from this for a while.  The error in the log is "abort:
Corruption: Snappy not supported or corrupted Snappy compressed block
contents" and the OSD has 2 settings set to snappy by default,
async_compressor_type and bluestore_compression_algorithm.  Do either of
these settings affect the omap store?
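
For what it's worth, the values actually in effect can be read from the
admin socket (the osd id is just an example):

ceph daemon osd.1 config show | grep -E 'async_compressor_type|bluestore_compression_algorithm|filestore_rocksdb_options|filestore_omap_backend'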

On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi <
prallabha...@walmartlabs.com> wrote:

> Looks like you are running on CentOS, fwiw. We’ve successfully ran the
> conversion commands on Jewel, Ubuntu 16.04.
>
> Have a feel it’s expecting the compression to be enabled, can you try
> removing “compression=kNoCompression” from the filestore_rocksdb_options?
> And/or you might want to check if rocksdb is expecting snappy to be enabled.
>
> From: David Turner 
> Date: Tuesday, September 18, 2018 at 6:01 PM
> To: Pavan Rallabhandi 
> Cc: ceph-users 
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> Here's the [1] full log from the time the OSD was started to the end of
> the crash dump.  These logs are so hard to parse.  Is there anything useful
> in them?
>
> I did confirm that all perms were set correctly and that the superblock
> was changed to rocksdb before the first time I attempted to start the OSD
> with it's new DB.  This is on a fully Luminous cluster with [2] the
> defaults you mentioned.
>
> [1] https://gist.github.com/drakonstein/fa3ac0ad9b2ec1389c957f95e05b79ed
> [2] "filestore_omap_backend": "rocksdb",
> "filestore_rocksdb_options":
> "max_background_compactions=8,compaction_readahead_size=2097152,compression=kNoCompression",
>
> On Tue, Sep 18, 2018 at 5:29 PM Pavan Rallabhandi  prallabha...@walmartlabs.com> wrote:
> I meant the stack trace hints that the superblock still has leveldb in it,
> have you verified that already?
>
> On 9/18/18, 5:27 PM, "Pavan Rallabhandi"  prallabha...@walmartlabs.com> wrote:
>
> You should be able to set them under the global section and that
> reminds me, since you are on Luminous already, I guess those values are
> already the default, you can verify from the admin socket of any OSD.
>
>     But the stack trace didn’t hint as if the superblock on the OSD is
> still considering the omap backend to be leveldb and to do with the
> compression.
>
> Thanks,
> -Pavan.
>
> From: David Turner <mailto:drakonst...@gmail.com>
> Date: Tuesday, September 18, 2018 at 5:07 PM
> To: Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com>
> Cc: ceph-users <mailto:ceph-users@lists.ceph.com>
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> Are those settings fine to have be global even if not all OSDs on a
> node have rocksdb as the backend?  Or will I need to convert all OSDs on a
> node at the same time?
>
> On Tue, Sep 18, 2018 at 5:02 PM Pavan Rallabhandi <mailto:
> prallabha...@walmartlabs.com> wrote:
> The steps that were outlined for conversion are correct, have you
> tried setting some the relevant ceph conf values too:
>
>     filestore_rocksdb_options =
> "max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression"
>
> filestore_omap_backend = rocksdb
>
> Thanks,
> -Pavan.
>
> From: ceph-users <mailto:ceph-users-boun...@lists.ceph.com> on
> behalf of David Turner <mailto:drakonst...@gmail.com>
> Date: Tuesday, September 18, 2018 at 4:09 PM
> To: ceph-users <mailto:mailto:ceph-users@lists.ceph.com>
> Subject: EXT: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> I've finally learned enough about the OSD backend track down this
> issue to what I believe is the root cause.  LevelDB compaction is the
> common thread every time we move data around our cluster.  I've ruled out
> PG subfolder splitting, EC doesn't seem to be the root cause of this, and
> it is cluster wide as opposed to specific hardware.
>
> One of the first things I found after digging into leveldb omap
> compaction was [1] this article with a heading "RocksDB instead of LevelDB"
> which mentions that leveldb was replaced with rocksdb as the default db
> backend for filestore OSDs and was even backported to Jewel because of the
> performance improvements.
>
> I figured there must be a way to be able to upgrade an OSD to use
> rocksdb from leveldb without needing to fully backfill the entire OSD.
> There is [2] this article, but you need to have an active service account

Re: [ceph-users] Mimic upgrade failure

2018-09-20 Thread David Turner
> is reporting failure:1
>
> I'm working on getting things mostly good again with everything on mimic and
> will see if it behaves better.
>
> Thanks for your input on this David.
>
>
> [global]
> mon_initial_members = sephmon1, sephmon2, sephmon3
> mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> public_network = 10.1.0.0/16
> osd backfill full ratio = 0.92
> osd failsafe nearfull ratio = 0.90
> osd max object size = 21474836480
> mon max pg per osd = 350
>
> [mon]
> mon warn on legacy crush tunables = false
> mon pg warn max per osd = 300
> mon osd down out subtree limit = host
> mon osd nearfull ratio = 0.90
> mon osd full ratio = 0.97
> mon health preluminous compat warning = false
> osd heartbeat grace = 60
> rocksdb cache size = 1342177280
>
> [mds]
> mds log max segments = 100
> mds log max expiring = 40
> mds bal fragment size max = 20
> mds cache memory limit = 4294967296
>
> [osd]
> osd mkfs options xfs = -i size=2048 -d su=512k,sw=1
> osd recovery delay start = 30
> osd recovery max active = 5
> osd max backfills = 3
> osd recovery threads = 2
> osd crush initial weight = 0
> osd heartbeat interval = 30
> osd heartbeat grace = 60
>
>
> On 09/08/2018 11:24 PM, David Turner wrote:
>
>
> What osd/mon/etc config settings do you have that are not default? It
> might be worth utilizing nodown to stop osds from marking each other down
> and finish the upgrade to be able to set the minimum osd version to mimic.
> Stop the osds in a node, manually mark them down, start them back up in
> mimic. Depending on how bad things are, setting pause on the cluster to
> just finish the upgrade faster might not be a bad idea either.
>
> This should be a simple question, have you confirmed that there are no
> networking problems between the MONs while the elections are happening?
>
> On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek
> <mailto:kevin.hrp...@ssec.wisc.edu> wrote:
>
> Hey Sage,
>
> I've posted the file with my email address for the user. It is
> with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The
> mons are calling for elections about every minute so I let this
> run for a few elections and saw this node become the leader a
> couple times. Debug logs start around 23:27:30. I had managed to
> get about 850/857 osds up, but it seems that within the last 30
> min it has all gone bad again due to the OSDs reporting each
> other as failed. We relaxed the osd_heartbeat_interval to 30 and
> osd_heartbeat_grace to 60 in an attempt to slow down how quickly
> OSDs are trying to fail each other. I'll put in the
> rocksdb_cache_size setting.
>
> Thanks for taking a look.
>
> Kevin
>
> On 09/08/2018 06:04 PM, Sage Weil wrote:
>
>
> Hi Kevin,
>
> I can't think of any major luminous->mimic changes off the top of my
> head
> that would impact CPU usage, but it's always possible there is
> something
> subtle.  Can you ceph-post-file a the full log from one of your mons
> (preferbably the leader)?
>
> You might try adjusting the rocksdb cache size.. try setting
>
>   rocksdb_cache_size = 1342177280   # 10x the default, ~1.3 GB
>
> on the mons and restarting?
>
> Thanks!
> sage
>
> On Sat, 8 Sep 2018, Kevin Hrpcek wrote:
>
>
>
> Hello,
>
> I've had a Luminous -> Mimic upgrade go very poorly and my cluster
> is stuck
> with almost all pgs down. One problem is that the mons have
> started to
> re-elect a new quorum leader almost every minute. This is making
> it difficult
> to monitor the cluster and even run any commands on it since at
> least half the
> time a ceph command times out or takes over a minute to return
> results. I've
> looked at the debug logs and it appears there is some timeout
> occurring with
> paxos of about a minute. The msg_dispatch thread of the mons is
> often running
> a core at 100% for about a minute(user time, no iowait). Running
> strace on it
> shows the process is going through all of the mon db files (about
> 6gb in
> store.db/*.sst). Does anyone have an idea of what this timeout is
> or why my
