Re: [ceph-users] Can't create erasure coded pools with k+m greater than hosts?

2019-10-18 Thread Chris Taylor

Full disclosure - I have not created an erasure code pool yet!

I have been wanting to do the same thing that you are attempting and 
have these links saved. I believe this is what you are looking for.


This link is for decompiling the CRUSH rules and recompiling:

https://docs.ceph.com/docs/luminous/rados/operations/crush-map-edits/
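If it saves you a lookup, the workflow from that page is roughly the
following (the file names are just placeholders):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt (add or adjust rules), then recompile and inject it
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new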


This link is for creating the EC rules for 4+2 with only 3 hosts:

https://ceph.io/planet/erasure-code-on-small-clusters/
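The rule described there boils down to something like the sketch below:
pick 3 hosts, then 2 OSDs on each, so a 4+2 pool spreads its 6 chunks as
2 per host. The rule name, id and tunables here are placeholders; compare
it against the blog post before applying anything.

rule ec42_small_cluster {
    id 50
    type erasure
    min_size 3
    max_size 6
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 3 type host
    step chooseleaf indep 2 type osd
    step emit
}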


I hope that helps!



Chris


On 2019-10-18 2:55 pm, Salsa wrote:

Ok, I'm lost here.

How am I supposed to write a crush rule?

So far I managed to run:

#ceph osd crush rule dump test -o test.txt

So I can edit the rule. Now I have two problems:

1. What are the functions and operations to use here? Is there
documentation anywhere about this?
2. How may I create a crush rule using this file? 'ceph osd crush rule
create ... -i test.txt' does not work.

Am I taking the wrong approach here?


--
Salsa

Sent with ProtonMail Secure Email.

‐‐‐ Original Message ‐‐‐
On Friday, October 18, 2019 3:56 PM, Paul Emmerich wrote:


Default failure domain in Ceph is "host" (see ec profile), i.e., you
need at least k+m hosts (but at least k+m+1 is better for production
setups).
You can change that to OSD, but that's not a good idea for a
production setup for obvious reasons. It's slightly better to write a
crush rule that explicitly picks two disks on 3 different hosts
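For reference, switching the failure domain in the profile instead would
look something like this (the profile and pool names are made up; as noted
above, an OSD-level failure domain means losing one host can take out more
chunks than the pool can tolerate):

ceph osd erasure-code-profile set ec42osd k=4 m=2 crush-failure-domain=osd
ceph osd pool create test_ec_osd 16 16 erasure ec42osd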

Paul



Paul Emmerich

Looking for help with your Ceph cluster? Contact us at 
https://croit.io


croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Oct 18, 2019 at 8:45 PM Salsa sa...@protonmail.com wrote:

> I have probably misunderstood how to create erasure coded pools, so I may be 
in need of some theory and would appreciate it if you can point me to documentation 
that may clarify my doubts.
> I have so far 1 cluster with 3 hosts and 30 OSDs (10 each host).
> I tried to create an erasure code profile like so:
> "
>
> ceph osd erasure-code-profile get ec4x2rs
>
> ==
>
> crush-device-class=
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=2
> plugin=jerasure
> technique=reed_sol_van
> w=8
> "
> If I create a pool using this profile or any profile where K+M > hosts, then 
the pool gets stuck.
> "
>
> ceph -s
>
> 
>
> cluster:
> id: eb4aea44-0c63-4202-b826-e16ea60ed54d
> health: HEALTH_WARN
> Reduced data availability: 16 pgs inactive, 16 pgs incomplete
> 2 pools have too many placement groups
> too few PGs per OSD (4 < min 30)
> services:
> mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 11d)
> mgr: ceph01(active, since 74m), standbys: ceph03, ceph02
> osd: 30 osds: 30 up (since 2w), 30 in (since 2w)
> data:
> pools: 11 pools, 32 pgs
> objects: 0 objects, 0 B
> usage: 32 GiB used, 109 TiB / 109 TiB avail
> pgs: 50.000% pgs not active
> 16 active+clean
> 16 creating+incomplete
>
> ceph osd pool ls
>
> =
>
> test_ec
> test_ec2
> "
> The pool will never leave this "creating+incomplete" state.
> The pools were created like this:
> "
>
> ceph osd pool create test_ec2 16 16 erasure ec4x2rs
>
> 
>
> ceph osd pool create test_ec 16 16 erasure
>
> ===
>
> "
> The default profile pool is created correctly.
> My profiles are like this:
> "
>
> ceph osd erasure-code-profile get default
>
> ==
>
> k=2
> m=1
> plugin=jerasure
> technique=reed_sol_van
>
> ceph osd erasure-code-profile get ec4x2rs
>
> ==
>
> crush-device-class=
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=2
> plugin=jerasure
> technique=reed_sol_van
> w=8
> "
> From what I've read it seems to be possible to create erasure coded pools with 
K+M higher than the number of hosts. Is this not so?
> What am I doing wrong? Do I have to create any special crush map rule?
> --
> Salsa
> Sent with ProtonMail Secure Email.
>
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] bluestore db & wal use spdk device how to ?

2019-08-05 Thread Chris Hsiang
Hi All,

I have multiple NVMe SSDs and I wish to use two of them via SPDK as the
bluestore db & wal devices.

My assumption would be to put the following in ceph.conf, under the [osd]
section:

bluestore_block_db_path = "spdk::01:00.0"
bluestore_block_db_size = 40 * 1024 * 1024 * 1024 (40G)

Then how do I prepare the OSD?
ceph-volume lvm prepare --bluestore --data vg_ceph/lv_sas-sda
--block.db spdk::01:00.0  ?

What if I have a second NVMe SSD (:1a:00.0) that I want to use for a different OSD?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing the release cadence

2019-06-05 Thread Chris Taylor


It seems like since the change to the 9-month cadence it has been bumpy 
for the Debian-based installs. Changing to a 12-month cadence sounds 
like a good idea. Perhaps some Debian maintainers can suggest a good 
month for them to get the packages in time for their release cycle.



On 2019-06-05 12:16 pm, Alexandre DERUMIER wrote:

Hi,



- November: If we release Octopus 9 months from the Nautilus release
(planned for Feb, released in Mar) then we'd target this November. We
could shift to a 12 month cadence after that.


For the last 2 Debian releases, the freeze was around January-February, so
November seems to be a good time for a Ceph release.

- Original Message -
From: "Sage Weil" 
To: "ceph-users" , "ceph-devel"
, d...@ceph.io
Sent: Wednesday, 5 June 2019 17:57:52
Subject: Changing the release cadence

Hi everyone,

Since luminous, we have had the following release cadence and policy:
- release every 9 months
- maintain backports for the last two releases
- enable upgrades to move either 1 or 2 releases ahead
(e.g., luminous -> mimic or nautilus; mimic -> nautilus or octopus; 
...)


This has mostly worked out well, except that the mimic release received
less attention than we wanted due to the fact that multiple downstream
Ceph products (from Red Hat and SUSE) decided to base their next release
on nautilus. Even though upstream every release is an "LTS" release, as a
practical matter mimic got less attention than luminous or nautilus.

We've had several requests/proposals to shift to a 12 month cadence. This
has several advantages:

- Stable/conservative clusters only have to be upgraded every 2 years
(instead of every 18 months)
- Yearly releases are more likely to intersect with downstream
distribution releases (e.g., Debian). In the past there have been
problems where the Ceph releases included in consecutive releases of a
distro weren't easily upgradeable.
- Vendors that make downstream Ceph distributions/products tend to
release yearly. Aligning with those vendors means they are more likely
to productize *every* Ceph release. This will help make every Ceph
release an "LTS" release (not just in name but also in terms of
maintenance attention).

So far the balance of opinion seems to favor a shift to a 12 month
cycle[1], especially among developers, so it seems pretty likely we'll
make that shift. (If you do have strong concerns about such a move, now
is the time to raise them.)

That brings us to an important decision: what time of year should we
release? Once we pick the timing, we'll be releasing at that time *every
year* for each release (barring another schedule shift, which we want to
avoid), so let's choose carefully!

A few options:

- November: If we release Octopus 9 months from the Nautilus release
(planned for Feb, released in Mar) then we'd target this November. We
could shift to a 12 month cadence after that.
- February: That's 12 months from the Nautilus target.
- March: That's 12 months from when Nautilus was *actually* released.

November is nice in the sense that we'd wrap things up before the
holidays. It's less good in that users may not be inclined to install the
new release when many developers will be less available in December.

February kind of sucked in that the scramble to get the last few things
done happened during the holidays. OTOH, we should be doing what we can
to avoid such scrambles, so that might not be something we should factor
in. March may be a bit more balanced, with a solid 3 months before when
people are productive, and 3 months after before they disappear on holiday
to address any post-release issues.

People tend to be somewhat less available over the summer months due to
holidays etc, so an early or late summer release might also be less than
ideal.

Thoughts? If we can narrow it down to a few options maybe we could do a
poll to gauge user preferences.

Thanks!
sage


[1] https://twitter.com/larsmb/status/1130010208971952129

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One host with 24 OSDs is offline - best way to get it back online

2019-01-27 Thread Chris
When your node went down, you lost 100% of the copies of the objects that 
were stored on that node, so the cluster had to re-create a copy of 
everything.  When the node came back online (and particularly since your 
usage was near-zero), the cluster discovered that many objects did not 
require changes and were still identical to their counterparts.  The only 
moved objects would have been ones that had changed and ones that needed to 
be moved in order to satisfy the requirements of your crush map for the 
purposes of distribution.


On January 27, 2019 09:47:59 Götz Reinicke  
wrote:

Dear all,

thanks for your feedback and I'll try to take every suggestion into consideration.

I've rebooted the node in question and all 24 OSDs came online without any 
complaints.


But what makes me wonder is: during the downtime the objects got rebalanced 
and placed on the remaining nodes.


With the failed node back online, only a couple of hundred objects were 
misplaced, out of about 35 million.


The question for me is: What happens to the objects on the OSDs that went 
down after the OSDs got back online?


Thanks for feedback



On 27.01.2019 at 04:17, Christian Balzer wrote:


Hello,

this is where (depending on your topology) something like:
---
mon_osd_down_out_subtree_limit = host
---
can come in very handy.

Provided you have correct monitoring, alerting and operations, a down node
can often be restored long before any recovery would be finished, and you
also avoid the data movement back and forth.
And if you see that recovering the node will take a long time, just
manually set things out for the time being.
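As a rough illustration, manually marking the node's OSDs out while it is
being repaired would be something like (the OSD ids are examples only):

for id in 24 25 26; do ceph osd out $id; done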

Christian

On Sun, 27 Jan 2019 00:02:54 +0100 Götz Reinicke wrote:


Dear Chris,

Thanks for your feedback. The node/OSDs in question are part of an erasure 
coded pool and during the weekend the workload should be close to none.


But anyway, I could get a look at the console and at the server; the power 
is up, but I can't use the console: the login prompt is shown, but no key is 
accepted.


I'll have to reboot the server and check what it is complaining about 
tomorrow morning, as soon as I can access the server again.


Fingers crossed and regards. Götz





On 26.01.2019 at 23:41, Chris wrote:

It sort of depends on your workload/use case.  Recovery operations can be 
computationally expensive.  If your load is light because it's the weekend 
you should be able to turn that host back on  as soon as you resolve 
whatever the issue is with minimal impact.  You can also increase the 
priority of the recovery operation to make it go faster if you feel you can 
spare additional IO and it won't affect clients.


We do this in our cluster regularly and have yet to see an issue (given 
that we take care to do it during periods of lower client io)


On January 26, 2019 17:16:38 Götz Reinicke  
wrote:




Hi,

one host out of 10 is down for yet unknown reasons. I guess a power 
failure. I could not yet see the server.


The Cluster is recovering and remapping fine, but still has some objects to 
process.


My question: May I just switch the server back on and in best case, the 24 
OSDs get back online and recovering will do the job without problems.


Or what might be a good way to handle that host? Should I first wait till 
the recovery is finished?


Thanks for feedback and suggestions - Happy Saturday Night  :) . Regards . Götz



--
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications



Götz Reinicke
IT-Koordinator
IT-OfficeNet
+49 7141 969 82420
goetz.reini...@filmakademie.de

Filmakademie Baden-Württemberg GmbH
Akademiehof 10
71638 Ludwigsburg
http://www.filmakademie.de


Eintragung Amtsgericht Stuttgart HRB 205016
Vorsitzende des Aufsichtsrates:
Petra Olschowski
Staatssekretärin im Ministerium für Wissenschaft,
Forschung und Kunst Baden-Württemberg

Geschäftsführer:
Prof. Thomas Schadt

Datenschutzerklärung | Transparenzinformation
Data privacy statement | Transparency information


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One host with 24 OSDs is offline - best way to get it back online

2019-01-26 Thread Chris
It sort of depends on your workload/use case.  Recovery operations can be 
computationally expensive.  If your load is light because it's the weekend 
you should be able to turn that host back on  as soon as you resolve 
whatever the issue is with minimal impact.  You can also increase the 
priority of the recovery operation to make it go faster if you feel you can 
spare additional IO and it won't affect clients.
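A hedged example of the kind of runtime tuning this usually involves (the
values are illustrative, not recommendations, and injectargs changes only
last until the OSDs restart):

ceph tell osd.* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'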


We do this in our cluster regularly and have yet to see an issue (given 
that we take care to do it during periods of lower client io)


On January 26, 2019 17:16:38 Götz Reinicke  
wrote:



Hi,

one host out of 10 is down for yet unknown reasons. I guess a power 
failure. I could not yet see the server.


The Cluster is recovering and remapping fine, but still has some objects to 
process.


My question: May I just switch the server back on and in best case, the 24 
OSDs get back online and recovering will do the job without problems.


Or what might be a good way to handle that host? Should I first wait till 
the recovery is finished?


Thanks for feedback and suggestions - Happy Saturday Night  :) . Regards . Götz


--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Garbage collection growing and db_compaction with small file uploads

2019-01-09 Thread Chris Sarginson
Hi all

I'm seeing some behaviour I wish to check on a Luminous (12.2.10) cluster
that I'm running for rbd and rgw (mostly SATA filestore with NVME journal
with a few SATA only bluestore).  There's a set of dedicated SSD OSDs
running bluestore for the .rgw buckets.index pool and also holding the
.rgw.gc pool

There's a long running upload of small files, which I think is causing a
large amount of leveldb compaction (on filestore nodes) and rocksdb
compaction on bluestore nodes.  The .rgw.buckets bluestore nodes were
exhibiting noticeably higher load than filestore nodes, although this seems
to have been resolved by configuring the following options for the
bluestore SATA osds:

bluestore cache size hdd = 10737418240
osd memory target = 10737418240
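For what it's worth, a quick way to confirm the values an OSD is actually
running with (assuming osd.0 is one of the bluestore OSDs and you are on
its host):

ceph daemon osd.0 config get osd_memory_target
ceph daemon osd.0 config get bluestore_cache_size_hdd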

However the bluestore nodes are still showing significantly higher wait CPU
and higher disk IO than filestore nodes. Is there anything else that I
should be looking at tuning for bluestore, or is this expected due to
the loss of the file cache that filestore benefited from?

Whilst the upload has been running a "radosgw-admin orphans find" was also
being executed, although this was ended manually before completion, as a
significant buildup in garbage collection has occurred.  Looking into this,
it looks like most of the outstanding garbage collection relates to a
single bucket, which was shown to contain a large amount of
multipart/shadow files.  These are now being listed in the radosgw-admin gc
list

# radosgw-admin gc list | grep -c '"oid":'
224557347
# radosgw-admin gc list | grep  '"oid":' | grep -v -c
"default.1084171934.99"
3674322
# radosgw-admin gc list | head -1000 | grep  '"oid":'| grep 1084171934
"oid":
"default.1084171934.99__multipart_ServerImageBackup/95C48F007C44E36C-00-00.mrimg.tmp.2~MZ7fyct8yAWCUX82e9F-j9q-UJcnheP.1",
"oid":
"default.1084171934.99__shadow_ServerImageBackup/95C48F007C44E36C-00-00.mrimg.tmp.2~MZ7fyct8yAWCUX82e9F-j9q-UJcnheP.1_1",
"oid":
"default.1084171934.99__shadow_ServerImageBackup/95C48F007C44E36C-00-00.mrimg.tmp.2~MZ7fyct8yAWCUX82e9F-j9q-UJcnheP.1_2",
"oid":
"default.1084171934.99__shadow_ServerImageBackup/95C48F007C44E36C-00-00.mrimg.tmp.2~MZ7fyct8yAWCUX82e9F-j9q-UJcnheP.1_3",
"oid":
"default.1084171934.99__shadow_ServerImageBackup/95C48F007C44E36C-00-00.mrimg.tmp.2~MZ7fyct8yAWCUX82e9F-j9q-UJcnheP.1_4",
"oid":
"default.1084171934.99__multipart_ServerImageBackup/95C48F007C44E36C-00-00.mrimg.tmp.2~MZ7fyct8yAWCUX82e9F-j9q-UJcnheP.2",

Despite running multiple "radosgw-admin gc process" commands alongside our
radosgw processes, which has helped clean up garbage collection in the
past, our gc list is currently continuing to grow.  I believe I can loop
through this manually and use the rados rm command to remove the objects
from the .rgw.buckets pool after having a look through some historic posts
on this list, and then remove the garbage collection objects - is this a
reasonable solution?  Are there any recommendations for dealing with a
garbage collection list of this size?
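For illustration, the sort of loop I have in mind is sketched below; it is
untested, assumes the data pool is .rgw.buckets as above, and would
obviously need sanity-checking against a few objects first:

radosgw-admin gc list --include-all | grep '"oid":' | cut -d'"' -f4 | \
while read -r oid; do
  rados -p .rgw.buckets rm "$oid"
done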

If there's any additional information I should provide for context here,
please let me know.

Thanks for any help
Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help Ceph Cluster Down

2019-01-03 Thread Chris
If you added OSDs and then deleted them repeatedly without waiting for 
replication to finish as the cluster attempted to re-balance across them, 
it's highly likely that you are permanently missing PGs (especially if the 
disks were zapped each time).


If those 3 down OSDs can be revived there is a (small) chance that you can 
right the ship, but 1400pg/OSD is pretty extreme.  I'm surprised the 
cluster even let you do that - this sounds like a data loss event.



Bring back the 3 OSDs and see what those 2 inconsistent PGs look like with 
ceph pg query.
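Something along these lines should show them (the PG id is a placeholder,
take it from ceph health detail):

ceph health detail | grep inconsistent
ceph pg <pgid> query
rados list-inconsistent-obj <pgid> --format=json-pretty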


On January 3, 2019 21:59:38 Arun POONIA  wrote:

Hi,

Recently I tried adding a new node (OSD) to the ceph cluster using the 
ceph-deploy tool. Since I was experimenting with the tool, I ended up 
deleting the OSDs on the new server a couple of times.


Now, since the ceph OSDs are running on the new server, some cluster PGs seem 
to be inactive (10-15%) and they are not recovering or rebalancing. Not sure 
what to do. I tried shutting down the OSDs on the new server.


Status:
[root@fre105 ~]# ceph -s
2019-01-03 18:56:42.867081 7fa0bf573700 -1 asok(0x7fa0b80017a0) 
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to 
bind the UNIX domain socket to 
'/var/run/ceph-guests/ceph-client.admin.4018644.140328258509136.asok': (2) 
No such file or directory

 cluster:
   id: adb9ad8e-f458-4124-bf58-7963a8d1391f
   health: HEALTH_ERR
   3 pools have many more objects per pg than average
   373907/12391198 objects misplaced (3.018%)
   2 scrub errors
   9677 PGs pending on creation
   Reduced data availability: 7145 pgs inactive, 6228 pgs down, 1 pg peering, 
   2717 pgs stale

   Possible data damage: 2 pgs inconsistent
   Degraded data redundancy: 178350/12391198 objects degraded (1.439%), 346 
   pgs degraded, 1297 pgs undersized

   52486 slow requests are blocked > 32 sec
   9287 stuck requests are blocked > 4096 sec
   too many PGs per OSD (2968 > max 200)

 services:
   mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
   mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
   osd: 39 osds: 36 up, 36 in; 51 remapped pgs
   rgw: 1 daemon active

 data:
   pools:   18 pools, 54656 pgs
   objects: 6050k objects, 10941 GB
   usage:   21727 GB used, 45308 GB / 67035 GB avail
   pgs: 13.073% pgs not active
178350/12391198 objects degraded (1.439%)
373907/12391198 objects misplaced (3.018%)
46177 active+clean
5054  down
1173  stale+down
1084  stale+active+undersized
547   activating
201   stale+active+undersized+degraded
158   stale+activating
96activating+degraded
46stale+active+clean
42activating+remapped
34stale+activating+degraded
23stale+activating+remapped
6 stale+activating+undersized+degraded+remapped
6 activating+undersized+degraded+remapped
2 activating+degraded+remapped
2 active+clean+inconsistent
1 stale+activating+degraded+remapped
1 stale+active+clean+remapped
1 stale+remapped
1 down+remapped
1 remapped+peering

 io:
   client:   0 B/s rd, 208 kB/s wr, 28 op/s rd, 28 op/s wr

Thanks
--
Arun Poonia

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-11-20 Thread Chris Martin
I am also having this problem. Zheng (or anyone else), any idea how to
perform this downgrade on a node that is also a monitor and an OSD
node?

dpkg complains of a dependency conflict when I try to install
ceph-mds_13.2.1-1xenial_amd64.deb:

```
dpkg: dependency problems prevent configuration of ceph-mds:
 ceph-mds depends on ceph-base (= 13.2.1-1xenial); however:
  Version of ceph-base on system is 13.2.2-1xenial.
```

I don't think I want to downgrade ceph-base to 13.2.1.

Thank you,
Chris Martin

> Sorry, this is caused by a wrong backport. Downgrading the mds to 13.2.1 and
> marking the mds repaired can resolve this.
>
> Yan, Zheng
> On Sat, Oct 6, 2018 at 8:26 AM Sergey Malinin  wrote:
> >
> > Update:
> > I discovered http://tracker.ceph.com/issues/24236 and 
> > https://github.com/ceph/ceph/pull/22146
> > Make sure that it is not relevant in your case before proceeding to 
> > operations that modify on-disk data.
> >
> >
> > On 6.10.2018, at 03:17, Sergey Malinin  wrote:
> >
> > I ended up rescanning the entire fs using alternate metadata pool approach 
> > as in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
> > The process has not completed yet because during the recovery our cluster 
> > encountered another problem with OSDs that I got fixed yesterday (thanks to 
> > Igor Fedotov @ SUSE).
> > The first stage (scan_extents) completed in 84 hours (120M objects in data 
> > pool on 8 hdd OSDs on 4 hosts). The second (scan_inodes) was interrupted by 
> > OSD failure, so I have no timing stats, but it seems to be running 2-3 times 
> > faster than extents scan.
> > As to root cause -- in my case I recall that during upgrade I had forgotten 
> > to restart 3 OSDs, one of which was holding metadata pool contents, before 
> > restarting MDS daemons and that seemed to had an impact on MDS journal 
> > corruption, because when I restarted those OSDs, MDS was able to start up 
> > but soon failed throwing lots of 'loaded dup inode' errors.
> >
> >
> > On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky  > gmail.com> wrote:
> >
> > Same problem...
> >
> > # cephfs-journal-tool --journal=purge_queue journal inspect
> > 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.016c
> > Overall journal integrity: DAMAGED
> > Objects missing:
> >   0x16c
> > Corrupt regions:
> >   0x5b00-
> >
> > Just after upgrade to 13.2.2
> >
> > Did you fixed it?
> >
> >
> > On 26/09/18 13:05, Sergey Malinin wrote:
> >
> > Hello,
> > Followed standard upgrade procedure to upgrade from 13.2.1 to 13.2.2.
> > After upgrade MDS cluster is down, mds rank 0 and purge_queue journal are 
> > damaged. Resetting purge_queue does not seem to work well as journal still 
> > appears to be damaged.
> > Can anybody help?
> >
> > mds log:
> >
> >   -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map 
> > to version 586 from mon.2
> >   -788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map i 
> > am now mds.0.583
> >   -787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map 
> > state change up:rejoin --> up:active
> >   -786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done -- 
> > successful recovery!
> > 
> >-38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: 
> > Decode error at read_pos=0x322ec6636
> >-37> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 
> > set_want_state: up:active -> down:damaged
> >-36> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 _send 
> > down:damaged seq 137
> >-35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: 
> > _send_mon_message to mon.ceph3 at mon:6789/0
> >-34> 2018-09-26 18:42:32.707 7f70f28a7700  1 -- mds:6800/e4cc09cf --> 
> > mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 -- 
> > 0x563b321ad480 con 0
> > 
> > -3> 2018-09-26 18:42:32.743 7f70f98b5700  5 -- mds:6800/3838577103 >> 
> > mon:6789/0 conn(0x563b3213e000 :-1 
> > s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx mon.2 seq 
> > 29 0x563b321ab880 mdsbeaco
> > n(85106/mds2 down:damaged seq 311 v587) v7
> > -2> 2018-09-26 18:42:32.743 7f70f98b5700  1 -- mds:6800/3838577103 <== 
> > mon.2 mon:6789/0 29  mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7 
> >  129+0+0 (3296573291 0 0) 0x563b321ab880 con 0x563b3213e
> > 000
> > -1> 2018-09-26 18:42:32.743 7f70f98b5700

Re: [ceph-users] slow ops after cephfs snapshot removal

2018-11-09 Thread Chris Taylor




> On Nov 9, 2018, at 1:38 PM, Gregory Farnum  wrote:
> 
>> On Fri, Nov 9, 2018 at 2:24 AM Kenneth Waegeman  
>> wrote:
>> Hi all,
>> 
>> On Mimic 13.2.1, we are seeing blocked ops on cephfs after removing some 
>> snapshots:
>> 
>> [root@osd001 ~]# ceph -s
>>cluster:
>>  id: 92bfcf0a-1d39-43b3-b60f-44f01b630e47
>>  health: HEALTH_WARN
>>  5 slow ops, oldest one blocked for 1162 sec, mon.mds03 has 
>> slow ops
>> 
>>services:
>>  mon: 3 daemons, quorum mds01,mds02,mds03
>>  mgr: mds02(active), standbys: mds03, mds01
>>  mds: ceph_fs-2/2/2 up  {0=mds03=up:active,1=mds01=up:active}, 1 
>> up:standby
>>  osd: 544 osds: 544 up, 544 in
>> 
>>io:
>>  client:   5.4 KiB/s wr, 0 op/s rd, 0 op/s wr
>> 
>> [root@osd001 ~]# ceph health detail
>> HEALTH_WARN 5 slow ops, oldest one blocked for 1327 sec, mon.mds03 has 
>> slow ops
>> SLOW_OPS 5 slow ops, oldest one blocked for 1327 sec, mon.mds03 has slow ops
>> 
>> [root@osd001 ~]# ceph -v
>> ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic 
>> (stable)
>> 
>> Is this a known issue?
> 
> It's not exactly a known issue, but from the output and story you've got here 
> it looks like the OSDs are deleting the snapshot data too fast and the MDS 
> isn't getting quick enough replies? Or maybe you have an overlarge CephFS 
> directory which is taking a long time to clean up somehow; you should get the 
> MDS ops and the MDS' objecter ops in flight and see what specifically is 
> taking so long.
> -Greg

We had a similar issue on ceph 10.2 and RBD images. It was fixed by slowing 
down snapshot removal by adding this to the ceph.conf. 

[osd]
osd snap trim sleep = 0.6
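If it helps, the same setting can also be injected at runtime without
restarting the OSDs (depending on the release it may warn that a restart
is required, and it only lasts until the next restart):

ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.6'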



>  
>> 
>> Cheers,
>> 
>> Kenneth
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Resolving Large omap objects in RGW index pool

2018-10-18 Thread Chris Sarginson
Hi Tom,

I used a slightly modified version of your script to generate a comparative
list to mine (echoing out the bucket name, id and actual_id), which has
returned substantially more indexes than mine, including a number that
don't show any indication of resharding having been run, or versioning
being enabled, including some with only minor differences in their bucket_ids:

  5813 buckets_with_multiple_reindexes2.txt (my script)
  7999 buckets_with_multiple_reindexes3.txt (modified Tomasz script)

For example a bucket has 2 entries:

default.23404.6
default.23407.8

running "radosgw-admin bucket stats" against this bucket shows the current
id as default.23407.9

None of the indexes (including the active one) shows multiple shards, or
any resharding activities.

Using the command:
rados -p .rgw.buckets.index listomapvals .dir.${id}

Shows the other (lower) index ids as being empty, and the current one
containing the index data.

I'm wondering if it is possible some of these are remnants from upgrades
(this cluster started as giant and has been upgraded through the LTS
releases to Luminous)?  Using radosgw-admin metadata get bucket.instance on
my sample bucket shows different "ver" information between them all:

old:
"ver": {
"tag": "__17wYsZGbXIhRKtx3goicMV",
"ver": 1
},
"mtime": "2014-03-24 15:45:03.00Z"

"ver": {
"tag": "_x5RWprsckrL3Bj8h7Mbwklt",
"ver": 1
},
"mtime": "2014-03-24 15:43:31.00Z"

active:
"ver": {
"tag": "_6sTOABOHCGTSZ-EEIZ29VSN",
"ver": 4
},
"mtime": "2017-08-10 15:06:38.940464Z",

This obviously still leaves me with the original issue noticed, which is
multiple instances of buckets that seem to have been repeatedly resharded
to the same number of shards as the currently active index.  From having a
search around the tracker it seems like this may be worth following -
"Aborted dynamic resharding should clean up created bucket index objs" :

https://tracker.ceph.com/issues/35953

Again, any other suggestions or ideas are greatly welcomed on this :)

Chris

On Wed, 17 Oct 2018 at 12:29 Tomasz Płaza  wrote:

> Hi,
>
> I have a similar issue, and created a simple bash file to delete old
> indexes (it is PoC and have not been tested on production):
>
> for bucket in `radosgw-admin metadata list bucket | jq -r '.[]' | sort`
> do
>   actual_id=`radosgw-admin bucket stats --bucket=${bucket} | jq -r '.id'`
>   for instance in `radosgw-admin metadata list bucket.instance | jq -r
> '.[]' | grep ${bucket}: | cut -d ':' -f 2`
>   do
> if [ "$actual_id" != "$instance" ]
> then
>   radosgw-admin bi purge --bucket=${bucket} --bucket-id=${instance}
>   radosgw-admin metadata rm bucket.instance:${bucket}:${instance}
> fi
>   done
> done
>
> I find it more readable than mentioned one liner. Any sugestions on this
> topic are greatly appreciated.
> Tom
>
> Hi,
>
> Having spent some time on the below issue, here are the steps I took to
> resolve the "Large omap objects" warning.  Hopefully this will help others
> who find themselves in this situation.
>
> I got the object ID and OSD ID implicated from the ceph cluster logfile on
> the mon.  I then proceeded to the implicated host containing the OSD, and
> extracted the implicated PG by running the following, and looking at which
> PG had started and completed a deep-scrub around the warning being logged:
>
> grep -C 200 Large /var/log/ceph/ceph-osd.*.log | egrep '(Large
> omap|deep-scrub)'
>
> If the bucket had not been sharded sufficiently (IE the cluster log showed
> a "Key Count" or "Size" over the thresholds), I ran through the manual
> sharding procedure (shown here:
> https://tracker.ceph.com/issues/24457#note-5)
>
> Once this was successfully sharded, or if the bucket was previously
> sufficiently sharded by Ceph prior to disabling the functionality I was
> able to use the following command (seemingly undocumented for Luminous
> http://docs.ceph.com/docs/mimic/man/8/radosgw-admin/#commands):
>
> radosgw-admin bi purge --bucket ${bucketname} --bucket-id ${old_bucket_id}
>
> I then issued a ceph pg deep-scrub against the PG that had contained the
> Large omap object.
>
> Once I had completed this procedure, my Large omap object warnings went
> away and the cluster returned to HEALTH_OK.
>
> However our radosgw bucket indexes pool now seems to be using
> substantially more space than previously.  Having looked initially at this
> bug, and in particular the first comment:
>
> http://tracker.ceph.com/issues/34307#note-1

Re: [ceph-users] Resolving Large omap objects in RGW index pool

2018-10-16 Thread Chris Sarginson
Hi,

Having spent some time on the below issue, here are the steps I took to
resolve the "Large omap objects" warning.  Hopefully this will help others
who find themselves in this situation.

I got the object ID and OSD ID implicated from the ceph cluster logfile on
the mon.  I then proceeded to the implicated host containing the OSD, and
extracted the implicated PG by running the following, and looking at which
PG had started and completed a deep-scrub around the warning being logged:

grep -C 200 Large /var/log/ceph/ceph-osd.*.log | egrep '(Large
omap|deep-scrub)'

If the bucket had not been sharded sufficiently (IE the cluster log showed
a "Key Count" or "Size" over the thresholds), I ran through the manual
sharding procedure (shown here: https://tracker.ceph.com/issues/24457#note-5
)

Once this was successfully sharded, or if the bucket was previously
sufficiently sharded by Ceph prior to disabling the functionality I was
able to use the following command (seemingly undocumented for Luminous
http://docs.ceph.com/docs/mimic/man/8/radosgw-admin/#commands):

radosgw-admin bi purge --bucket ${bucketname} --bucket-id ${old_bucket_id}

I then issued a ceph pg deep-scrub against the PG that had contained the
Large omap object.

Once I had completed this procedure, my Large omap object warnings went
away and the cluster returned to HEALTH_OK.

However our radosgw bucket indexes pool now seems to be using substantially
more space than previously.  Having looked initially at this bug, and in
particular the first comment:

http://tracker.ceph.com/issues/34307#note-1

I was able to extract a number of bucket indexes that had apparently been
resharded, and removed the legacy index using the radosgw-admin bi purge
--bucket ${bucket} ${marker}.  I am still able  to perform a radosgw-admin
metadata get bucket.instance:${bucket}:${marker} successfully, however now
when I run rados -p .rgw.buckets.index ls | grep ${marker} nothing is
returned.  Even after this, we were still seeing extremely high disk usage
of our OSDs containing the bucket indexes (we have a dedicated pool for
this).  I then modified the one liner referenced in the previous link as
follows:

grep -E '"bucket"|"id"|"marker"' bucket-stats.out | awk -F ":" '{print $2}' | tr -d '",' | \
while read -r bucket; do
  read -r id
  read -r marker
  [ "$id" == "$marker" ] && true || \
    NEWID=`radosgw-admin --id rgw.ceph-rgw-1 metadata get bucket.instance:${bucket}:${marker} | \
      python -c 'import sys, json; print json.load(sys.stdin)["data"]["bucket_info"]["new_bucket_instance_id"]'`
  while [ ${NEWID} ]; do
    if [ "${NEWID}" != "${marker}" ] && [ ${NEWID} != ${bucket} ]; then
      echo "$bucket $NEWID"
    fi
    NEWID=`radosgw-admin --id rgw.ceph-rgw-1 metadata get bucket.instance:${bucket}:${NEWID} | \
      python -c 'import sys, json; print json.load(sys.stdin)["data"]["bucket_info"]["new_bucket_instance_id"]'`
  done
done > buckets_with_multiple_reindexes2.txt

This loops through the buckets that have a different marker/bucket_id, and
looks to see if a new_bucket_instance_id is there, and if so will loop
through until there is no longer a "new_bucket_instance_id".  After letting
this complete, this suggests that I have over 5000 indexes for 74 buckets,
some of these buckets have > 100 indexes apparently.

:~# awk '{print $1}' buckets_with_multiple_reindexes2.txt | uniq | wc -l
74
~# wc -l buckets_with_multiple_reindexes2.txt
5813 buckets_with_multiple_reindexes2.txt

This is running a single realm, multiple zone configuration, and no multi
site sync, but the closest I can find to this issue is this bug
https://tracker.ceph.com/issues/24603

Should I be OK to loop through these indexes and remove any with a
reshard_status of 2, a new_bucket_instance_id that does not match the
bucket_instance_id returned by the command:

radosgw-admin bucket stats --bucket ${bucket}

I'd ideally like to get to a point where I can turn dynamic sharding back
on safely for this cluster.

Thanks for any assistance, let me know if there's any more information I
should provide
Chris

On Thu, 4 Oct 2018 at 18:22 Chris Sarginson  wrote:

> Hi,
>
> Thanks for the response - I am still unsure as to what will happen to the
> "marker" reference in the bucket metadata, as this is the object that is
> being detected as Large.  Will the bucket generate a new "marker" reference
> in the bucket metadata?
>
> I've been reading this page to try and get a better understanding of this
> http://docs.ceph.com/docs/luminous/radosgw/layout/
>
> However I'm no clearer on this (and what the "marker" is used for), or why
> there are multiple separate "bucket_id" values (with different mtime
> stamps) that all show as having the same number

Re: [ceph-users] Resolving Large omap objects in RGW index pool

2018-10-04 Thread Chris Sarginson
Hi,

Thanks for the response - I am still unsure as to what will happen to the
"marker" reference in the bucket metadata, as this is the object that is
being detected as Large.  Will the bucket generate a new "marker" reference
in the bucket metadata?

I've been reading this page to try and get a better understanding of this
http://docs.ceph.com/docs/luminous/radosgw/layout/

However I'm no clearer on this (and what the "marker" is used for), or why
there are multiple separate "bucket_id" values (with different mtime
stamps) that all show as having the same number of shards.

If I were to remove the old bucket would I just be looking to execute

rados - p .rgw.buckets.index rm .dir.default.5689810.107

Is the differing marker/bucket_id in the other buckets that was found also
an indicator?  As I say, there's a good number of these, here's some
additional examples, though these aren't necessarily reporting as large
omap objects:

"BUCKET1", "default.281853840.479", "default.105206134.5",
"BUCKET2", "default.364663174.1", "default.349712129.3674",

Checking these other buckets, they are exhibiting the same sort of symptoms
as the first (multiple instances of radosgw-admin metadata get showing what
seem to be multiple resharding processes being run, with different mtimes
recorded).

Thanks
Chris

On Thu, 4 Oct 2018 at 16:21 Konstantin Shalygin  wrote:

> Hi,
>
> Ceph version: Luminous 12.2.7
>
> Following upgrading to Luminous from Jewel we have been stuck with a
> cluster in HEALTH_WARN state that is complaining about large omap objects.
> These all seem to be located in our .rgw.buckets.index pool.  We've
> disabled auto resharding on bucket indexes due to seeming looping issues
> after our upgrade.  We've reduced the number of reported large
> omap objects by initially increasing the following value:
>
> ~# ceph daemon mon.ceph-mon-1 config get
> osd_deep_scrub_large_omap_object_value_sum_threshold
> {
> "osd_deep_scrub_large_omap_object_value_sum_threshold": "2147483648"
> }
>
> However we're still getting a warning about a single large OMAP object,
> however I don't believe this is related to an unsharded index - here's the
> log entry:
>
> 2018-10-01 13:46:24.427213 osd.477 osd.477 172.26.216.6:6804/2311858 8482 :
> cluster [WRN] Large omap object found. Object:
> 15:333d5ad7:::.dir.default.5689810.107:head Key count: 17467251 Size
> (bytes): 4458647149
>
> The object in the logs is the "marker" object, rather than the bucket_id -
> I've put some details regarding the bucket here:
> https://pastebin.com/hW53kTxL
>
> The bucket limit check shows that the index is sharded, so I think this
> might be related to versioning, although I was unable to get confirmation
> that the bucket in question has versioning enabled through the aws
> cli(snipped debug output below)
>
> 2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
> headers: {'date': 'Tue, 02 Oct 2018 14:11:17 GMT', 'content-length': '137',
> 'x-amz-request-id': 'tx0020e3b15-005bb37c85-15870fe0-default',
> 'content-type': 'application/xml'}
> 2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
> body:
>  xmlns="http://s3.amazonaws.com/doc/2006-03-01/;>
>
> After dumping the contents of large omap object mentioned above into a file
> it does seem to be a simple listing of the bucket contents, potentially an
> old index:
>
> ~# wc -l omap_keys
> 17467251 omap_keys
>
> This is approximately 5 million below the currently reported number of
> objects in the bucket.
>
> When running the commands listed 
> here:http://tracker.ceph.com/issues/34307#note-1
>
> The problematic bucket is listed in the output (along with 72 other
> buckets):
> "CLIENTBUCKET", "default.294495648.690", "default.5689810.107"
>
> As this tests for bucket_id and marker fields not matching to print out the
> information, is the implication here that both of these should match in
> order to fully migrate to the new sharded index?
>
> I was able to do a "metadata get" using what appears to be the old index
> object ID, which seems to support this (there's a "new_bucket_instance_id"
> field, containing a newer "bucket_id" and reshard_status is 2, which seems
> to suggest it has completed).
>
> I am able to take the "new_bucket_instance_id" and get additional metadata
> about the bucket, each time I do this I get a slightly newer
> "new_bucket_instance_id", until it stops suggesting updated indexes.
>
> It's probably worth po

[ceph-users] Resolving Large omap objects in RGW index pool

2018-10-04 Thread Chris Sarginson
Hi,

Ceph version: Luminous 12.2.7

Following upgrading to Luminous from Jewel we have been stuck with a
cluster in HEALTH_WARN state that is complaining about large omap objects.
These all seem to be located in our .rgw.buckets.index pool.  We've
disabled auto resharding on bucket indexes due to seeming looping issues
after our upgrade.  We've reduced the number of reported large
omap objects by initially increasing the following value:

~# ceph daemon mon.ceph-mon-1 config get
osd_deep_scrub_large_omap_object_value_sum_threshold
{
"osd_deep_scrub_large_omap_object_value_sum_threshold": "2147483648"
}

However we're still getting a warning about a single large OMAP object,
however I don't believe this is related to an unsharded index - here's the
log entry:

2018-10-01 13:46:24.427213 osd.477 osd.477 172.26.216.6:6804/2311858 8482 :
cluster [WRN] Large omap object found. Object:
15:333d5ad7:::.dir.default.5689810.107:head Key count: 17467251 Size
(bytes): 4458647149

The object in the logs is the "marker" object, rather than the bucket_id -
I've put some details regarding the bucket here:

https://pastebin.com/hW53kTxL

The bucket limit check shows that the index is sharded, so I think this
might be related to versioning, although I was unable to get confirmation
that the bucket in question has versioning enabled through the aws
cli(snipped debug output below)

2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
headers: {'date': 'Tue, 02 Oct 2018 14:11:17 GMT', 'content-length': '137',
'x-amz-request-id': 'tx0020e3b15-005bb37c85-15870fe0-default',
'content-type': 'application/xml'}
2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
body:
http://s3.amazonaws.com/doc/2006-03-01/;>

After dumping the contents of large omap object mentioned above into a file
it does seem to be a simple listing of the bucket contents, potentially an
old index:

~# wc -l omap_keys
17467251 omap_keys

This is approximately 5 million below the currently reported number of
objects in the bucket.

When running the commands listed here:
http://tracker.ceph.com/issues/34307#note-1

The problematic bucket is listed in the output (along with 72 other
buckets):
"CLIENTBUCKET", "default.294495648.690", "default.5689810.107"

As this tests for bucket_id and marker fields not matching to print out the
information, is the implication here that both of these should match in
order to fully migrate to the new sharded index?

I was able to do a "metadata get" using what appears to be the old index
object ID, which seems to support this (there's a "new_bucket_instance_id"
field, containing a newer "bucket_id" and reshard_status is 2, which seems
to suggest it has completed).

I am able to take the "new_bucket_instance_id" and get additional metadata
about the bucket, each time I do this I get a slightly newer
"new_bucket_instance_id", until it stops suggesting updated indexes.

It's probably worth pointing out that when going through this process the
final "bucket_id" doesn't match the one that I currently get when running
'radosgw-admin bucket stats --bucket "CLIENTBUCKET"', even though it also
suggests that no further resharding has been done as "reshard_status" = 0
and "new_bucket_instance_id" is blank.  The output is available to view
here:

https://pastebin.com/g1TJfKLU

It would be useful if anyone can offer some clarification on how to proceed
from this situation, identifying and removing any old/stale indexes from
the index pool (if that is the case), as I've not been able to spot
anything in the archives.

If there's any further information that is needed for additional context
please let me know.

Thanks
Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Intermittent slow/blocked requests on one node

2018-08-22 Thread Chris Martin
Hi ceph-users,

A few weeks ago, I had an OSD node -- ceph02 -- lock up hard with no
indication why. I reset the system and everything came back OK, except
that I now get intermittent warnings about slow/blocked requests from
OSDs on the other nodes, waiting for a "subop" to complete on one of
ceph02's OSDs. Each of these blocked requests will persist for a few
(5-8?) minutes, then complete. (I see this using the admin socket to
"dump_ops_in_flight" and "dump_historic_slow_ops".)

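For anyone following along, the admin socket calls I mean are of this form
(osd.2 is just the OSD currently being complained about; run this on the
host that owns it):

```
ceph daemon osd.2 dump_ops_in_flight
ceph daemon osd.2 dump_historic_slow_ops
```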
I have tried several things to fix the issue, including rebuilding
ceph02 completely! Wiping and reinstalling the OS, purging and
re-creating OSDs. All disks reporting "OK" for SMART health status.
The only effective intervention has been to mark all of ceph02's OSDs
as "out".

At this point I strongly suspect a hardware/firmware issue. Two
questions for you folks while I dig into that:

1. Any more diagnostics that I should try to troubleshoot the delayed
subops in Ceph? Perhaps identify what is causing the delay?

2. When an OSD is complaining about a slow/blocked request (waiting
for sub ops), do RBD clients actually notice this, or does it appear
to the client that the write has completed?

Thank you! Information about my cluster and example warning messages follow.

Chris Martin

About my cluster: Luminous (12.2.4), 5 nodes, each with 12 OSDs (one
rotary HDD per OSD), and a shared SSD in each node with 24 partitions
for all the RocksDB databases and WALs. Systems are Supermicro
6028R-E1CR12T with RAID controller (LSI SAS 3108) set to JBOD mode.
Deployed with ceph-ansible and using Bluestore. Bonded 10 gbps links
throughout (20 gbps each for for client network and cluster network).

```
HEALTH_WARN 2 slow requests are blocked > 32 sec
REQUEST_SLOW 2 slow requests are blocked > 32 sec
2 ops are blocked > 262.144 sec
osd.2 has blocked requests > 262.144 sec
```


```
{
"description": "osd_op(client.84174831.0:45611220 10.1e0
10:07b8635b:::rbd_data.d091f474b0dc51.6084:head [write
716800~4096] snapc 3c3=[3c3] ondisk+write+known_if_redirected e7305)",
"initiated_at": "2018-08-10 14:21:20.507929",
"age": 317.226205,
"duration": 294.342909,
"type_data": {
"flag_point": "commit sent; apply or cleanup",
"client_info": {
"client": "client.84174831",
"client_addr": "10.140.120.206:0/2228066036",
"tid": 45611220
},
"events": [
{
"time": "2018-08-10 14:21:20.507929",
"event": "initiated"
},
{
"time": "2018-08-10 14:21:20.508035",
"event": "queued_for_pg"
},
{
"time": "2018-08-10 14:21:20.508102",
"event": "reached_pg"
},
{
"time": "2018-08-10 14:21:20.508192",
"event": "started"
},
{
"time": "2018-08-10 14:21:20.508331",
"event": "waiting for subops from 12,21,60"
},
{
"time": "2018-08-10 14:21:20.509890",
"event": "op_commit"
},
{
"time": "2018-08-10 14:21:20.509895",
"event": "op_applied"
},
{
"time": "2018-08-10 14:21:20.510475",
"event": "sub_op_commit_rec from 12"
},
{
"time": "2018-08-10 14:21:20.510526",
"event": "sub_op_commit_rec from 21"
},
{
"time": "2018-08-10 14:26:14.850653",
"event": "sub_op_commit_rec from 60"
},
{
"time": "2018-08-10 14:26:14.850728",
"event": "commit_sent"
},
{
"time": "2018-08-10 14:26:14.850838",
"event": "done"
}
]
}
}
```
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph plugin balancer error

2018-07-05 Thread Chris Hsiang
I have tried to modify /usr/lib64/ceph/mgr/balancer/module.py to
replace iteritems() with items(), but I still get the following error:

g1:/usr/lib64/ceph/mgr/balancer # ceph balancer status
Error EINVAL: Traceback (most recent call last):
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 297, in handle_command
return (0, json.dumps(s, indent=4), '')
  File "/usr/lib64/python3.6/json/__init__.py", line 238, in dumps
**kw).encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 201, in encode
chunks = list(chunks)
  File "/usr/lib64/python3.6/json/encoder.py", line 430, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib64/python3.6/json/encoder.py", line 404, in _iterencode_dict
yield from chunks
  File "/usr/lib64/python3.6/json/encoder.py", line 437, in _iterencode
o = _default(o)
  File "/usr/lib64/python3.6/json/encoder.py", line 180, in default
o.__class__.__name__)
TypeError: Object of type 'dict_keys' is not JSON serializable

It seems to me that ceph-mgr is compiled/written for Python 3.6 but the
balancer plugin is written for Python 2.7... this might be related:

https://github.com/ceph/ceph/pull/21446

this might be opensuse building ceph package issue
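One way to check which Python the mgr binary itself was built against
might be to look at its linked libraries (assuming the openSUSE build
links libpython directly):

ldd /usr/bin/ceph-mgr | grep -i python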

chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph plugin balancer error

2018-07-05 Thread Chris Hsiang
The weird thing is that:

g1:~ # locate /bin/python
/usr/bin/python
/usr/bin/python2
/usr/bin/python2.7
/usr/bin/python3
/usr/bin/python3.6
/usr/bin/python3.6m

g1:~ # ls /usr/bin/python* -al
lrwxrwxrwx 1 root root 9 May 13 07:41 /usr/bin/python -> python2.7
lrwxrwxrwx 1 root root 9 May 13 07:41 /usr/bin/python2 -> python2.7
-rwxr-xr-x 1 root root  6304 May 13 07:41 /usr/bin/python2.7
lrwxrwxrwx 1 root root 9 May 13 08:39 /usr/bin/python3 -> python3.6
-rwxr-xr-x 2 root root 10456 May 13 08:39 /usr/bin/python3.6
-rwxr-xr-x 2 root root 10456 May 13 08:39 /usr/bin/python3.6m

my default python env is 2.7... so the dict object should have the
iteritems method...

Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph plugin balancer error

2018-07-05 Thread Chris Hsiang
Hi,

I am running a test on ceph mimic 13.0.2.1874+ge31585919b-lp150.1.2 using
openSUSE-Leap-15.0.

When I run "ceph balancer status", it errors out.

g1:/var/log/ceph # ceph balancer status
Error EIO: Module 'balancer' has experienced an error and cannot handle
commands: 'dict' object has no attribute 'iteritems'

What configuration needs to be done in order to get it to work?

Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Frequent slow requests

2018-06-19 Thread Chris Taylor




On 2018-06-19 12:17 pm, Frank de Bot (lists) wrote:

Frank (lists) wrote:

Hi,

On a small cluster (3 nodes) I frequently have slow requests. When
dumping the inflight ops from the hanging OSD, it seems it doesn't get a
'response' for one of the subops. The events always look like:



I've done some further testing; all slow requests are blocked by OSDs on
a single host.  How can I debug this problem further? I can't find any
errors or other strange things on the host with OSDs that are seemingly
not sending a response to an op.

I don't know if you have already checked, but we usually find a bad 
drive after running 'smartctl -t long' or the OSD node is starting to 
use the swap space because of memory usage.
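For example (the device name is just a placeholder):

smartctl -t long /dev/sdX   # start the long self-test on a suspect drive
smartctl -a /dev/sdX        # check the results once it completes
free -h                     # quick check for swap usage on the OSD node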




Regards,

Frank de Bot

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journal flushed on osd clean shutdown?

2018-06-13 Thread Chris Dunlop

Excellent news - tks!

On Wed, Jun 13, 2018 at 11:50:15AM +0200, Wido den Hollander wrote:

On 06/13/2018 11:39 AM, Chris Dunlop wrote:

Hi,

Is the osd journal flushed completely on a clean shutdown?

In this case, with Jewel, and FileStore osds, and a "clean shutdown" being:


It is, a Jewel OSD will flush its journal on a clean shutdown. The
flush-journal is no longer needed.

Wido

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Journal flushed on osd clean shutdown?

2018-06-13 Thread Chris Dunlop

Hi,

Is the osd journal flushed completely on a clean shutdown?

In this case, with Jewel, and FileStore osds, and a "clean shutdown" 
being:


systemctl stop ceph-osd@${osd}

I understand it's documented practice to issue a --flush-journal after 
shutting down an osd if you're intending to do anything with the 
journal, but herein lies the sorry tale...
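For reference, the documented sequence for a FileStore/Jewel OSD is 
typically along the lines of:

systemctl stop ceph-osd@${osd}
ceph-osd -i ${osd} --flush-journal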


I've accidentally issued a 'blkdiscard' on a whole SSD device containing 
the journals for multiple osds, rather than for a specific partition as 
intended.


The affected osds themselves continue to work along happily.

I assume the journals are write-only during normal operation, in which 
case it's understandable the osds are oblivious to the underlying 
zeroing of the journals (and partition table!).


The GPT partition table and the individual journal partition types and 
guids etc. have been recreated, so, in theory at least, a clean shutdown 
and restart should be fine *if* the clean shutdown means there's nothing 
in the journal to replay on startup.


I've experimented with one of the affected osds (used for "scratch" 
purposes, so safe to play with), shutting it down and starting it up 
again, and it seems to be happy - somewhat to my surprise. I thought I'd 
have to at least use --mkjournal before it would start up again, to 
reinstate whatever header/signature is used in the journals.


There are other affected osds which hold live data, so I want to be more 
careful there.


One option is to simply kill the affected osds and recreate them, and 
allow the data redundancy to take care of things.


However I'm wondering if things should theoretically be ok if I carefully 
shutdown and restart each of the remaining osds in turn, or am I taking 
some kind of data corruption risk?


Tks,

Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2018-02-22 Thread Chris Sarginson
Hi Caspar,

Sean and I replaced the problematic DC S4600 disks (after all but one had
failed) in our cluster with Samsung SM863a disks.
There was an NDA for new Intel firmware (as mentioned earlier in the thread
by David) but given the problems we were experiencing we moved all Intel
disks to a single failure domain but were unable to get to deploy
additional firmware to test.

The Samsung should fit your requirements.

http://www.samsung.com/semiconductor/minisite/ssd/product/enterprise/sm863a/

Regards
Chris

On Thu, 22 Feb 2018 at 12:50 Caspar Smit <caspars...@supernas.eu> wrote:

> Hi Sean and David,
>
> Do you have any follow ups / news on the Intel DC S4600 case? We are
> looking into this drives to use as DB/WAL devices for a new to be build
> cluster.
>
> Did Intel provide anything (like new firmware) which should fix the issues
> you were having or are these drives still unreliable?
>
> At the moment we are also looking into the Intel DC S3610 as an
> alternative which are a step back in performance but should be very
> reliable.
>
> Maybe any other recommendations for a ~200GB 2,5" SATA SSD to use as
> DB/WAL? (Aiming for ~3 DWPD should be sufficient for DB/WAL?)
>
> Kind regards,
> Caspar
>
> 2018-01-12 15:45 GMT+01:00 Sean Redmond <sean.redmo...@gmail.com>:
>
>> Hi David,
>>
>> To follow up on this I had a 4th drive fail (out of 12) and have opted to
>> order the below disks as a replacement, I have an ongoing case with Intel
>> via the supplier - Will report back anything useful - But I am going to
>> avoid the Intel s4600 2TB SSD's for the moment.
>>
>> 1.92TB Samsung SM863a 2.5" Enterprise SSD, SATA3 6Gb/s, 2-bit MLC V-NAND
>>
>> Regards
>> Sean Redmond
>>
>> On Wed, Jan 10, 2018 at 11:08 PM, Sean Redmond <sean.redmo...@gmail.com>
>> wrote:
>>
>>> Hi David,
>>>
>>> Thanks for your email, they are connected inside a Dell R730XD (2.5 inch
>>> 24 disk model) in Non-RAID mode via a PERC RAID card.
>>>
>>> The version of ceph is Jewel with kernel 4.13.X and ubuntu 16.04.
>>>
>>> Thanks for your feedback on the HGST disks.
>>>
>>> Thanks
>>>
>>> On Wed, Jan 10, 2018 at 10:55 PM, David Herselman <d...@syrex.co> wrote:
>>>
>>>> Hi Sean,
>>>>
>>>>
>>>>
>>>> No, Intel’s feedback has been… Pathetic… I have yet to receive anything
>>>> more than a request to ‘sign’ a non-disclosure agreement, to obtain beta
>>>> firmware. No official answer as to whether or not one can logically unlock
>>>> the drives, no answer to my question whether or not Intel publish serial
>>>> numbers anywhere pertaining to recalled batches and no information
>>>> pertaining to whether or not firmware updates would address any known
>>>> issues.
>>>>
>>>>
>>>>
>>>> This with us being an accredited Intel Gold partner…
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> We’ve returned the lot and ended up with 9/12 of the drives failing in
>>>> the same manner. The replaced drives, which had different serial number
>>>> ranges, also failed. Very frustrating is that the drives fail in a way that
>>>> results in unbootable servers, unless one adds ‘rootdelay=240’ to the 
>>>> kernel.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> I would be interested to know what platform your drives were in and
>>>> whether or not they were connected to a RAID module/card.
>>>>
>>>>
>>>>
>>>> PS: After much searching we’ve decided to order the NVMe conversion kit
>>>> and have ordered HGST UltraStar SN200 2.5 inch SFF drives with a 3 DWPD
>>>> rating.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Regards
>>>>
>>>> David Herselman
>>>>
>>>>
>>>>
>>>> *From:* Sean Redmond [mailto:sean.redmo...@gmail.com]
>>>> *Sent:* Thursday, 11 January 2018 12:45 AM
>>>> *To:* David Herselman <d...@syrex.co>
>>>> *Cc:* Christian Balzer <ch...@gol.com>; ceph-users@lists.ceph.com
>>>>
>>>> *Subject:* Re: [ceph-users] Many concurrent drive failures - How do I
>>>> activate pgs?
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> I have a case where 3 out of 12 of 

[ceph-users] ceph mons de-synced from rest of cluster?

2018-02-11 Thread Chris Apsey

All,

Recently doubled the number of OSDs in our cluster, and towards the end 
of the rebalancing, I noticed that recovery IO fell to nothing and that 
the ceph mons eventually looked like this when I ran ceph -s


  cluster:
id: 6a65c3d0-b84e-4c89-bbf7-a38a1966d780
health: HEALTH_WARN
34922/4329975 objects misplaced (0.807%)
Reduced data availability: 542 pgs inactive, 49 pgs 
peering, 13502 pgs stale
Degraded data redundancy: 248778/4329975 objects 
degraded (5.745%), 7319 pgs unclean, 2224 pgs degraded, 1817 pgs 
undersized


  services:
mon: 3 daemons, quorum cephmon-0,cephmon-1,cephmon-2
mgr: cephmon-0(active), standbys: cephmon-1, cephmon-2
osd: 376 osds: 376 up, 376 in

  data:
pools:   9 pools, 13952 pgs
objects: 1409k objects, 5992 GB
usage:   31528 GB used, 1673 TB / 1704 TB avail
pgs: 3.225% pgs unknown
 0.659% pgs not active
 248778/4329975 objects degraded (5.745%)
 34922/4329975 objects misplaced (0.807%)
 6141 stale+active+clean
 4537 stale+active+remapped+backfilling
 1575 stale+active+undersized+degraded
 489  stale+active+clean+remapped
 450  unknown
 396  stale+active+recovery_wait+degraded
 216  
stale+active+undersized+degraded+remapped+backfilling

 40   stale+peering
 30   stale+activating
 24   stale+active+undersized+remapped
 22   stale+active+recovering+degraded
 13   stale+activating+degraded
 9stale+remapped+peering
 4stale+active+remapped+backfill_wait
 3stale+active+clean+scrubbing+deep
 2
stale+active+undersized+degraded+remapped+backfill_wait

 1stale+active+remapped

The problem is, everything works fine.  If I run ceph health detail and 
do a pg query against one of the 'degraded' placement groups, it reports 
back as active+clean.  All clients in the cluster can write and read at 
normal speeds, but no IO information is ever reported in ceph -s.
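
For example, picking one of the supposedly stale/degraded pgs at random
(the pg id below is just an illustrative one, not from my cluster):

ceph health detail | grep -m 5 stale
ceph pg 3.1fc query | grep '"state"'
# and, on a mon host, the mon's own view of itself:
ceph daemon mon.cephmon-0 mon_status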


From what I can see, everything in the cluster is working properly 
except the actual reporting on the status of the cluster.  Has anyone 
seen this before/know how to sync the mons up to what the OSDs are 
actually reporting?  I see no connectivity errors in the logs of the 
mons or the osds.


Thanks,

---
v/r

Chris Apsey
bitskr...@bitskrieg.net
https://www.bitskrieg.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increase recovery / backfilling speed (with many small objects)

2018-01-05 Thread Chris Sarginson
You probably want to consider increasing osd max backfills

You should be able to inject this online

http://docs.ceph.com/docs/luminous/rados/configuration/osd-config-ref/

You might want to drop your osd recovery max active settings back down to
around 2 or 3, although with it being SSD your performance will probably be
fine.
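
Something along these lines should do it online (the values are only
examples, adjust for your hardware):

ceph tell osd.* injectargs '--osd-max-backfills 4'
ceph tell osd.* injectargs '--osd-recovery-max-active 3'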


On Fri, 5 Jan 2018 at 20:13 Stefan Kooman  wrote:

> Hi,
>
> I know I'm not the only one with this question as I have see similar
> questions on this list:
> How to speed up recovery / backfilling?
>
> Current status:
>
> pgs: 155325434/800312109 objects degraded (19.408%)
>  1395 active+clean
>  440  active+undersized+degraded+remapped+backfill_wait
>  21   active+undersized+degraded+remapped+backfilling
>
>   io:
> client:   180 kB/s rd, 5776 kB/s wr, 273 op/s rd, 440 op/s wr
> recovery: 2990 kB/s, 109 keys/s, 114 objects/s
>
> What we did? Shutdown one DC. Fill cluster with loads of objects, turn
> DC back on (size = 3, min_size=2). To test exactly this: recovery.
>
> I have been going trough all the recovery options (including legacy) but
> I cannot get the recovery speed to increase:
>
> osd_recovery_op_priority 63
> osd_client_op_priority 3
>
> ^^ yup, reversed those, to no avail
>
> osd_recovery_max_active 10'
>
> ^^ This helped for a short period of time, and then it went back to
> "slow" mode
>
> osd_recovery_max_omap_entries_per_chunk 0
> osd_recovery_max_chunk 67108864
>
> Haven't seen any change in recovery speed.
>
> osd_recovery_sleep_ssd": "0.00
> ^^ default for SSD
>
> The whole cluster is idle, ODSs have very low load. What can be the
> reason for the slow recovery? Something is holding it back but I cannot
> think of what.
>
> Ceph Luminous 12.2.2 (bluestore on lvm, all SSD)
>
> Thanks,
>
> Stefan
>
> --
> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688
> <+31%20318%20648%20688> / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Switch to replica 3

2017-11-20 Thread Chris Taylor


On 2017-11-20 3:39 am, Matteo Dacrema wrote:

Yes I mean the existing Cluster.
SSDs are on a fully separate pool.
Cluster is not busy during recovery and deep scrubs but I think it’s
better to limit replication in some way when switching to replica 3.

My question is whether I need to set some option parameters
to limit the impact of the creation of new objects. I’m also concerned
about disks filling up during recovery because of inefficient data
balancing.


You can try using osd_recovery_sleep to slow down the backfilling so it 
does not cause the client io to hang.


ceph tell osd.* injectargs "--osd_recovery_sleep 0.1"
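
The switch itself is then just a pool-level setting, e.g. (the pool name
here is only an example):

ceph osd pool set volumes size 3
ceph osd pool set volumes min_size 2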




Here osd tree

ID  WEIGHTTYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
-10  19.69994 root ssd
-11   5.06998 host ceph101
166   0.98999 osd.166   up  1.0  1.0
167   1.0 osd.167   up  1.0  1.0
168   1.0 osd.168   up  1.0  1.0
169   1.07999 osd.169   up  1.0  1.0
170   1.0 osd.170   up  1.0  1.0
-12   4.92998 host ceph102
171   0.98000 osd.171   up  1.0  1.0
172   0.92999 osd.172   up  1.0  1.0
173   0.98000 osd.173   up  1.0  1.0
174   1.0 osd.174   up  1.0  1.0
175   1.03999 osd.175   up  1.0  1.0
-13   4.69998 host ceph103
176   0.84999 osd.176   up  1.0  1.0
177   0.84999 osd.177   up  1.0  1.0
178   1.0 osd.178   up  1.0  1.0
179   1.0 osd.179   up  1.0  1.0
180   1.0 osd.180   up  1.0  1.0
-14   5.0 host ceph104
181   1.0 osd.181   up  1.0  1.0
182   1.0 osd.182   up  1.0  1.0
183   1.0 osd.183   up  1.0  1.0
184   1.0 osd.184   up  1.0  1.0
185   1.0 osd.185   up  1.0  1.0
 -1 185.19835 root default
 -2  18.39980 host ceph001
 63   0.7 osd.63up  1.0  1.0
 64   0.7 osd.64up  1.0  1.0
 65   0.7 osd.65up  1.0  1.0
146   0.7 osd.146   up  1.0  1.0
147   0.7 osd.147   up  1.0  1.0
148   0.90999 osd.148   up  1.0  1.0
149   0.7 osd.149   up  1.0  1.0
150   0.7 osd.150   up  1.0  1.0
151   0.7 osd.151   up  1.0  1.0
152   0.7 osd.152   up  1.0  1.0
153   0.7 osd.153   up  1.0  1.0
154   0.7 osd.154   up  1.0  1.0
155   0.8 osd.155   up  1.0  1.0
156   0.84999 osd.156   up  1.0  1.0
157   0.7 osd.157   up  1.0  1.0
158   0.7 osd.158   up  1.0  1.0
159   0.84999 osd.159   up  1.0  1.0
160   0.90999 osd.160   up  1.0  1.0
161   0.90999 osd.161   up  1.0  1.0
162   0.90999 osd.162   up  1.0  1.0
163   0.7 osd.163   up  1.0  1.0
164   0.90999 osd.164   up  1.0  1.0
165   0.64999 osd.165   up  1.0  1.0
 -3  19.41982 host ceph002
 23   0.7 osd.23up  1.0  1.0
 24   0.7 osd.24up  1.0  1.0
 25   0.90999 osd.25up  1.0  1.0
 26   0.5 osd.26up  1.0  1.0
 27   0.95000 osd.27up  1.0  1.0
 28   0.64999 osd.28up  1.0  1.0
 29   0.75000 osd.29up  1.0  1.0
 30   0.8 osd.30up  1.0  1.0
 31   0.90999 osd.31up  1.0  1.0
 32   0.90999 osd.32up  1.0  1.0
 33   0.8 osd.33up  1.0  1.0
 34   0.90999 osd.34up  1.0  1.0
 35   0.90999 osd.35up  1.0  1.0
 36   0.84999 osd.36up  1.0  1.0
 37   0.8 osd.37up  1.0  1.0
 38   1.0 osd.38up  1.0  1.0
 39   0.7 osd.39up  1.0  1.0
 40   0.90999 osd.40up  1.0  1.0
 41   0.84999 osd.41up  1.0  1.0
 42   

Re: [ceph-users] Hammer to Jewel Upgrade - Extreme OSD Boot Time

2017-11-06 Thread Chris Jones
I'll document the resolution here for anyone else who experiences similar 
issues.

We have determined that the root cause of the long boot time was a combination of 
factors relating to the ZFS version and tuning, together with how long 
filenames are handled.

## 1 ## Insufficient ARC cache size.

Dramatically increasing the arc_max and arc_meta_limit allowed better 
performance once the cache had time to populate. Previously, each call to 
getxattr took about 8 ms (0.008 sec). Multiplied by millions of getxattr 
calls during OSD daemon startup, this was taking hours. This only became 
apparent when we upgraded to Jewel: Hammer does not appear to parse all of the 
extended attributes during startup; this behaviour appears to have been introduced 
in Jewel as part of the sortbitwise algorithm.

Increasing the arc_max and arc_meta_limit allowed more of the metadata to be 
cached in memory. This reduced getxattr call duration to between 10 and 100 
microseconds (0.00001 to 0.0001 sec), an average of around 400x faster.
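
For reference, these are the ZFS-on-Linux module parameters in question;
something like the following (the byte values here are placeholders, not
our production numbers):

# runtime
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max        # example: 32 GiB
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_meta_limit # example: 16 GiB
# persistent across reboots, e.g. in /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=34359738368 zfs_arc_meta_limit=17179869184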

## 2 ## ZFS version 0.6.5.11 and inability to store large amounts of meta info 
in the inode/dnode.

My understanding is that the ability to use a larger dnode size to store metadata 
was not introduced until ZFS version 0.7.x. In version 0.6.5.11 this was 
causing large quantities of meta data to be stored in inefficient spill blocks, 
which were taking longer to access since they were not cached due to 
(previously) undersized ARC settings.

## Summary ##

Increasing the ARC cache settings improved performance, but performance will still 
be a concern whenever the ARC is purged/flushed, such as during a system reboot, until the 
cache rebuilds itself.

Upgrading to ZFS version 0.7.x is one potential upgrade path to utilize a larger 
dnode size. Another upgrade path is to switch to XFS, which is the recommended 
filesystem for Ceph. XFS does not appear to require this kind of metadata cache due 
to its different handling of metadata in the inode.
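
If taking the ZFS 0.7.x route, the relevant settings would be something
like the following (pool and dataset names are examples only, and we have
not run this configuration in production):

zpool set feature@large_dnode=enabled tank   # "tank" is a placeholder pool name
zfs set dnodesize=auto tank/osd-9            # placeholder dataset for one OSD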



--
Chris


From: Willem Jan Withagen <w...@digiware.nl>
Sent: Wednesday, November 1, 2017 4:51:52 PM
To: Chris Jones; Gregory Farnum
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Hammer to Jewel Upgrade - Extreme OSD Boot Time

On 01/11/2017 18:04, Chris Jones wrote:
> Greg,
>
> Thanks so much for the reply!
>
> We are not clear on why ZFS is behaving poorly under some circumstances
> on getxattr system calls, but that appears to be the case.
>
> Since the last update we have discovered that back-to-back booting of
> the OSD yields very fast boot time, and very fast getxattr system calls.
>
> A longer period between boots (or perhaps related to influx of new data)
> correlates to longer boot duration. This is due to slow getxattr calls
> of certain types.
>
> We suspect this may be a caching or fragmentation issue with ZFS for
> xattrs. Use of longer filenames appear to make this worse.

As far as I understand is a lot of this data stored in the metadata.
Which is (or can be) a different set in the (l2)arc cache.

So are you talking about a OSD reboot, or a system reboot?
I don't quite understand what you mean by back-to-back...

I have little experience with ZFS on Linux.
So if behaviour there is different is hard for me to tell.

If you are rebooting the OSD, I can imagine that certain sequences
of rebooting pre-load the meta-cache. Reboots further apart can
lead to a different working set in the ZFS caches, and then all data
needs to be refetched instead of coming from the L2ARC.

And note that in newer ZFS versions the in memory ARC even can be
compressed, leading to an even higher hit rate.

For example on my development server with 32Gb memory:
ARC: 20G Total, 1905M MFU, 16G MRU, 70K Anon, 557M Header, 1709M Other
  17G Compressed, 42G Uncompressed, 2.49:1 Ratio

--WjW
>
> We experimented on some OSDs with swapping over to XFS as the
> filesystem, and the problem does not appear to be present on those OSDs.
>
> The two examples below are representative of a Long Boot (longer running
> time and more data influx between osd rebooting) and a Short Boot where
> we booted the same OSD back to back.
>
> Notice the drastic difference in time on the getxattr that yields the
> ENODATA return. Around 0.009 secs for "long boot" and "0.0002" secs when
> the same OSD is booted back to back. Long boot time is approx 40x to 50x
> longer. Multiplied by thousands of getxattr calls, this is/was our
> source of longer boot time.
>
> We are considering a full switch to XFS, but would love to hear any ZFS
> tuning tips that might be a short term workaround.
>
> We are using ZFS 6.5.11 prior to implementation of the ability to use
> large dnodes which would allow the use o

Re: [ceph-users] Hammer to Jewel Upgrade - Extreme OSD Boot Time

2017-11-01 Thread Chris Jones
Greg,

Thanks so much for the reply!

We are not clear on why ZFS is behaving poorly under some circumstances on 
getxattr system calls, but that appears to be the case.

Since the last update we have discovered that back-to-back booting of the OSD 
yields very fast boot time, and very fast getxattr system calls.

A longer period between boots (or perhaps related to influx of new data) 
correlates to longer boot duration. This is due to slow getxattr calls of 
certain types.

We suspect this may be a caching or fragmentation issue with ZFS for xattrs. 
Use of longer filenames appear to make this worse.

We experimented on some OSDs with swapping over to XFS as the filesystem, and 
the problem does not appear to be present on those OSDs.

The two examples below are representative of a Long Boot (longer running time 
and more data influx between osd rebooting) and a Short Boot where we booted 
the same OSD back to back.

Notice the drastic difference in time on the getxattr that yields the ENODATA 
return. Around 0.009 secs for "long boot" and "0.0002" secs when the same OSD 
is booted back to back. Long boot time is approx 40x to 50x longer. Multiplied 
by thousands of getxattr calls, this is/was our source of longer boot time.

We are considering a full switch to XFS, but would love to hear any ZFS tuning 
tips that might be a short term workaround.

We are using ZFS 6.5.11 prior to implementation of the ability to use large 
dnodes which would allow the use of dnodesize=auto.

#Long Boot
<0.44>[pid 3413902] 13:08:00.884238 
getxattr("/osd/9/current/20.86bs3_head/default.34597.7\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebana_1d9e1e82d623f49c994f_0_long",
 "user.cephos.lfn3", 
"default.34597.7\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememamboptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememambo-92d9df789f9aaf007c50c50bb66e70af__head_0177C86B__14__3",
 1024) = 616 <0.44>
<0.008875>[pid 3413902] 13:08:00.884476 
getxattr("/osd/9/current/20.86bs3_head/default.34597.57\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickintheban_79a7acf2d32f4302a1a4_0_long",
 "user.cephos.lfn3-alt", 0x7f849bf95180, 1024) = -1 ENODATA (No data available) 
<0.008875>

#Short Boot
<0.15> [pid 3452111] 13:37:18.604442 
getxattr("/osd/9/current/20.15c2s3_head/default.34597.22\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickintheban_efb8ca13c57689d76797_0_long",
 "user.cephos.lfn3", 
"default.34597.22\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememamboptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememambo-b519f8607a3d9de0f815d18b6905b27d__head_9726F5C2__14__3",
 1024) = 617 <0.15>
<0.18> [pid 3452111] 13:37:18.604546 
getxattr("/osd/9/current/20.15c2s3_head/default.34597.66\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickintheban_0e6d86f58e03d0f6de04_0_long",
 "user.cephos.lfn3-alt", 0x7fd4e8017680, 1024) = -1 ENODATA (No data available) 
<0.18>



------
Christopher J. Jones



From: Gregory Farnum <gfar...@redhat.com>
Sent: Monday, October 30, 2017 6:20:15 PM
To: Chris Jones
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Hammer to Jewel Upgrade - Extreme OSD Boot Time

On Thu, Oct 26, 2017 at 11:33 AM Chris Jones 
<chris.jo...@ctl.io<mailto:chris.jo...@ctl.io&

Re: [ceph-users] Hammer to Jewel Upgrade - Extreme OSD Boot Time

2017-10-26 Thread Chris Jones
The long-running functionality appears to be related to clear_temp_objects() 
in OSD.cc, called from init().


What is this functionality intended to do? Is it required to be run on every 
OSD startup? Any configuration settings that would help speed this up?


--
Christopher J. Jones


From: Chris Jones
Sent: Wednesday, October 25, 2017 12:52:13 PM
To: ceph-users@lists.ceph.com
Subject: Hammer to Jewel Upgrade - Extreme OSD Boot Time


After upgrading from Ceph Hammer to Jewel, we are experiencing extremely long 
OSD boot durations.

This long boot time is a huge concern for us, and we are looking for insight into 
how we can speed up the boot time.

In Hammer, OSD boot time was approx 3 minutes. After upgrading to Jewel, boot 
time is between 1 and 3 HOURS.

This was not surprising during the initial boot after the upgrade; however, we are 
seeing this occur each time an OSD process is restarted.

This is using ZFS. We added the following configuration to ceph.conf as part of 
the upgrade to overcome some filesystem startup issues per the recommendations 
at the following url:

https://github.com/zfsonlinux/zfs/issues/4913

Added ceph.conf configuration:
filestore_max_inline_xattrs = 10
filestore_max_inline_xattr_size = 65536
filestore_max_xattr_value_size = 65536


Example OSD Log (note the long duration at the line containing "osd.191 119292 
crush map has features 281819681652736, adjusting msgr requires for osds"):

2017-10-24 18:01:18.410249 7f1333d08700  1 leveldb: Generated table #524178: 
158056 keys, 1502244 bytes
2017-10-24 18:01:18.805235 7f1333d08700  1 leveldb: Generated table #524179: 
266429 keys, 2129196 bytes
2017-10-24 18:01:19.254798 7f1333d08700  1 leveldb: Generated table #524180: 
197068 keys, 2128820 bytes
2017-10-24 18:01:20.070109 7f1333d08700  1 leveldb: Generated table #524181: 
192675 keys, 2129122 bytes
2017-10-24 18:01:20.947818 7f1333d08700  1 leveldb: Generated table #524182: 
196806 keys, 2128945 bytes
2017-10-24 18:01:21.183475 7f1333d08700  1 leveldb: Generated table #524183: 
63421 keys, 828081 bytes
2017-10-24 18:01:21.477197 7f1333d08700  1 leveldb: Generated table #524184: 
173331 keys, 1348407 bytes
2017-10-24 18:01:21.477226 7f1333d08700  1 leveldb: Compacted 1@2 + 12@3 files 
=> 19838392 bytes
2017-10-24 18:01:21.509952 7f1333d08700  1 leveldb: compacted to: files[ 0 1 66 
551 788 0 0 ]
2017-10-24 18:01:21.512235 7f1333d08700  1 leveldb: Delete type=2 #523994
2017-10-24 18:01:23.142853 7f1349d93800  0 filestore(/osd/191) mount: enabling 
WRITEAHEAD journal mode: checkpoint is not enabled
2017-10-24 18:01:27.927823 7f1349d93800  0  cls/hello/cls_hello.cc:305: 
loading cls_hello
2017-10-24 18:01:27.933105 7f1349d93800  0  cls/cephfs/cls_cephfs.cc:202: 
loading cephfs_size_scan
2017-10-24 18:01:27.960283 7f1349d93800  0 osd.191 119292 crush map has 
features 281544803745792, adjusting msgr requires for clients
2017-10-24 18:01:27.960309 7f1349d93800  0 osd.191 119292 crush map has 
features 281819681652736 was 8705, adjusting msgr requires for mons
2017-10-24 18:01:27.960316 7f1349d93800  0 osd.191 119292 crush map has 
features 281819681652736, adjusting msgr requires for osds
2017-10-24 23:28:09.694213 7f1349d93800  0 osd.191 119292 load_pgs
2017-10-24 23:28:14.757449 7f1333d08700  1 leveldb: Compacting 1@1 + 13@2 files
2017-10-24 23:28:15.002381 7f1333d08700  1 leveldb: Generated table #524185: 
17970 keys, 2128900 bytes
2017-10-24 23:28:15.198899 7f1333d08700  1 leveldb: Generated table #524186: 
22386 keys, 2128610 bytes
2017-10-24 23:28:15.337819 7f1333d08700  1 leveldb: Generated table #524187: 
3890 keys, 371799 bytes
2017-10-24 23:28:15.693433 7f1333d08700  1 leveldb: Generated table #524188: 
21984 keys, 2128947 bytes
2017-10-24 23:28:15.874955 7f1333d08700  1 leveldb: Generated table #524189: 
9565 keys, 1207375 bytes
2017-10-24 23:28:16.253599 7f1333d08700  1 leveldb: Generated table #524190: 
21999 keys, 2129625 bytes
2017-10-24 23:28:16.576250 7f1333d08700  1 leveldb: Generated table #524191: 
21544 keys, 2128033 bytes


Strace on an OSD process during startup reveals what appears to be parsing of 
objects and calling getxattr.

The bulk of the time is spent on parsing the objects and performing the 
getxattr system calls... for example:

(Full lines truncated intentionally for brevity).
[pid 3068964] 
getxattr("/osd/174/current/20.6a4s7_head/default.7385.13\...(omitted)
[pid 3068964] 
getxattr("/osd/174/current/20.6a4s7_head/default.7385.5\...(omitted)
[pid 3068964] 
getxattr("/osd/174/current/20.6a4s7_head/default.7385.5\...(omitted)

Cluster details:
- 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04 3.16.0-77-generic, 72 6TB SAS2 
drives per host, collocated journals)
- Pre-upgrade: Hammer (ceph version 0.94.6)
- Post-upgrade: Jewel (ceph version 10.2.9)
- object storage use only
- erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs)
- fail

[ceph-users] Hammer to Jewel Upgrade - Extreme OSD Boot Time

2017-10-25 Thread Chris Jones
After upgrading from Ceph Hammer to Jewel, we are experiencing extremely long 
OSD boot durations.

This long boot time is a huge concern for us, and we are looking for insight into 
how we can speed up the boot time.

In Hammer, OSD boot time was approx 3 minutes. After upgrading to Jewel, boot 
time is between 1 and 3 HOURS.

This was not surprising during the initial boot after the upgrade; however, we are 
seeing this occur each time an OSD process is restarted.

This is using ZFS. We added the following configuration to ceph.conf as part of 
the upgrade to overcome some filesystem startup issues per the recommendations 
at the following url:

https://github.com/zfsonlinux/zfs/issues/4913

Added ceph.conf configuration:
filestore_max_inline_xattrs = 10
filestore_max_inline_xattr_size = 65536
filestore_max_xattr_value_size = 65536


Example OSD Log (note the long duration at the line containing "osd.191 119292 
crush map has features 281819681652736, adjusting msgr requires for osds"):

2017-10-24 18:01:18.410249 7f1333d08700  1 leveldb: Generated table #524178: 
158056 keys, 1502244 bytes
2017-10-24 18:01:18.805235 7f1333d08700  1 leveldb: Generated table #524179: 
266429 keys, 2129196 bytes
2017-10-24 18:01:19.254798 7f1333d08700  1 leveldb: Generated table #524180: 
197068 keys, 2128820 bytes
2017-10-24 18:01:20.070109 7f1333d08700  1 leveldb: Generated table #524181: 
192675 keys, 2129122 bytes
2017-10-24 18:01:20.947818 7f1333d08700  1 leveldb: Generated table #524182: 
196806 keys, 2128945 bytes
2017-10-24 18:01:21.183475 7f1333d08700  1 leveldb: Generated table #524183: 
63421 keys, 828081 bytes
2017-10-24 18:01:21.477197 7f1333d08700  1 leveldb: Generated table #524184: 
173331 keys, 1348407 bytes
2017-10-24 18:01:21.477226 7f1333d08700  1 leveldb: Compacted 1@2 + 12@3 files 
=> 19838392 bytes
2017-10-24 18:01:21.509952 7f1333d08700  1 leveldb: compacted to: files[ 0 1 66 
551 788 0 0 ]
2017-10-24 18:01:21.512235 7f1333d08700  1 leveldb: Delete type=2 #523994
2017-10-24 18:01:23.142853 7f1349d93800  0 filestore(/osd/191) mount: enabling 
WRITEAHEAD journal mode: checkpoint is not enabled
2017-10-24 18:01:27.927823 7f1349d93800  0  cls/hello/cls_hello.cc:305: 
loading cls_hello
2017-10-24 18:01:27.933105 7f1349d93800  0  cls/cephfs/cls_cephfs.cc:202: 
loading cephfs_size_scan
2017-10-24 18:01:27.960283 7f1349d93800  0 osd.191 119292 crush map has 
features 281544803745792, adjusting msgr requires for clients
2017-10-24 18:01:27.960309 7f1349d93800  0 osd.191 119292 crush map has 
features 281819681652736 was 8705, adjusting msgr requires for mons
2017-10-24 18:01:27.960316 7f1349d93800  0 osd.191 119292 crush map has 
features 281819681652736, adjusting msgr requires for osds
2017-10-24 23:28:09.694213 7f1349d93800  0 osd.191 119292 load_pgs
2017-10-24 23:28:14.757449 7f1333d08700  1 leveldb: Compacting 1@1 + 13@2 files
2017-10-24 23:28:15.002381 7f1333d08700  1 leveldb: Generated table #524185: 
17970 keys, 2128900 bytes
2017-10-24 23:28:15.198899 7f1333d08700  1 leveldb: Generated table #524186: 
22386 keys, 2128610 bytes
2017-10-24 23:28:15.337819 7f1333d08700  1 leveldb: Generated table #524187: 
3890 keys, 371799 bytes
2017-10-24 23:28:15.693433 7f1333d08700  1 leveldb: Generated table #524188: 
21984 keys, 2128947 bytes
2017-10-24 23:28:15.874955 7f1333d08700  1 leveldb: Generated table #524189: 
9565 keys, 1207375 bytes
2017-10-24 23:28:16.253599 7f1333d08700  1 leveldb: Generated table #524190: 
21999 keys, 2129625 bytes
2017-10-24 23:28:16.576250 7f1333d08700  1 leveldb: Generated table #524191: 
21544 keys, 2128033 bytes


Strace on an OSD process during startup reveals what appears to be parsing of 
objects and calling getxattr.

The bulk of the time is spent on parsing the objects and performing the 
getxattr system calls... for example:

(Full lines truncated intentionally for brevity).
[pid 3068964] 
getxattr("/osd/174/current/20.6a4s7_head/default.7385.13\...(omitted)
[pid 3068964] 
getxattr("/osd/174/current/20.6a4s7_head/default.7385.5\...(omitted)
[pid 3068964] 
getxattr("/osd/174/current/20.6a4s7_head/default.7385.5\...(omitted)

Cluster details:
- 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04 3.16.0-77-generic, 72 6TB SAS2 
drives per host, collocated journals)
- Pre-upgrade: Hammer (ceph version 0.94.6)
- Post-upgrade: Jewel (ceph version 10.2.9)
- object storage use only
- erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs)
- failure domain of host
- cluster is currently storing approx 500TB over 200 MObjects



--
Christopher J. Jones

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] remove require_jewel_osds flag after upgrade to kraken

2017-07-13 Thread Chris Sarginson
The flag is fine, it's just to ensure that OSDs from a release before Jewel
can't be added to the cluster:

See http://ceph.com/geen-categorie/v10-2-4-jewel-released/ under "Upgrading
from hammer"

On Thu, 13 Jul 2017 at 07:59 Jan Krcmar  wrote:

> hi,
>
> is it possible to remove the require_jewel_osds flag after upgrade to
> kraken?
>
> $ ceph osd stat
>  osdmap e29021: 40 osds: 40 up, 40 in
> flags sortbitwise,require_jewel_osds,require_kraken_osds
>
> it seems that ceph osd unset does not support require_jewel_osds
>
> $ ceph osd unset require_jewel_osds
> Invalid command:  require_jewel_osds not in
>
> full|pause|noup|nodown|noout|noin|nobackfill|norebalance|norecover|noscrub|nodeep-scrub|notieragent|sortbitwise
> osd unset
> full|pause|noup|nodown|noout|noin|nobackfill|norebalance|norecover|noscrub|nodeep-scrub|notieragent|sortbitwise
> :  unset 
> Error EINVAL: invalid command
>
> is there any way to remove it?
> if not, is it ok to leave the flag there?
>
> thanks
> fous
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] autoconfigured haproxy service?

2017-07-12 Thread Chris Jones
Hi Sage,

The automated tool Cepheus https://github.com/cepheus-io/cepheus does this
with ceph-chef. It's based on JSON data for a given environment. It uses
Chef and Ansible. If someone wanted to break out the HAProxy (ADC) portion
into a package, it has a good model for HAProxy they could look at. It was
originally created due to the need for our own software LB solution instead of a
hardware LB. It also supports keepalived and bird (BGP).
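
For anyone wanting to experiment, the service map Sage describes can be
dumped like this on Luminous (the JSON layout noted below is from memory,
so treat it as approximate):

ceph service dump -f json-pretty
# rgw daemons and their metadata (bound addresses etc.) appear under
# "services" -> "rgw" -> "daemons", which is what an haproxy generator would read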

Thanks

On Tue, Jul 11, 2017 at 11:03 AM, Sage Weil  wrote:

> Hi all,
>
> Luminous features a new 'service map' that lets rgw's (and rgw nfs
> gateways and iscsi gateways and rbd mirror daemons and ...) advertise
> themselves to the cluster along with some metadata (like the addresses
> they are binding to and the services the provide).
>
> It should be pretty straightforward to build a service that
> auto-configures haproxy based on this information so that you can deploy
> an rgw front-end that dynamically reconfigures itself when additional
> rgw's are deployed or removed.  haproxy has a facility to adjust its
> backend configuration at runtime[1].
>
> Anybody interested in tackling this?  Setting up the load balancer in
> front of rgw is one of the more annoying pieces of getting ceph up and
> running in production and until now has been mostly treated as out of
> scope.  It would be awesome if there was an autoconfigured service that
> did it out of the box (and had all the right haproxy options set).
>
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osdmap several thousand epochs behind latest

2017-07-09 Thread Chris Apsey

All,

Had a fairly substantial network interruption that knocked out 
~270 OSDs:


 health HEALTH_ERR
[...]
273/384 in osds are down
noup,nodown,noout flag(s) set
 monmap e2: 3 mons at 
{cephmon-0=10.10.6.0:6789/0,cephmon-1=10.10.6.1:6789/0,cephmon-2=10.10.6.2:6789/0}
election epoch 138, quorum 0,1,2 
cephmon-0,cephmon-1,cephmon-2

mgr no daemons active
 osdmap e37718: 384 osds: 111 up, 384 in; 16764 remapped pgs
flags 
noup,nodown,noout,sortbitwise,require_jewel_osds,require_kraken_osds


We've had network interruptions before, and normally OSDs come back on 
their own, or do so with a service restart.  This time, no such luck 
(I'm guessing the scale was just too much).  After a few hours of trying 
to figure out why OSD services were running on the hosts (according to 
systemd) but marked 'down' in ceph osd tree, I found this thread: 
http://ceph-devel.vger.kernel.narkive.com/ftEN7TOU/70-osd-are-down-and-not-coming-up 
which appears to perfectly describe the scenario (high CPU usage, osdmap 
way out of sync, etc.)


I've taken the steps outlined and set the appropriate flags and am 
monitoring the 'catch up' progress of the OSDs.  The OSD farthest behind 
is about 5000 epochs out of sync, so I assume it will be a few hours 
before I see CPU usage level out.


Once the OSDs are caught up, are there any other steps I should take 
before 'ceph osd unset noup' (or anything to do after)?
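
For reference, one way to watch the catch-up is the following (the osd id
is only an example, and the daemon command has to be run on the host where
that osd lives):

ceph osd stat                  # current cluster osdmap epoch
ceph daemon osd.12 status      # compare "newest_map" against the cluster epoch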


Thanks in advance,

--
v/r

Chris Apsey
bitskr...@bitskrieg.net
https://www.bitskrieg.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-04-26 Thread Chris Apsey

Adam,

Before we deployed our cluster, we did extensive testing on all kinds of 
SSDs, from consumer-grade TLC SATA all the way to enterprise PCIe NVMe 
drives.  We ended up going with a ratio of 1x Intel P3608 PCIe 1.6 TB 
to 12x HGST 10TB SAS3 HDDs.  It provided the best 
price/performance/density balance for us overall.  As a frame of 
reference, we have 384 OSDs spread across 16 nodes.


A few (anecdotal) notes:

1. Consumer SSDs have unpredictable performance under load; write 
latency can go from normal to unusable with almost no warning.  
Enterprise drives generally show much less load sensitivity.
2. Write endurance; while it may appear that having several 
consumer-grade SSDs backing a smaller number of OSDs will yield better 
longevity than an enterprise grade SSD backing a larger number of OSDs, 
the reality is that enterprise drives that use SLC or eMLC are generally 
an order of magnitude more reliable when all is said and done.
3. Power Loss protection (PLP).  Consumer drives generally don't do well 
when power is suddenly lost.  Yes, we should all have UPS, etc., but 
things happen.  Enterprise drives are much more tolerant of 
environmental failures.  Recovering from misplaced objects while also 
attempting to serve clients is no fun.
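
For what it's worth, the usual quick check for journal suitability is a
single-job O_DSYNC write test along these lines (the device path is an
example, and the test destroys whatever is on that device):

fio --name=journal-test --filename=/dev/nvme0n1 --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based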






---
v/r

Chris Apsey
bitskr...@bitskrieg.net
https://www.bitskrieg.net

On 2017-04-26 10:53, Adam Carheden wrote:

What I'm trying to get from the list is /why/ the "enterprise" drives
are important. Performance? Reliability? Something else?

The Intel was the only one I was seriously considering. The others were
just ones I had for other purposes, so I thought I'd see how they fared
in benchmarks.

The Intel was the clear winner, but my tests did show that throughput
tanked with more threads. Hypothetically, if I was throwing 16 OSDs at
it, all with osd op threads = 2, do the benchmarks below not show that
the Hynix would be a better choice (at least for performance)?

Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. Obviously
the single drive leaves more bays free for OSD disks, but is there any
other reason a single S3610 is preferable to 4 S3520s? Wouldn't 
4xS3520s

mean:

a) fewer OSDs go down if the SSD fails

b) better throughput (I'm speculating that the S3610 isn't 4 times
faster than the S3520)

c) load spread across 4 SATA channels (I suppose this doesn't really
matter since the drives can't throttle the SATA bus).


--
Adam Carheden

On 04/26/2017 01:55 AM, Eneko Lacunza wrote:

Adam,

What David said before about SSD drives is very important. I will tell
you another way: use enterprise grade SSD drives, not consumer grade.
Also, pay attention to endurance.

The only suitable drive for Ceph I see in your tests is SSDSC2BB150G7,
and probably it isn't even the most suitable SATA SSD disk from Intel;
better use S3610 o S3710 series.

Cheers
Eneko

El 25/04/17 a las 21:02, Adam Carheden escribió:

On 04/25/2017 11:57 AM, David wrote:

On 19 Apr 2017 18:01, "Adam Carheden" <carhe...@ucar.edu
<mailto:carhe...@ucar.edu>> wrote:

  Does anyone know if XFS uses a single thread to write to its
journal?


You probably know this but just to avoid any confusion, the journal 
in

this context isn't the metadata journaling in XFS, it's a separate
journal written to by the OSD daemons

Ha! I didn't know that.


I think the number of threads per OSD is controlled by the 'osd op
threads' setting which defaults to 2

So the ideal (for performance) CEPH cluster would be one SSD per HDD
with 'osd op threads' set to whatever value fio shows as the optimal
number of threads for that drive then?

I would avoid the SanDisk and Hynix. The s3500 isn't too bad. 
Perhaps
consider going up to a 37xx and putting more OSDs on it. Of course 
with

the caveat that you'll lose more OSDs if it goes down.

Why would you avoid the SanDisk and Hynix? Reliability (I think those
two are both TLC)? Brand trust? If it's my benchmarks in my previous
email, why not the Hynix? It's slower than the Intel, but sort of
decent, at lease compared to the SanDisk.

My final numbers are below, including an older Samsung Evo (MCL I 
think)
which did horribly, though not as bad as the SanDisk. The Seagate is 
a
10kRPM SAS "spinny" drive I tested as a control/SSD-to-HDD 
comparison.


  SanDisk SDSSDA240G, fio  1 jobs:   7.0 MB/s (5 trials)


  SanDisk SDSSDA240G, fio  2 jobs:   7.6 MB/s (5 trials)


  SanDisk SDSSDA240G, fio  4 jobs:   7.5 MB/s (5 trials)


  SanDisk SDSSDA240G, fio  8 jobs:   7.6 MB/s (5 trials)


  SanDisk SDSSDA240G, fio 16 jobs:   7.6 MB/s (5 trials)


  SanDisk SDSSDA240G, fio 32 jobs:   7.6 MB/s (5 trials)


  SanDisk SDSSDA240G, fio 64 jobs:   7.6 MB/s (5 trials)


HFS250G32TND-N1A2A 3P10, fio  1 jobs:   4.2 MB/s (5 trials)


HFS250G32TND-N1A2A 3P10, fio  2 jobs:   0.6 MB/s (5 trials)


HFS250G32TND-N1A2A 3P10, fio  4 jobs:   7.5 MB/s (5 tri

Re: [ceph-users] Creating journal on needed partition

2017-04-17 Thread Chris Apsey

Nikita,

Take a look at 
https://git.cybbh.space/vta/saltstack/tree/master/apps/ceph


Particularly files/init-journal.sh and files/osd-bootstrap.sh

We use Salt to do some of the legwork (templatizing the bootstrap 
process), but for the most part it is all just a bunch of shell scripts 
with some control flow.  We partition an NVMe device and then create 
symlinks from OSDs to the partitions in a pre-determined fashion.  We 
don't use ceph-disk at all.
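
If you would rather do it by hand than adapt those scripts, the rough shape
of the symlink approach for a single osd is the following (the osd id and
paths are examples only; double-check against your own layout before
running anything):

systemctl stop ceph-osd@42
ceph-osd -i 42 --flush-journal      # only if the old journal is still intact
ln -sf /dev/disk/by-partuuid/<journal-partition-uuid> /var/lib/ceph/osd/ceph-42/journal
ceph-osd -i 42 --mkjournal
systemctl start ceph-osd@42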


---
v/r

Chris Apsey
bitskr...@bitskrieg.net
https://www.bitskrieg.net

On 2017-04-17 08:56, Nikita Shalnov wrote:

Hi all.

Is there any way to create an OSD manually which would use a designated
partition of the journal disk (without using ceph-ansible)?

I have journals on SSD disks and each journal disk contains 3
partitions for 3 OSDs.

Example: one of the osds crashed. I changed a disk (sdaa) and want to
prepare the disk for adding to the cluster. Here is the journal, that
should be used by new osd:

/dev/sdaf :

/dev/sdaf2 ceph journal

/dev/sdaf3 ceph journal, for /dev/sdab1

/dev/sdaf1 ceph journal, for /dev/sdz1

You can see that the bad disk used the second partition. If I run ceph-disk
prepare /dev/sdaa /dev/sdaf, ceph-disk creates a /dev/sdaf4 partition
and sets it as the journal for the osd. But I want to use the second
empty partition (/dev/sdaf2). If I delete the /dev/sdaf2 partition, the
behavior doesn't change.

Can someone help me?

BR,

Nikita


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] saving file on cephFS mount using vi takes pause/time

2017-04-13 Thread Chris Sarginson
Is it related to the recovery behaviour of vim creating a swap file,
which I think nano does not do?

http://vimdoc.sourceforge.net/htmldoc/recover.html

A sync into CephFS, I think, needs the write to be confirmed all the way
down by the OSDs performing the write before the confirmation is returned
to the client calling the sync, though I stand to be corrected on that.
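
An easy way to see where the pause goes is to trace the editor's sync
calls, e.g. (the path is only an example):

strace -f -T -e trace=fsync,fdatasync vi /mnt/cephfs/somefile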

On Thu, 13 Apr 2017 at 22:04 Deepak Naidu  wrote:

> Ok, I tried strace to check why vi slows or pauses. It seems to slow on
> *fsync(3)*
>
>
>
> I didn’t see the issue with nano editor.
>
>
>
> --
>
> Deepak
>
>
>
>
>
> *From:* Deepak Naidu
> *Sent:* Wednesday, April 12, 2017 2:18 PM
> *To:* 'ceph-users'
> *Subject:* saving file on cephFS mount using vi takes pause/time
>
>
>
> Folks,
>
>
>
> This is a bit of a weird issue. I am using the CephFS volume to read/write files
> etc. and it's quick, less than a second. But when editing a file on the CephFS
> volume using vi, saving the file takes a couple of seconds,
> something like a sync/flush. The same doesn’t happen on a local filesystem.
>
>
>
> Any pointers is appreciated.
>
>
>
> --
>
> Deepak
> --
> This email message is for the sole use of the intended recipient(s) and
> may contain confidential information.  Any unauthorized review, use,
> disclosure or distribution is prohibited.  If you are not the intended
> recipient, please contact the sender by reply email and destroy all copies
> of the original message.
> --
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster

2017-03-02 Thread Heller, Chris
Success! There was an issue related to my operating system install procedure 
that was causing the journals to become corrupt, but it was not caused by Ceph! 
With that bug fixed, the shutdown procedure described in this thread has been verified 
to work as expected. Thanks for all the help.

-Chris

> On Mar 1, 2017, at 9:39 AM, Peter Maloney 
> <peter.malo...@brockmann-consult.de> wrote:
> 
> On 03/01/17 15:36, Heller, Chris wrote:
>> I see. My journal is specified in ceph.conf. I'm not removing it from the 
>> OSD so sounds like flushing isn't needed in my case.
>> 
> Okay but it seems it's not right if it's saying it's a non-block journal. 
> (meaning a file, not a block device).
> 
> Double check your ceph.conf... make sure the path works, and somehow make 
> sure the [osd.x] actually matches that osd (no idea how to test that, esp. if 
> the osd doesn't start ... maybe just increase logging).
> 
> Or just make a symlink for now, just to see if it solves the problem, which 
> would imply the ceph.conf is wrong.
> 
> 
>> -Chris
>>> On Mar 1, 2017, at 9:31 AM, Peter Maloney 
>>> <peter.malo...@brockmann-consult.de 
>>> <mailto:peter.malo...@brockmann-consult.de>> wrote:
>>> 
>>> On 03/01/17 14:41, Heller, Chris wrote:
>>>> That is a good question, and I'm not sure how to answer. The journal is on 
>>>> its own volume, and is not a symlink. Also how does one flush the journal? 
>>>> That seems like an important step when bringing down a cluster safely.
>>>> 
>>> You only need to flush the journal if you are removing it from the osd, 
>>> replacing it with a different journal.
>>> 
>>> So since your journal is on its own, then you need either a symlink in the 
>>> osd directory named "journal" which points to the device (ideally not 
>>> /dev/sdx but /dev/disk/by-.../), or you put it in the ceph.conf.
>>> 
>>> And since it said you have a non-block journal now, it probably means there 
>>> is a file... you should remove that (rename it to journal.junk until you're 
>>> sure it's not an important file, and delete it later).
>>>> 
>>>>>> This is where I've stopped. All but one OSD came back online. One has 
>>>>>> this backtrace:
>>>>>> 
>>>>>> 2017-02-28 17:44:54.884235 7fb2ba3187c0 -1 journal FileJournal::_open: 
>>>>>> disabling aio for non-block journal.  Use journal_force_aio to force use 
>>>>>> of aio anyway
>>>>> Are the journals inline? or separate? If they're separate, the above 
>>>>> means the journal symlink/config is missing, so it would possibly make a 
>>>>> new journal, which would be bad if you didn't flush the old journal 
>>>>> before.
>>>>> 
>>>>> And also just one osd is easy enough to replace (which I wouldn't do 
>>>>> until the cluster settled down and recovered). So it's lame for it to be 
>>>>> broken, but it's still recoverable if that's the only issue.
>>>> 
>>> 
>>> 
>> 
> 
> 
> -- 
> 
> 
> Peter Maloney
> Brockmann Consult
> Max-Planck-Str. 2
> 21502 Geesthacht
> Germany
> Tel: +49 4152 889 300
> Fax: +49 4152 889 333
> E-mail: peter.malo...@brockmann-consult.de 
> <mailto:peter.malo...@brockmann-consult.de>
> Internet: http://www.brockmann-consult.de
> 



smime.p7s
Description: S/MIME cryptographic signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster

2017-03-01 Thread Heller, Chris
I see. My journal is specified in ceph.conf. I'm not removing it from the OSD 
so sounds like flushing isn't needed in my case.

-Chris
> On Mar 1, 2017, at 9:31 AM, Peter Maloney 
> <peter.malo...@brockmann-consult.de> wrote:
> 
> On 03/01/17 14:41, Heller, Chris wrote:
>> That is a good question, and I'm not sure how to answer. The journal is on 
>> its own volume, and is not a symlink. Also how does one flush the journal? 
>> That seems like an important step when bringing down a cluster safely.
>> 
> You only need to flush the journal if you are removing it from the osd, 
> replacing it with a different journal.
> 
> So since your journal is on its own, then you need either a symlink in the 
> osd directory named "journal" which points to the device (ideally not 
> /dev/sdx but /dev/disk/by-.../), or you put it in the ceph.conf.
> 
> And since it said you have a non-block journal now, it probably means there 
> is a file... you should remove that (rename it to journal.junk until you're 
> sure it's not an important file, and delete it later).
> 
>> -Chris
>> 
>>> On Mar 1, 2017, at 8:37 AM, Peter Maloney 
>>> <peter.malo...@brockmann-consult.de 
>>> <mailto:peter.malo...@brockmann-consult.de>> wrote:
>>> 
>>> On 02/28/17 18:55, Heller, Chris wrote:
>>>> Quick update. So I'm trying out the procedure as documented here.
>>>> 
>>>> So far I've:
>>>> 
>>>> 1. Stopped ceph-mds
>>>> 2. set noout, norecover, norebalance, nobackfill
>>>> 3. Stopped all ceph-osd
>>>> 4. Stopped ceph-mon
>>>> 5. Installed new OS
>>>> 6. Started ceph-mon
>>>> 7. Started all ceph-osd
>>>> 
>>>> This is where I've stopped. All but one OSD came back online. One has this 
>>>> backtrace:
>>>> 
>>>> 2017-02-28 17:44:54.884235 7fb2ba3187c0 -1 journal FileJournal::_open: 
>>>> disabling aio for non-block journal.  Use journal_force_aio to force use 
>>>> of aio anyway
>>> Are the journals inline? or separate? If they're separate, the above means 
>>> the journal symlink/config is missing, so it would possibly make a new 
>>> journal, which would be bad if you didn't flush the old journal before.
>>> 
>>> And also just one osd is easy enough to replace (which I wouldn't do until 
>>> the cluster settled down and recovered). So it's lame for it to be broken, 
>>> but it's still recoverable if that's the only issue.
>> 
> 
> 



smime.p7s
Description: S/MIME cryptographic signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster

2017-03-01 Thread Heller, Chris
That is a good question, and I'm not sure how to answer. The journal is on its 
own volume, and is not a symlink. Also how does one flush the journal? That 
seems like an important step when bringing down a cluster safely.

-Chris

> On Mar 1, 2017, at 8:37 AM, Peter Maloney 
> <peter.malo...@brockmann-consult.de> wrote:
> 
> On 02/28/17 18:55, Heller, Chris wrote:
>> Quick update. So I'm trying out the procedure as documented here.
>> 
>> So far I've:
>> 
>> 1. Stopped ceph-mds
>> 2. set noout, norecover, norebalance, nobackfill
>> 3. Stopped all ceph-osd
>> 4. Stopped ceph-mon
>> 5. Installed new OS
>> 6. Started ceph-mon
>> 7. Started all ceph-osd
>> 
>> This is where I've stopped. All but one OSD came back online. One has this 
>> backtrace:
>> 
>> 2017-02-28 17:44:54.884235 7fb2ba3187c0 -1 journal FileJournal::_open: 
>> disabling aio for non-block journal.  Use journal_force_aio to force use of 
>> aio anyway
> Are the journals inline? or separate? If they're separate, the above means 
> the journal symlink/config is missing, so it would possibly make a new 
> journal, which would be bad if you didn't flush the old journal before.
> 
> And also just one osd is easy enough to replace (which I wouldn't do until 
> the cluster settled down and recovered). So it's lame for it to be broken, 
> but it's still recoverable if that's the only issue.



smime.p7s
Description: S/MIME cryptographic signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Antw: Safely Upgrading OS on a live Ceph Cluster

2017-03-01 Thread Heller, Chris
In my case the version will be identical, but I might have to take this 
node-by-node approach if I can't stabilize the more general shutdown/bring-up approach. 
There are 192 OSDs in my cluster, so it will take a while to go node by node, 
unfortunately.

-Chris

> On Mar 1, 2017, at 2:50 AM, Steffen Weißgerber <weissgerb...@ksnb.de> wrote:
> 
> Hello,
> 
> some time ago I upgraded our 6 node cluster (0.94.9) running on Ubuntu from 
> Trusty
> to Xenial.
> 
> The problem here was that the OS update also upgrades Ceph, which we did not want 
> in the same step, because then we would have had to upgrade all nodes at the same time.
> 
> Therefore we did it node by node, first freeing the OSDs on the node by 
> setting their weight to 0.
> 
> After the OS update, configuring the right Ceph version for our setup, and testing 
> the reboot so that all components start up correctly, we set the OSD weights back 
> to the normal value so that the cluster rebalanced.
> 
> With this procedure the cluster was always up.
> 
> Regards
> 
> Steffen
> 
> 
>>>> "Heller, Chris" <chel...@akamai.com> schrieb am Montag, 27. Februar 2017 um
> 18:01:
>> I am attempting an operating system upgrade of a live Ceph cluster. Before I 
>> go an screw up my production system, I have been testing on a smaller 
>> installation, and I keep running into issues when bringing the Ceph FS 
>> metadata server online.
>> 
>> My approach here has been to store all Ceph critical files on non-root 
>> partitions, so the OS install can safely proceed without overwriting any of 
>> the Ceph configuration or data.
>> 
>> Here is how I proceed:
>> 
>> First I bring down the Ceph FS via `ceph mds cluster_down`.
>> Second, to prevent OSDs from trying to repair data, I run `ceph osd set 
>> noout`
>> Finally I stop the ceph processes in the following order: ceph-mds, 
>> ceph-mon, 
>> ceph-osd
>> 
>> Note my cluster has 1 mds and 1 mon, and 7 osd.
>> 
>> I then install the new OS and then bring the cluster back up by walking the 
>> steps in reverse:
>> 
>> First I start the ceph processes in the following order: ceph-osd, ceph-mon, 
>> ceph-mds
>> Second I restore OSD functionality with `ceph osd unset noout`
>> Finally I bring up the Ceph FS via `ceph mds cluster_up`
>> 
>> Everything works smoothly except the Ceph FS bring up. The MDS starts in the 
>> active:replay state and eventually crashes with the following backtrace:
>> 
>> starting mds.cuba at :/0
>> 2017-02-27 16:56:08.233680 7f31daa3b7c0 -1 mds.-1.0 log_to_monitors 
>> {default=true}
>> 2017-02-27 16:56:08.537714 7f31d30df700 -1 mds.0.sessionmap _load_finish got 
>> (2) No such file or directory
>> mds/SessionMap.cc: In function 'void 
>> SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f31d30df700 time 
>> 2017-02-27 16:56:08.537739
>> mds/SessionMap.cc: 98: FAILED assert(0 == "failed to 
>> load sessionmap")
>> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x8b) [0x98bb4b]
>> 2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
>> 3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
>> 4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
>> 5: (()+0x8192) [0x7f31d9c8f192]
>> 6: (clone()+0x6d) [0x7f31d919c51d]
>> NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
>> interpret this.
>> 2017-02-27 16:56:08.538493 7f31d30df700 -1 mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, 
>> ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739
>> mds/SessionMap.cc: 98: FAILED assert(0 == "failed to 
>> load sessionmap")
>> 
>> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x8b) [0x98bb4b]
>> 2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
>> 3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
>> 4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
>> 5: (()+0x8192) [0x7f31d9c8f192]
>> 6: (clone()+0x6d) [0x7f31d919c51d]
>> NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
>> interpret this.
>> 
>> -106> 2017-02-27 16:56:08.233680 7f31daa3b7c0 -1 mds.-1.0 log_to_monitors 
>> {default=true}
>>   -1>

Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster

2017-02-28 Thread Heller, Chris
13: (ReplicatedBackend::do_pull(std::tr1::shared_ptr)+0xd6) 
[0x974836]
 14: (ReplicatedBackend::handle_message(std::tr1::shared_ptr)+0x3ed) 
[0x97b89d]
 15: (ReplicatedPG::do_request(std::tr1::shared_ptr&, 
ThreadPool::TPHandle&)+0x19d) [0x80b84d]
 16: (OSD::dequeue_op(boost::intrusive_ptr, 
std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x3cd) [0x67720d]
 17: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0x2f9) [0x6776f9]
 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x85c) 
[0xb3c7bc]
 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb3e8b0]
 20: (()+0x8192) [0x7fb2b9166192]
 21: (clone()+0x6d) [0x7fb2b867351d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

I'm not sure why I would have encountered an issue since the data was at rest 
before the install (unless there is another step that was needed).

Currently the cluster is recovering objects, although `ceph osd stat` shows 
that the 'norecover' flag is still set.

I'm going to wait out the recovery and see if the Ceph FS is OK. That would be 
huge if it is. But I am curious why I lost an OSD, and why recovery is 
happening with 'norecover' still set.

-Chris

> On Feb 28, 2017, at 4:51 AM, Peter Maloney 
> <peter.malo...@brockmann-consult.de> wrote:
> 
> On 02/27/17 18:01, Heller, Chris wrote:
>> First I bring down the Ceph FS via `ceph mds cluster_down`.
>> Second, to prevent OSDs from trying to repair data, I run `ceph osd set 
>> noout`
>> Finally I stop the ceph processes in the following order: ceph-mds, 
>> ceph-mon, ceph-osd
>> 
> This is the wrong procedure. Likely it will just involve more cpu and memory 
> usage on startup, not broken behavior (unless you run out of RAM). After all, 
> it has to recover from power outages, so any order ought to work, just some 
> are better. 
> 
> I am unsure on the cephfs part... but I would think you have it right, except 
> I wouldn't do `ceph mds cluster_down` (but don't know if it's right to)... 
> maybe try without that. I never used that except when I want to remove all 
> mds nodes and destroy all the cephfs data. And I didn't find any docs on what 
> it really even does, except it won't let you remove all your mds and destroy 
> the cephfs without it.
> 
> The correct procedure as far as I know is:
> 
> ## 1. cluster must be healthy and to set noout, norecover, norebalance, 
> nobackfill
> ceph -s
> for s in noout norecover norebalance nobackfill; do ceph osd set $s; done
> 
> ## 2. shut down all OSDs and then the all MONs - not MONs before OSDs
> # all nodes
> service ceph stop osd
> 
> # see that all osds are down
> ceph osd tree
> 
> # all nodes again
> ceph -s
> service ceph stop
> 
> ## 3. start MONs before OSDs. 
> # This already happens on boot per node, but not cluster wide. But with the 
> flags set, it likely doesn't matter. It seems unnecessary on a small cluster.
> 
> ## 4. unset the flags
> # see that all osds are up
> ceph -s
> ceph osd tree
> for s in noout norecover norebalance nobackfill; do ceph osd unset $s; done
> 
> 
>> Note my cluster has 1 mds and 1 mon, and 7 osd.
>> 
>> I then install the new OS and then bring the cluster back up by walking the 
>> steps in reverse:
>> 
>> First I start the ceph processes in the following order: ceph-osd, ceph-mon, 
>> ceph-mds
>> Second I restore OSD functionality with `ceph osd unset noout`
>> Finally I bring up the Ceph FS via `ceph mds cluster_up`
>> 
> adjust those steps too... mons start first
> 
>> Everything works smoothly except the Ceph FS bring up.[...snip...]
> 
>> How can I safely stop a Ceph cluster, so that it will cleanly start back up 
>> again?
>> 
> Don't know about the cephfs problem... all I can say is try the right general 
> procedure and see if the result changes.
> 
> (and I'd love to cite a source on why that's the right procedure and yours 
> isn't, but don't know what to cite... for 
> example http://docs.ceph.com/docs/jewel/rados/operations/operating/#id8 
> says to use -a in the arguments, but doesn't say whether that's systemd or 
> not, or what it does exactly. I have only seen it discussed a few places, 
> like the mailing list and IRC)
>> -Chris
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com

[ceph-users] Safely Upgrading OS on a live Ceph Cluster

2017-02-27 Thread Heller, Chris
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: ceph_mds() [0x89984a]
2: (()+0x10350) [0x7f31d9c97350]
3: (gsignal()+0x39) [0x7f31d90d8c49]
4: (abort()+0x148) [0x7f31d90dc058]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
6: (()+0x5e6f6) [0x7f31d99e16f6]
7: (()+0x5e723) [0x7f31d99e1723]
8: (()+0x5e942) [0x7f31d99e1942]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) 
[0x98bd38]
10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
13: (()+0x8192) [0x7f31d9c8f192]
14: (clone()+0x6d) [0x7f31d919c51d]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

0> 2017-02-27 16:56:08.540155 7f31d30df700 -1 *** Caught signal (Aborted) **
in thread 7f31d30df700

ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: ceph_mds() [0x89984a]
2: (()+0x10350) [0x7f31d9c97350]
3: (gsignal()+0x39) [0x7f31d90d8c49]
4: (abort()+0x148) [0x7f31d90dc058]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
6: (()+0x5e6f6) [0x7f31d99e16f6]
7: (()+0x5e723) [0x7f31d99e1723]
8: (()+0x5e942) [0x7f31d99e1942]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) 
[0x98bd38]
10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
13: (()+0x8192) [0x7f31d9c8f192]
14: (clone()+0x6d) [0x7f31d919c51d]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

How can I safely stop a Ceph cluster, so that it will cleanly start back up 
again?

-Chris



smime.p7s
Description: S/MIME cryptographic signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] civetweb deamon dies on https port

2017-01-19 Thread Chris Sarginson
You look to have a typo in this line:

rgw_frontends = "civetweb port=8080s ssl_certificate=/etc/pki/tls/
cephrgw01.crt"

It would seem from the error it should be port=8080, not port=8080s.
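 
For what it's worth, a minimal sketch of that frontend line with TLS handled
elsewhere (terminated in front of radosgw, e.g. by HAProxy) would simply be:

    [client.rgw.cephrgw]
    rgw_frontends = "civetweb port=8080"

That is only a sketch based on your existing config; if you still want
civetweb itself to serve TLS, keep the ssl_certificate option and check the
jewel documentation/release notes for how that version expects the SSL port
to be written.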

On Thu, 19 Jan 2017 at 08:59 Iban Cabrillo  wrote:

> Dear cephers,
>I just finish the integration between radosgw
> (ceph-radosgw-10.2.5-0.el7.x86_64) and keystone.
>
>This is my ceph conf for radosgw:
>
> [client.rgw.cephrgw]
> host = cephrgw01
> rgw_frontends = "civetweb port=8080s
> ssl_certificate=/etc/pki/tls/cephrgw01.crt"
> rgw_zone = RegionOne
> keyring = /etc/ceph/ceph.client.rgw.cephrgw.keyring
> log_file = /var/log/ceph/client.rgw.cephrgw.log
> rgw_keystone_url = https://keystone:5000
> rgw_keystone_admin_user = 
> rgw_keystone_admin_password = YY
> rgw_keystone_admin_tenant = service
> rgw_keystone_accepted_roles = admin Member
> rgw keystone admin project = service
> rgw keystone admin domain = admin
> rgw keystone api version = 2
> rgw_s3_auth_use_keystone = true
> nss_db_path = /var/ceph/nss/
> rgw_keystone_verify_ssl = true
>
>   This seems to be working fine using the latest jewel version 10.2.5, but it seems
> that now I cannot listen on the secure port. The older version (10.2.3) was running
> fine with this rgw_frontends option, but now the server starts (I can access
> https://cephrgw01:8080/) and after a couple of minutes the radosgw daemon
> stops:
>
> error parsing int: 8080s: The option value '8080s' seems to be invalid
>
>   Is there any change on this parameter with the new jewel version
> (ceph-radosgw-10.2.5-0.el7.x86_64)?
>
> regards, I
>
>
>
>
> --
>
> 
> Iban Cabrillo Bartolome
> Instituto de Fisica de Cantabria (IFCA)
> Santander, Spain
> Tel: +34942200969 <+34%20942%2020%2009%2069>
> PGP PUBLIC KEY:
> http://pgp.mit.edu/pks/lookup?op=get=0xD9DF0B3D6C8C08AC
>
> 
> Bertrand Russell:*"El problema con el mundo es que los estúpidos están
> seguros de todo y los inteligentes están **llenos de dudas*"
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph.com

2017-01-16 Thread Chris Jones
The site looks great! Good job!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Monitoring

2017-01-13 Thread Chris Jones
Thanks.

What about 'NN ops > 32 sec' (blocked ops) type alerts? Does anyone monitor
for that type, and if so, what criteria do you use?
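 
To make the question concrete, the sort of check I have in mind is a crude
sketch like the one below (the thresholds are pure placeholders, not a
recommendation):

    # number of requests currently reported as blocked > 32 sec
    blocked=$(ceph health detail 2>/dev/null | grep -oE '[0-9]+ requests are blocked' | awk '{print $1}' | head -1)
    blocked=${blocked:-0}
    if [ "$blocked" -ge 100 ]; then
        echo critical
    elif [ "$blocked" -ge 1 ]; then
        echo warning
    else
        echo ok
    fi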

Thanks again!

On Fri, Jan 13, 2017 at 3:28 PM, David Turner <david.tur...@storagecraft.com
> wrote:

> We don't use many critical alerts (that will have our NOC wake up an
> engineer), but the main one that we do have is a check that tells us if
> there are 2 or more hosts with osds that are down.  We have clusters with
> 60 servers in them, so having an osd die and backfill off of isn't
> something to wake up for in the middle of the night, but having osds down
> on 2 servers is 1 osd away from data loss.  A quick reference to how to do
> this check in bash is below.
>
> hosts_with_down_osds=`ceph osd tree | grep 'host\|down' | grep -B1 down |
> grep host | wc -l`
> if [ $hosts_with_down_osds -ge 2 ]
> then
> echo critical
> elif [ $hosts_with_down_osds -eq 1 ]
> then
> echo warning
> elif [ $hosts_with_down_osds -eq 0 ]
> then
> echo ok
> else
> echo unknown
> fi
>
> --
>
> <https://storagecraft.com> David Turner | Cloud Operations Engineer | 
> StorageCraft
> Technology Corporation <https://storagecraft.com>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 <(801)%20871-2760> | Mobile: 385.224.2943
> <(385)%20224-2943>
>
> --
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
>
> --
>
> --
> *From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Chris
> Jones [cjo...@cloudm2.com]
> *Sent:* Friday, January 13, 2017 1:15 PM
> *To:* ceph-us...@ceph.com
> *Subject:* [ceph-users] Ceph Monitoring
>
> General question/survey:
>
> Those that have larger clusters, how are you doing alerting/monitoring?
> Meaning, do you trigger off of 'HEALTH_WARN', etc.? Not really talking about
> collectd-related metrics, but more about initial alerts of an issue or potential issue.
> What threshold do you use, basically? Just trying to get a pulse of what
> others are doing.
>
> Thanks in advance.
>
> --
> Best Regards,
> Chris Jones
> ​Bloomberg​
>
>
>
>


-- 
Best Regards,
Chris Jones

cjo...@cloudm2.com
(p) 770.655.0770
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Monitoring

2017-01-13 Thread Chris Jones
General question/survey:

Those that have larger clusters, how are you doing alerting/monitoring?
Meaning, do you trigger off of 'HEALTH_WARN', etc.? Not really talking about
collectd-related metrics, but more about initial alerts of an issue or potential issue.
What threshold do you use, basically? Just trying to get a pulse of what
others are doing.

Thanks in advance.

-- 
Best Regards,
Chris Jones
​Bloomberg​
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage system

2017-01-04 Thread Chris Jones
Based on this limited info: Object storage, if it's behind a proxy. We use Ceph
behind HAProxy and hardware load-balancers at Bloomberg. Our Chef recipes
are at https://github.com/ceph/ceph-chef and
https://github.com/bloomberg/chef-bcs. The chef-bcs cookbooks show the
HAProxy info.

Thanks,
Chris

On Wed, Jan 4, 2017 at 11:51 AM, Patrick McGarry <pmcga...@redhat.com>
wrote:

> Moving this to ceph-user list where it'll get some attention.
>
> On Thu, Dec 22, 2016 at 2:08 PM, SIBALA, SATISH <ss9...@att.com> wrote:
>
>> Hi,
>>
>>
>>
>> Could you please give me an recommendation on kind of Ceph storage to be
>> used with NGINX proxy server (Object / Block / FileSystem)?
>>
>>
>>
>> Best Regards
>>
>> Satish
>>
>>
>>
>>
>
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Best Regards,
Chris Jones

cjo...@cloudm2.com
(p) 770.655.0770
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS FAILED assert(dn->get_linkage()->is_null())

2016-12-09 Thread Chris Sarginson
Hi Goncalo,

In the end we ascertained that the assert was coming from reading corrupt
data in the mds journal.  We have followed the sections at the following
link (http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/) in order
down to (and including) MDS Table wipes (only wiping the "session" table in
the final step).  This resolved the problem we had with our mds asserting.
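 
For anyone following the same path, the rough command sequence behind those
sections, as we ran it (from memory, so verify each step against the doc
first, and only with all MDS daemons stopped), was along these lines:

    cephfs-journal-tool journal export backup.bin
    cephfs-journal-tool event recover_dentries summary
    cephfs-journal-tool journal reset
    cephfs-table-tool 0 reset session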

We have also run a cephfs scrub to validate the data (ceph daemon mds.0
scrub_path / recursive repair), which has resulted in a "metadata damage
detected" health warning.  This seems to perform a read of all objects
involved in the cephfs rados pools (anecdotally, the scan of the data pool
was much faster than the scan of the metadata pool itself).

We are now working with the output of "ceph tell mds.0 damage ls", and
looking at the following mailing list post as a starting point for
proceeding with that:
http://ceph-users.ceph.narkive.com/EfFTUPyP/how-to-fix-the-mds-damaged-issue

Chris

On Fri, 9 Dec 2016 at 19:26 Goncalo Borges <goncalo.bor...@sydney.edu.au>
wrote:

> Hi Sean, Rob.
>
> I saw on the tracker that you were able to resolve the mds assert by
> manually cleaning the corrupted metadata. Since I am also hitting that
> issue and I suspect that i will face an mds assert of the same type sooner
> or later, can you please explain a bit further what operations did you do
> to clean the problem?
> Cheers
> Goncalo
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Rob
> Pickerill [r.picker...@gmail.com]
> Sent: 09 December 2016 07:13
> To: Sean Redmond; John Spray
> Cc: ceph-users
> Subject: Re: [ceph-users] CephFS FAILED
> assert(dn->get_linkage()->is_null())
>
> Hi John / All
>
> Thank you for the help so far.
>
> To add a further point to Sean's previous email, I see this log entry
> before the assertion failure:
>
> -6> 2016-12-08 15:47:08.483700 7fb133dca700 12
> mds.0.cache.dir(1000a453344) remove_dentry [dentry
> #100/stray9/1000a453344/config [2,head] auth NULL (dver
> sion lock) v=540 inode=0 0x55e8664fede0]
> -5> 2016-12-08 15:47:08.484882 7fb133dca700 -1 mds/CDir.cc: In
> function 'void CDir::try_remove_dentries_for_stray()' thread 7fb133dca700
> time 2016-12-08
> 15:47:08.483704
> mds/CDir.cc: 699: FAILED assert(dn->get_linkage()->is_null())
>
> And I can reference this with:
>
> root@ceph-mon1:~/1000a453344# rados -p ven-ceph-metadata-1 listomapkeys
> 1000a453344.
> 1470734502_head
> config_head
>
> Would we also need to clean up this object, if so is there a safe we can
> do this?
>
> Rob
>
> On Thu, 8 Dec 2016 at 19:58 Sean Redmond <sean.redmo...@gmail.com> wrote:
> Hi John,
>
> Thanks for your pointers, I have extracted the onmap_keys and onmap_values
> for an object I found in the metadata pool called '600.' and
> dropped them at the below location
>
> https://www.dropbox.com/sh/wg6irrjg7kie95p/AABk38IB4PXsn2yINpNa9Js5a?dl=0
>
> Could you explain how is it possible to identify stray directory fragments?
>
> Thanks
>
> On Thu, Dec 8, 2016 at 6:30 PM, John Spray <jsp...@redhat.com> wrote:
> On Thu, Dec 8, 2016 at 3:45 PM, Sean Redmond <sean.redmo...@gmail.com> wrote:
> > Hi,
> >
> > We had no changes going on with the ceph pools or ceph servers at the
> time.
> >
> > We have however been hitting this in the last week and it maybe related:
> >
> > http://tracker.ceph.com/issues/17177
>
> Oh, okay -- so you've got corruption in your metadata pool as a result
> of hitting that issue, presumably.
>
> I think in the past people have managed to get past this by taking
> their MDSs offline and manually removing the omap entries in their
> stray directory fragments (i.e. using the `rados` cli on the objects
> starting "600.").
>
> John
>
>
>
> > Thanks
> >
> > On Thu, Dec 8, 2016 at 3:34 PM, John Spray <jsp...@redhat.com> wrote:
> >>
> >> On Thu, Dec 8, 2016 at 3:11 PM, Sean Redmond <sean.redmo...@gmail.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > I have a CephFS cluster that is currently unable to start the mds
> server
> >> > as
> >> > it is hitting an assert, the extract from the mds log is below, any
> >> > pointers
> >> > are welcome:
> >> >
> >> > ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> >> >
> >> >

Re: [ceph-users] rgw civetweb ssl official documentation?

2016-12-07 Thread Chris Jones
We terminate all of our TLS at the load-balancer. To make it simple, use
HAProxy in front of your single instance. BTW, the latest versions of
HAProxy can outperform expensive hardware LBs. We use both at Bloomberg.
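 
A stripped-down sketch of the HAProxy side, with radosgw/civetweb left on
plain HTTP port 8080 behind it (certificate path, names and addresses are
placeholders):

    frontend rgw_https
        bind *:443 ssl crt /etc/haproxy/certs/rgw.pem
        mode http
        default_backend rgw

    backend rgw
        mode http
        balance roundrobin
        server rgw01 10.0.0.11:8080 check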

-CJ

On Wed, Dec 7, 2016 at 1:44 PM, Puff, Jonathon <jonathon.p...@netapp.com>
wrote:

> There’s a few documents out around this subject, but I can’t find anything
> official.  Can someone point me to any official documentation for deploying
> this?   Other alternatives appear to be a HAproxy frontend.  Currently
> running 10.2.3 with a single radosgw.
>
>
>
> -JP
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Best Regards,
Chris Jones

cjo...@cloudm2.com
(p) 770.655.0770
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How are replicas spread in default crush configuration?

2016-11-23 Thread Chris Taylor
 

Kevin, 

After changing the pool size to 3, make sure the min_size is set to 1 to
allow 2 of the 3 hosts to be offline. 

http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values
[2] 
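 
For example, with a placeholder pool name: 

    ceph osd pool set <pool> size 3 
    ceph osd pool set <pool> min_size 1 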

How many MONs do you have and are they on the same OSD hosts? If you
have 3 MONs running on the OSD hosts and two go offline, you will not
have a quorum of MONs and I/O will be blocked. 

I would also check your CRUSH map. I believe you want to make sure your
rules have "step chooseleaf firstn 0 type host" and not "... type osd"
so that replicas are on different hosts. I have not had to make that
change before so you will want to read up on it first. Don't take my
word for it. 

http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-map-parameters
[3] 
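 
For reference, a host-failure-domain replicated rule in a decompiled CRUSH
map looks roughly like this (names and numbers will differ on your cluster,
so treat it as a sketch only and check the docs above): 

    rule replicated_host {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }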

Hope that helps. 

Chris 

On 2016-11-23 1:32 pm, Kevin Olbrich wrote: 

> Hi, 
> 
> just to make sure, as I did not find a reference in the docs: 
> Are replicas spread across hosts or "just" OSDs? 
> 
> I am using a 5 OSD cluster (4 pools, 128 pgs each) with size = 2. Currently 
> each OSD is a ZFS backed storage array. 
> Now I installed a server which is planned to host 4x OSDs (and setting size 
> to 3). 
> 
> I want to make sure we can resist two offline hosts (in terms of hardware). 
> Is my assumption correct? 
> 
> Mit freundlichen Grüßen / best regards,
> Kevin Olbrich. 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [1]
 

Links:
--
[1] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[2]
http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values
[3]
http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-map-parameters
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-14 Thread Chris Taylor
 

Maybe a long shot, but have you checked OSD memory usage? Are the OSD
hosts low on RAM and swapping to disk? 

I am not familiar with your issue, but thought that might cause it. 
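 
A quick way to get a feel for it on each OSD host (nothing Ceph-specific,
just standard tools): 

    free -m
    vmstat 1 5
    ps -eo rss,comm | awk '/ceph-osd/ {sum+=$1} END {print sum/1024 " MB resident in ceph-osd"}'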

Chris 

On 2016-11-14 3:29 pm, Brad Hubbard wrote: 

> Have you looked for clues in the output of dump_historic_ops ? 
> 
> On Tue, Nov 15, 2016 at 1:45 AM, Thomas Danan <thomas.da...@mycom-osi.com> 
> wrote:
> 
> Thanks Luis, 
> 
> Here are some answers  
> 
> Journals are not on SSD and collocated with OSD daemons host. 
> 
> We looked at the disk performance and did not notice anything wrong, with 
> acceptable rw latency < 20ms. 
> 
> No issue on the network as well from what we have seen. 
> 
> There is only one pool in the cluster so pool size = cluster size. 
> Replication factor is default: 3 and there is no erasure coding. 
> 
> We tried to stop deep scrub but without notable effect. 
> 
> We have one near-full OSD and are adding new DNs, but I doubt this could be the 
> issue. 
> 
> I doubt we are hitting cluster limits, but if that were the case, then adding new 
> DNs should help. Also, writing to the primary OSD is working fine, whereas writing 
> to secondary OSDs is often blocked. Lastly, recovery can be very fast (several 
> GB/s) and never seems to be blocked, whereas client RW IO is about several hundred 
> MB/s and is too often blocked when writing replicas. 
> 
> Thomas 
> 
> FROM: Luis Periquito [mailto:periqu...@gmail.com] 
> SENT: lundi 14 novembre 2016 16:23
> TO: Thomas Danan
> CC: ceph-users@lists.ceph.com
> SUBJECT: Re: [ceph-users] ceph cluster having blocke requests very frequently 
> 
> Without knowing the cluster architecture it's hard to know exactly what may 
> be happening. 
> 
> How is the cluster hardware? Where are the journals? How busy are the disks 
> (% time busy)? What is the pool size? Are these replicated or EC pools? 
> 
> Have you tried tuning the deep-scrub processes? Have you tried stopping them 
> altogether? Are the journals on SSDs? As a first feeling the cluster may be 
> hitting it's limits (also you have at least one OSD getting full)... 
> 
> On Mon, Nov 14, 2016 at 3:16 PM, Thomas Danan <thomas.da...@mycom-osi.com> 
> wrote: 
> 
> Hi All, 
> 
> We have a cluster in production who is suffering from intermittent blocked 
> request (25 requests are blocked > 32 sec). The blocked request occurrences 
> are frequent and global to all OSDs. 
> 
> From the OSD daemon logs, I can see related messages: 
> 
> 16-11-11 18:25:29.917518 7fd28b989700 0 log_channel(cluster) log [WRN] : slow 
> request 30.429723 seconds old, received at 2016-11-11 18:24:59.487570: 
> osd_op(client.2406272.1:336025615 rbd_data.66e952ae8944a.00350167 
> [set-alloc-hint object_size 4194304 write_size 4194304,write 0~524288] 
> 0.8d3c9da5 snapc 248=[248,216] ondisk+write e201514) currently waiting for 
> subops from 210,499,821 
> 
> So I guess the issue is related to the replication process when writing new 
> data on the cluster. Again, it is never the same secondary OSDs that are 
> displayed in the OSD daemon logs. 
> 
> As a result we are experiencing very important IO Write latency on ceph 
> client side (can be up to 1 hour !!!). 
> 
> We have checked network health as well as disk health but we were not able to 
> find any issue. 
> 
> Wanted to know if this issue has already been observed or if you have ideas on how to 
> investigate / work around the issue. 
> 
> Many thanks... 
> 
> Thomas 
> 
> The cluster is composed with 37DN and 851 OSDs and 5 MONs 
> 
> The Ceph clients are accessing the client with RBD 
> 
> Cluster is Hammer 0.94.5 version 
> 
> cluster 1a26e029-3734-4b0e-b86e-ca2778d0c990 
> 
> health HEALTH_WARN 
> 
> 25 requests are blocked > 32 sec 
> 
> 1 near full osd(s) 
> 
> noout flag(s) set 
> 
> monmap e3: 5 mons at 
> {NVMBD1CGK190D00=10.137.81.13:6789/0,nvmbd1cgy050d00=10.137.78.226:6789/0,nvmbd1cgy070d00=10.137.78.232:6789/0,nvmbd1cgy090d00=10.137.78.228:6789/0,nvmbd1cgy130d00=10.137.78.218:6789/0} 
> 
> election epoch 664, quorum 0,1,2,3,4 
> nvmbd1cgy130d00,nvmbd1cgy050d00,nvmbd1cgy090d00,nvmbd1cgy070d00,NVMBD1CGK190D00
>  
> 
> osdmap e205632: 851 osds: 850 up, 850 in 
> 
> flags noout 
> 
> pgmap v25919096: 10240 pgs, 1 pools, 197 TB data, 50664 kobjects 
> 
> 597 TB used, 233 TB / 831 TB avail 
> 
> 10208 active+clean 
> 
> 32 active+clean+scrubbing+deep 
> 
> client io 97822 kB/s rd, 205 MB/s wr, 2402 op/s 
> 
> THANK YOU 
> 
> THOMAS DANAN 
> 
> DIRECTOR OF PRODUCT DEVELOPMENT 
> 
> Office +33 1 49 03 77 53 
> 
> Mobile +33 7 76 35 76 43 
> 
> Skype thomas.danan 
> 
>

Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Heller, Chris
Just a thought, but since a directory tree is a first-class item in cephfs, 
could the wire protocol be extended with a “recursive delete” operation, 
specifically for cases like this?

On 10/14/16, 4:16 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:

On Fri, Oct 14, 2016 at 1:11 PM, Heller, Chris <chel...@akamai.com> wrote:
> Ok. Since I’m running through the Hadoop/ceph api, there is no syscall 
boundary so there is a simple place to improve the throughput here. Good to 
know, I’ll work on a patch…

Ah yeah, if you're in whatever they call the recursive tree delete
function you can unroll that loop a whole bunch. I forget where the
boundary is so you may need to go play with the JNI code; not sure.
-Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Heller, Chris
Ok. Since I’m running through the Hadoop/ceph api, there is no syscall boundary 
so there is a simple place to improve the throughput here. Good to know, I’ll 
work on a patch…
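 
In the meantime, a rough way to approximate Greg's suggestion below from the
shell, assuming the same tree is also reachable through a ceph-fuse or kernel
client mount (the mount point and path are placeholders, and the parallelism
is arbitrary):

    # one deleter per top-level subdirectory, eight at a time
    cd /mnt/cephfs/path/to/dir
    ls -d */ | xargs -P8 -I{} rm -rf "{}"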

On 10/14/16, 3:58 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:

On Fri, Oct 14, 2016 at 11:41 AM, Heller, Chris <chel...@akamai.com> wrote:
> Unfortunately, it was all in the unlink operation. Looks as if it took 
nearly 20 hours to remove the dir, roundtrip is a killer there. What can be 
done to reduce RTT to the MDS? Does the client really have to sequentially 
delete directories or can it have internal batching or parallelization?

It's bound by the same syscall APIs as anything else. You can spin off
multiple deleters; I'd either keep them on one client (if you want to
work within a single directory) or if using multiple clients assign
them to different portions of the hierarchy. That will let you
parallelize across the IO latency until you hit a cap on the MDS'
total throughput (should be 1-10k deletes/s based on latest tests
IIRC).
    -Greg

>
> -Chris
>
> On 10/13/16, 4:22 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:
>
    > On Thu, Oct 13, 2016 at 12:44 PM, Heller, Chris <chel...@akamai.com> 
wrote:
> > I have a directory I’ve been trying to remove from cephfs (via
> > cephfs-hadoop), the directory is a few hundred gigabytes in size and
> > contains a few million files, but not in a single sub directory. I 
startd
> > the delete yesterday at around 6:30 EST, and it’s still 
progressing. I can
> > see from (ceph osd df) that the overall data usage on my cluster is
> > decreasing, but at the rate its going it will be a month before the 
entire
> > sub directory is gone. Is a recursive delete of a directory known 
to be a
> > slow operation in CephFS or have I hit upon some bad configuration? 
What
> > steps can I take to better debug this scenario?
>
> Is it the actual unlink operation taking a long time, or just the
> reduction in used space? Unlinks require a round trip to the MDS
> unfortunately, but you should be able to speed things up at least some
> by issuing them in parallel on different directories.
>
> If it's the used space, you can let the MDS issue more RADOS delete
> ops by adjusting the "mds max purge files" and "mds max purge ops"
> config values.
> -Greg
>
>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Heller, Chris
Unfortunately, it was all in the unlink operation. Looks as if it took nearly 
20 hours to remove the dir, roundtrip is a killer there. What can be done to 
reduce RTT to the MDS? Does the client really have to sequentially delete 
directories or can it have internal batching or parallelization?

-Chris

On 10/13/16, 4:22 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:

On Thu, Oct 13, 2016 at 12:44 PM, Heller, Chris <chel...@akamai.com> wrote:
> I have a directory I’ve been trying to remove from cephfs (via
> cephfs-hadoop), the directory is a few hundred gigabytes in size and
> contains a few million files, but not in a single sub directory. I started
> the delete yesterday at around 6:30 EST, and it’s still progressing. I can
> see from (ceph osd df) that the overall data usage on my cluster is
> decreasing, but at the rate its going it will be a month before the entire
> sub directory is gone. Is a recursive delete of a directory known to be a
> slow operation in CephFS or have I hit upon some bad configuration? What
> steps can I take to better debug this scenario?

Is it the actual unlink operation taking a long time, or just the
reduction in used space? Unlinks require a round trip to the MDS
unfortunately, but you should be able to speed things up at least some
by issuing them in parallel on different directories.

If it's the used space, you can let the MDS issue more RADOS delete
ops by adjusting the "mds max purge files" and "mds max purge ops"
config values.
-Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stuck at "Setting up ceph-osd (10.2.3-1~bpo80+1)"

2016-10-13 Thread Chris Murray

On 13/10/2016 11:49, Henrik Korkuc wrote:
Is apt/dpkg doing something now? Is the problem repeatable, e.g. by 
killing the upgrade and starting again? Are there any stuck systemctl 
processes?


I had no problems upgrading 10.2.x clusters to 10.2.3

On 16-10-13 13:41, Chris Murray wrote:

On 22/09/2016 15:29, Chris Murray wrote:

Hi all,

Might anyone be able to help me troubleshoot an "apt-get dist-upgrade"
which is stuck at "Setting up ceph-osd (10.2.3-1~bpo80+1)"?

I'm upgrading from 10.2.2. The two OSDs on this node are up, and think
they are version 10.2.3, but the upgrade doesn't appear to be finishing
... ?

Thank you in advance,
Chris


Hi,

Are there possibly any pointers to help troubleshoot this? I've got a 
test system on which the same thing has happened.


The cluster's status is "HEALTH_OK" before starting. I'm running 
Debian Jessie.


dpkg.log only has the following:

2016-10-13 11:37:25 configure ceph-osd:amd64 10.2.3-1~bpo80+1 
2016-10-13 11:37:25 status half-configured ceph-osd:amd64 
10.2.3-1~bpo80+1


At this point, the upgrade gets stuck and doesn't go any further. 
Where could I look for the next clue?


Thanks,

Chris


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Thank you Henrik, I see it's a systemctl process that's stuck.

It is reproducible for me on every run of  dpkg --configure -a

And, indeed, reproducible across two separate machines.

I'll pursue the stuck "/bin/systemctl start ceph-osd.target".
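 
For the record, the things I'm going to look at first are just the standard
systemd angles (nothing Ceph-specific):

    systemctl list-jobs
    systemctl status ceph-osd.target
    journalctl -xe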

Thanks again,
Chris

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs slow delete

2016-10-13 Thread Heller, Chris
I have a directory I’ve been trying to remove from cephfs (via cephfs-hadoop), 
the directory is a few hundred gigabytes in size and contains a few million 
files, but not in a single sub directory. I startd the delete yesterday at 
around 6:30 EST, and it’s still progressing. I can see from (ceph osd df) that 
the overall data usage on my cluster is decreasing, but at the rate its going 
it will be a month before the entire sub directory is gone. Is a recursive 
delete of a directory known to be a slow operation in CephFS or have I hit upon 
some bad configuration? What steps can I take to better debug this scenario?

-Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stuck at "Setting up ceph-osd (10.2.3-1~bpo80+1)"

2016-10-13 Thread Chris Murray

On 22/09/2016 15:29, Chris Murray wrote:

Hi all,

Might anyone be able to help me troubleshoot an "apt-get dist-upgrade"
which is stuck at "Setting up ceph-osd (10.2.3-1~bpo80+1)"?

I'm upgrading from 10.2.2. The two OSDs on this node are up, and think
they are version 10.2.3, but the upgrade doesn't appear to be finishing
... ?

Thank you in advance,
Chris


Hi,

Are there possibly any pointers to help troubleshoot this? I've got a 
test system on which the same thing has happened.


The cluster's status is "HEALTH_OK" before starting. I'm running Debian 
Jessie.


dpkg.log only has the following:

2016-10-13 11:37:25 configure ceph-osd:amd64 10.2.3-1~bpo80+1 
2016-10-13 11:37:25 status half-configured ceph-osd:amd64 10.2.3-1~bpo80+1

At this point, the upgrade gets stuck and doesn't go any further. Where 
could I look for the next clue?


Thanks,

Chris


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New OSD Nodes, pgs haven't changed state

2016-10-11 Thread Chris Taylor
 

I often see on this list that peering issues are related to networking
and MTU sizes. Perhaps the HP 5400's or the managed switches did not
have jumbo frames enabled? 

Hope that helps you determine the issue in case you want to move the
nodes back to the other location. 
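 
If it helps, a quick way to verify the MTU end to end between two nodes (the
peer address is a placeholder; 8972 = 9000 minus the 28 bytes of IP/ICMP
headers): 

    ip link show
    ping -M do -s 8972 <peer-ip>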

Chris 

On 2016-10-11 2:30 pm, Mike Jacobacci wrote: 

> Hi Goncalo, 
> 
> Thanks for your reply! I finally figured out that our issue was with the 
> physical setup of the nodes. We had one OSD and MON node in our office and 
> the others are co-located at our ISP. We have an almost dark fiber going 
> between our two buildings connected via HP 5400's, but it really isn't since 
> there are some switches in between doing VLAN rewriting (ISP managed). 
> 
> Even though all the interfaces were communicating without issue, no data 
> would move across the nodes. I ended up moving all nodes into the same rack 
> and data immediately started moving and the cluster is now working! So it 
> seems the storage traffic was being dropped/blocked by something on our ISP 
> side. 
> 
> Cheers, 
> Mike 
> 
> On Mon, Oct 10, 2016 at 5:22 PM, Goncalo Borges 
> <goncalo.bor...@sydney.edu.au> wrote:
> 
>> Hi Mike...
>> 
>> I was hoping that someone with a bit more experience would answer you since 
>> I never had similar situation. So, I'll try to step in and help.
>> 
>> The peering process means that the OSDs are agreeing on the state of objects 
>> in the PGs they share. The peering process can take some time and is a hard 
>> operation to execute from a ceph point of view, especially if a lot of 
>> peering happens at the same time. This is one of the reasons why pg 
>> increases should also be done in very small steps (normally increases of 256 pgs).
>> 
>> Is your cluster slowly decreasing the number of pgs in peering, and the 
>> number of active pgs increasing? If you see no evolution at all after this 
>> time, you may have a problem.
>> 
>> pgs which do not leave the peering state may be because:
>> - incorrect crush map
>> - issues in osds
>> - issues with the network
>> 
>> Check that your network is working as expected and that you do not have 
>> firewalls blocking traffic and so on.
>> 
>> A pg query for one of those peering pgs may provide some further information 
>> about what could be wrong.
>> 
>> Looking to osd logs may also show a bit of light.
>> 
>> Cheers
>> Goncalo
>> 
>> 
>> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Mike 
>> Jacobacci [mi...@flowjo.com]
>> Sent: 10 October 2016 01:55
>> To: ceph-us...@ceph.com
>> Subject: [ceph-users] New OSD Nodes, pgs haven't changed state
>> 
>> Hi,
>> 
>> Yesterday morning I added two more OSD nodes and changed the crushmap from 
>> disk to node. It looked to me like everything went ok besides some disks 
>> missing that I can re-add later, but the cluster status hasn't changed since 
>> then. Here is the output of ceph -w:
>> 
>> cluster 395fb046-0062-4252-914c-013258c5575c
>> health HEALTH_ERR
>> 1761 pgs are stuck inactive for more than 300 seconds
>> 1761 pgs peering
>> 1761 pgs stuck inactive
>> 8 requests are blocked > 32 sec
>> crush map has legacy tunables (require bobtail, min is firefly)
>> monmap e2: 3 mons at {birkeland=192.168.10.190:6789/0,immanuel=192.168.10.125:6789/0,peratt=192.168.10.187:6789/0}
>> 
>> election epoch 14, quorum 0,1,2 immanuel,peratt,birkeland
>> osdmap e186: 26 osds: 26 up, 26 in; 1796 remapped pgs
>> flags sortbitwise
>> pgmap v6599413: 1796 pgs, 4 pools, 1343 GB data, 336 kobjects
>> 4049 GB used, 92779 GB / 96829 GB avail
>> 1761 remapped+peering
>> 35 active+clean
>> 2016-10-09 07:00:00.000776 mon.0 [INF] HEALTH_ERR; 1761 pgs are stuck 
>> inactive for more than 300 seconds; 1761 pgs peering; 1761 pgs stuck 
>> inactive; 8 requests are blocked > 32 sec; crush map has legacy tunables 
>> (require bobtail, min is firefly)
>> 
>> I have legacy tunables on since Ceph is only backing our Xenserver 
>> infrastructure. The number of pgs remapping and clean haven't changed and 
>> there isn't seem to be that much data... Is this normal behavior?
>> 
>> Here is my crushmap:
>> 
>> # begin crush map
>> tunable choose_local_tries 0
>> tunable choose_local_fallback_tries 0
>> tunable choose_total_tries 50
>> tunable chooseleaf_descend_once

[ceph-users] Stuck at "Setting up ceph-osd (10.2.3-1~bpo80+1)"

2016-09-22 Thread Chris Murray
Hi all,

Might anyone be able to help me troubleshoot an "apt-get dist-upgrade"
which is stuck at "Setting up ceph-osd (10.2.3-1~bpo80+1)"?

I'm upgrading from 10.2.2. The two OSDs on this node are up, and think
they are version 10.2.3, but the upgrade doesn't appear to be finishing
... ?

Thank you in advance,
Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
So just to put more info out there, here is what I’m seeing with a Spark/HDFS 
client:

2016-09-21 20:09:25.076595 7fd61c16f700  0 -- 192.168.1.157:0/634334964 >> 
192.168.1.190:6802/32183 pipe(0x7fd5fcef8ca0 sd=66 :53864 s=2 pgs=50445 cs=1 
l=0 c=0x7fd5fdd371d0).fault, initiating reconnect
2016-09-21 20:09:25.077328 7fd60c579700  0 -- 192.168.1.157:0/634334964 >> 
192.168.1.190:6802/32183 pipe(0x7fd5fcef8ca0 sd=66 :53994 s=1 pgs=50445 cs=2 
l=0 c=0x7fd5fdd371d0).connect got RESETSESSION
2016-09-21 20:09:25.077429 7fd60fd80700  0 client.585194220 
ms_handle_remote_reset on 192.168.1.190:6802/32183
2016-09-21 20:20:55.990686 7fd61c16f700  0 -- 192.168.1.157:0/634334964 >> 
192.168.1.190:6802/32183 pipe(0x7fd5fcef8ca0 sd=66 :53994 s=2 pgs=50630 cs=1 
l=0 c=0x7fd5fdd371d0).fault, initiating reconnect
2016-09-21 20:20:55.990890 7fd60c579700  0 -- 192.168.1.157:0/634334964 >> 
192.168.1.190:6802/32183 pipe(0x7fd5fcef8ca0 sd=66 :53994 s=1 pgs=50630 cs=2 
l=0 c=0x7fd5fdd371d0).fault
2016-09-21 20:21:09.385228 7fd60c579700  0 -- 192.168.1.157:0/634334964 >> 
192.168.1.154:6800/17142 pipe(0x7fd6401e8160 sd=184 :39160 s=1 pgs=0 cs=0 l=0 
c=0x7fd6400433c0).fault

And here is its session info from ‘session ls’:

{
"id": 585194220,
"num_leases": 0,
"num_caps": 16385,
"state": "open",
"replay_requests": 0,
"reconnecting": false,
"inst": "client.585194220 192.168.1.157:0\/634334964",
"client_metadata": {
"ceph_sha1": "d56bdf93ced6b80b07397d57e3fa68fe68304432",
"ceph_version": "ceph version 0.94.7 
(d56bdf93ced6b80b07397d57e3fa68fe68304432)",
"entity_id": "hdfs.user",
"hostname": "a192-168-1-157.d.a.com"
}
},

-Chris

On 9/21/16, 9:27 PM, "Heller, Chris" <chel...@akamai.com> wrote:

I also went and bumped mds_cache_size up to 1 million… still seeing cache 
pressure, but I might just need to evict those clients…

On 9/21/16, 9:24 PM, "Heller, Chris" <chel...@akamai.com> wrote:

What is the interesting value in ‘session ls’? Is it ‘num_leases’ or 
‘num_caps’? Leases appear to be, on average, 1, but caps seem to be 16385 for 
many, many clients!

-Chris

On 9/21/16, 9:22 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:

On Wed, Sep 21, 2016 at 6:13 PM, Heller, Chris <chel...@akamai.com> 
wrote:
> I’m suspecting something similar, we have millions of files and 
can read a huge subset of them at a time, presently the client is Spark 1.5.2 
which I suspect is leaving the closing of file descriptors up to the garbage 
collector. That said, I’d like to know if I could verify this theory using the 
ceph tools. I’ll try upping “mds cache size”, are there any other configuration 
settings I might adjust to perhaps ease the problem while I track it down in 
the HDFS tools layer?

That's the big one. You can also go through the admin socket 
commands
for things like "session ls" that will tell you how many files the
client is holding on to and compare.

>
> -Chris
>
> On 9/21/16, 4:34 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:
>
> On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris 
<chel...@akamai.com> wrote:
> > Ok. I just ran into this issue again. The mds rolled after 
many clients were failing to relieve cache pressure.
>
> That definitely could have had something to do with it, if 
say they
> overloaded the MDS so much it got stuck in a directory read 
loop.
> ...actually now I come to think of it, I think there was some 
problem
> with Hadoop not being nice about closing files and so forcing 
clients
> to keep them pinned, which will make the MDS pretty unhappy 
if they're
> holding more than it's configured for.
>
> >
> > Now here is the result of `ceph –s`
> >
> > # ceph -s
> > cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
> >  health HEALTH_OK
> >  monmap e1: 5 mons at 
{a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0}
> > election epoch 130, quorum 0,1,2,3,4 
a154,a155,a189,a190,a191
> >  mdsmap e18

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
I also went and bumped mds_cache_size up to 1 million… still seeing cache 
pressure, but I might just need to evict those clients…
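 
For anyone else poking at the same numbers, something along these lines
should show the live value, bump it without a restart, and list how many caps
each session holds (the mds name is from my cluster above; jq is assumed to
be installed):

    ceph daemon mds.a190 config show | grep mds_cache_size
    ceph tell mds.a190 injectargs '--mds_cache_size 1000000'
    ceph daemon mds.a190 session ls | jq '.[] | {id: .id, caps: .num_caps}'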

On 9/21/16, 9:24 PM, "Heller, Chris" <chel...@akamai.com> wrote:

What is the interesting value in ‘session ls’? Is it ‘num_leases’ or 
‘num_caps’? Leases appear to be, on average, 1, but caps seem to be 16385 for 
many, many clients!

-Chris

On 9/21/16, 9:22 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:

On Wed, Sep 21, 2016 at 6:13 PM, Heller, Chris <chel...@akamai.com> 
wrote:
> I’m suspecting something similar, we have millions of files and can 
read a huge subset of them at a time, presently the client is Spark 1.5.2 which 
I suspect is leaving the closing of file descriptors up to the garbage 
collector. That said, I’d like to know if I could verify this theory using the 
ceph tools. I’ll try upping “mds cache size”, are there any other configuration 
settings I might adjust to perhaps ease the problem while I track it down in 
the HDFS tools layer?

That's the big one. You can also go through the admin socket commands
for things like "session ls" that will tell you how many files the
client is holding on to and compare.

>
> -Chris
>
> On 9/21/16, 4:34 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:
>
> On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris 
<chel...@akamai.com> wrote:
> > Ok. I just ran into this issue again. The mds rolled after many 
clients were failing to relieve cache pressure.
>
> That definitely could have had something to do with it, if say 
they
> overloaded the MDS so much it got stuck in a directory read loop.
> ...actually now I come to think of it, I think there was some 
problem
> with Hadoop not being nice about closing files and so forcing 
clients
> to keep them pinned, which will make the MDS pretty unhappy if 
they're
> holding more than it's configured for.
>
> >
> > Now here is the result of `ceph –s`
> >
> > # ceph -s
> > cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
> >  health HEALTH_OK
> >  monmap e1: 5 mons at 
{a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0}
> > election epoch 130, quorum 0,1,2,3,4 
a154,a155,a189,a190,a191
> >  mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 
up:standby-replay, 3 up:standby
> >  osdmap e118886: 192 osds: 192 up, 192 in
> >   pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 
kobjects
> > 69601 GB used, 37656 GB / 104 TB avail
> >11309 active+clean
> >   13 active+clean+scrubbing
> >6 active+clean+scrubbing+deep
> >
> > And here are the ops in flight:
> >
> > # ceph daemon mds.a190 dump_ops_in_flight
> > {
> > "ops": [],
> > "num_ops": 0
> > }
> >
> > And a tail of the active mds log at debug_mds 5/5
> >
> > 2016-09-21 20:15:53.354226 7fce3b626700  4 mds.0.server 
handle_client_request client_request(client.585124080:17863 lookup 
#1/stream2store 2016-09-21 20:15:53.352390) v2
> > 2016-09-21 20:15:53.354234 7fce3b626700  5 mds.0.server session 
closed|closing|killing, dropping
>
> This is also pretty solid evidence that the MDS is zapping clients
> when they misbehave.
>
> You can increase "mds cache size" past its default 100k 
dentries and
> see if that alleviates (or just draws out) the problem.
> -Greg
>
> > 2016-09-21 20:15:54.867108 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 235) v1 from 
client.507429717
> > 2016-09-21 20:15:54.980907 7fce3851f700  2 mds.0.cache 
check_memory_usage total 1475784, rss 666432, heap 79712, malloc 584052 mmap 0, 
baseline 79712, buffers 0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 
caps per inode
> > 2016-09-21 20:15:54.980960 7fce3851f700  5 mds.0.bal mds.0 
epoch 38 load mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34>
> > 2016-09-

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
What is the interesting value in ‘session ls’? Is it ‘num_leases’ or ‘num_caps’? 
Leases appear to be, on average, 1, but caps seem to be 16385 for many, many 
clients!

-Chris

On 9/21/16, 9:22 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:

On Wed, Sep 21, 2016 at 6:13 PM, Heller, Chris <chel...@akamai.com> wrote:
> I’m suspecting something similar, we have millions of files and can read 
a huge subset of them at a time, presently the client is Spark 1.5.2 which I 
suspect is leaving the closing of file descriptors up to the garbage collector. 
That said, I’d like to know if I could verify this theory using the ceph tools. 
I’ll try upping “mds cache size”, are there any other configuration settings I 
might adjust to perhaps ease the problem while I track it down in the HDFS 
tools layer?

That's the big one. You can also go through the admin socket commands
for things like "session ls" that will tell you how many files the
client is holding on to and compare.

>
> -Chris
>
> On 9/21/16, 4:34 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:
>
> On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris <chel...@akamai.com> 
wrote:
> > Ok. I just ran into this issue again. The mds rolled after many 
clients were failing to relieve cache pressure.
>
> That definitely could have had something to do with it, if say they
> overloaded the MDS so much it got stuck in a directory read loop.
> ...actually now I come to think of it, I think there was some problem
> with Hadoop not being nice about closing files and so forcing clients
> to keep them pinned, which will make the MDS pretty unhappy if they're
> holding more than it's configured for.
>
> >
> > Now here is the result of `ceph –s`
> >
> > # ceph -s
> > cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
> >  health HEALTH_OK
> >  monmap e1: 5 mons at 
{a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0}
> > election epoch 130, quorum 0,1,2,3,4 
a154,a155,a189,a190,a191
> >  mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 
up:standby-replay, 3 up:standby
> >  osdmap e118886: 192 osds: 192 up, 192 in
> >   pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 
kobjects
> > 69601 GB used, 37656 GB / 104 TB avail
> >11309 active+clean
> >   13 active+clean+scrubbing
> >6 active+clean+scrubbing+deep
> >
> > And here are the ops in flight:
> >
> > # ceph daemon mds.a190 dump_ops_in_flight
> > {
> > "ops": [],
> > "num_ops": 0
> > }
> >
> > And a tail of the active mds log at debug_mds 5/5
> >
> > 2016-09-21 20:15:53.354226 7fce3b626700  4 mds.0.server 
handle_client_request client_request(client.585124080:17863 lookup 
#1/stream2store 2016-09-21 20:15:53.352390) v2
> > 2016-09-21 20:15:53.354234 7fce3b626700  5 mds.0.server session 
closed|closing|killing, dropping
>
> This is also pretty solid evidence that the MDS is zapping clients
> when they misbehave.
>
> You can increase "mds cache size" past its default 100k dentries and
> see if that alleviates (or just draws out) the problem.
> -Greg
>
> > 2016-09-21 20:15:54.867108 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 235) v1 from 
client.507429717
> > 2016-09-21 20:15:54.980907 7fce3851f700  2 mds.0.cache 
check_memory_usage total 1475784, rss 666432, heap 79712, malloc 584052 mmap 0, 
baseline 79712, buffers 0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 
caps per inode
> > 2016-09-21 20:15:54.980960 7fce3851f700  5 mds.0.bal mds.0 epoch 38 
load mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34>
> > 2016-09-21 20:15:55.247885 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 233) v1 from 
client.538555196
> > 2016-09-21 20:15:55.455566 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 365) v1 from 
client.507390467
> > 2016-09-21 20:15:55.807704 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 367) v1 from 
client.538485341
> >

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
I’m suspecting something similar, we have millions of files and can read a huge 
subset of them at a time, presently the client is Spark 1.5.2 which I suspect 
is leaving the closing of file descriptors up to the garbage collector. That 
said, I’d like to know if I could verify this theory using the ceph tools. I’ll 
try upping “mds cache size”, are there any other configuration settings I might 
adjust to perhaps ease the problem while I track it down in the HDFS tools 
layer?

-Chris

On 9/21/16, 4:34 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:

On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris <chel...@akamai.com> wrote:
> Ok. I just ran into this issue again. The mds rolled after many clients 
were failing to relieve cache pressure.

That definitely could have had something to do with it, if say they
overloaded the MDS so much it got stuck in a directory read loop.
...actually now I come to think of it, I think there was some problem
with Hadoop not being nice about closing files and so forcing clients
to keep them pinned, which will make the MDS pretty unhappy if they're
holding more than it's configured for.

>
> Now here is the result of `ceph –s`
>
> # ceph -s
> cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
>  health HEALTH_OK
>  monmap e1: 5 mons at 
{a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0}
> election epoch 130, quorum 0,1,2,3,4 a154,a155,a189,a190,a191
>  mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 up:standby-replay, 3 
up:standby
>  osdmap e118886: 192 osds: 192 up, 192 in
>   pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 kobjects
> 69601 GB used, 37656 GB / 104 TB avail
>11309 active+clean
>   13 active+clean+scrubbing
>6 active+clean+scrubbing+deep
>
> And here are the ops in flight:
>
> # ceph daemon mds.a190 dump_ops_in_flight
> {
> "ops": [],
> "num_ops": 0
> }
>
> And a tail of the active mds log at debug_mds 5/5
>
> 2016-09-21 20:15:53.354226 7fce3b626700  4 mds.0.server 
handle_client_request client_request(client.585124080:17863 lookup 
#1/stream2store 2016-09-21 20:15:53.352390) v2
> 2016-09-21 20:15:53.354234 7fce3b626700  5 mds.0.server session 
closed|closing|killing, dropping

This is also pretty solid evidence that the MDS is zapping clients
when they misbehave.

You can increase "mds cache size" past its default 100k dentries and
see if that alleviates (or just draws out) the problem.
-Greg

> 2016-09-21 20:15:54.867108 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 235) v1 from 
client.507429717
> 2016-09-21 20:15:54.980907 7fce3851f700  2 mds.0.cache check_memory_usage 
total 1475784, rss 666432, heap 79712, malloc 584052 mmap 0, baseline 79712, 
buffers 0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 caps per inode
> 2016-09-21 20:15:54.980960 7fce3851f700  5 mds.0.bal mds.0 epoch 38 load 
mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34>
> 2016-09-21 20:15:55.247885 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 233) v1 from 
client.538555196
> 2016-09-21 20:15:55.455566 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 365) v1 from 
client.507390467
> 2016-09-21 20:15:55.807704 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 367) v1 from 
client.538485341
> 2016-09-21 20:15:56.243462 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 189) v1 from 
client.538577596
> 2016-09-21 20:15:56.986901 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 232) v1 from 
client.507430372
> 2016-09-21 20:15:57.026206 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 364) v1 from 
client.491885158
> 2016-09-21 20:15:57.369281 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 364) v1 from 
client.507390682
> 2016-09-21 20:15:57.445687 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 364) v1 from 
client.538485996
> 2016-09-21 20:15:57.579268 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 364) v1 from 
client.538486021
> 2016-09-21 20:15:57.595568 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 364) v1 from 
client.5

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
Ok. I just ran into this issue again. The mds rolled after many clients were 
failing to relieve cache pressure.

Now here is the result of `ceph –s`

# ceph -s
cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
 health HEALTH_OK
 monmap e1: 5 mons at 
{a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0}
election epoch 130, quorum 0,1,2,3,4 a154,a155,a189,a190,a191
 mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 up:standby-replay, 3 
up:standby
 osdmap e118886: 192 osds: 192 up, 192 in
  pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 kobjects
69601 GB used, 37656 GB / 104 TB avail
   11309 active+clean
  13 active+clean+scrubbing
   6 active+clean+scrubbing+deep

And here are the ops in flight:

# ceph daemon mds.a190 dump_ops_in_flight
{
"ops": [],
"num_ops": 0
}

And a tail of the active mds log at debug_mds 5/5

2016-09-21 20:15:53.354226 7fce3b626700  4 mds.0.server handle_client_request 
client_request(client.585124080:17863 lookup #1/stream2store 2016-09-21 
20:15:53.352390) v2
2016-09-21 20:15:53.354234 7fce3b626700  5 mds.0.server session 
closed|closing|killing, dropping
2016-09-21 20:15:54.867108 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 235) v1 from client.507429717
2016-09-21 20:15:54.980907 7fce3851f700  2 mds.0.cache check_memory_usage total 
1475784, rss 666432, heap 79712, malloc 584052 mmap 0, baseline 79712, buffers 
0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 caps per inode
2016-09-21 20:15:54.980960 7fce3851f700  5 mds.0.bal mds.0 epoch 38 load 
mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34>
2016-09-21 20:15:55.247885 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 233) v1 from client.538555196
2016-09-21 20:15:55.455566 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 365) v1 from client.507390467
2016-09-21 20:15:55.807704 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 367) v1 from client.538485341
2016-09-21 20:15:56.243462 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 189) v1 from client.538577596
2016-09-21 20:15:56.986901 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 232) v1 from client.507430372
2016-09-21 20:15:57.026206 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.491885158
2016-09-21 20:15:57.369281 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.507390682
2016-09-21 20:15:57.445687 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.538485996
2016-09-21 20:15:57.579268 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.538486021
2016-09-21 20:15:57.595568 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.507390702
2016-09-21 20:15:57.604356 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.507390712
2016-09-21 20:15:57.693546 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.507390717
2016-09-21 20:15:57.819536 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.491885168
2016-09-21 20:15:57.894058 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.507390732
2016-09-21 20:15:57.983329 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.507390742
2016-09-21 20:15:58.077915 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.538486031
2016-09-21 20:15:58.141710 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.491885178
2016-09-21 20:15:58.159134 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.491885188

-Chris

On 9/21/16, 11:23 AM, "Heller, Chris" <chel...@akamai.com> wrote:

Perhaps related, I was watching the active mds with debug_mds set to 5/5, 
when I saw this in the log:

2016-09-21 15:13:26.067698 7fbaec248700  0 -- 192.168.1.196:6802/13581 >> 
192.168.1.238:0/3488321578 pipe(0x55db000 sd=49 :6802 s=2 pgs=2 cs=1 l=0 
c=0x5631ce0).fault with nothing to send, going to standby
2016-09-21 15:13:26.067717 7fbaf64ea700  0 -- 192.168.1.196:6802/13581 >> 
192.168.1.214:0/3252234463 pipe(0x54d10

[ceph-users] Ceph Rust Librados

2016-09-21 Thread Chris Jones
Ceph-rust for librados has been released. It's a Rust API covering all of
librados, implemented as a thin layer above the C APIs. It offers both
low-level direct access and higher-level Rust helpers that make working
directly with librados simple.

The official repo is:
https://github.com/ceph/ceph-rust

The Rust Crate is:
ceph-rust

Rust is a systems programming language that gives you the speed and
low-level access of C with the benefits of a higher-level language. The
main benefits of Rust are:
1. Speed
2. Prevents segfaults
3. Guarantees thread safety
4. Strong typing
5. Compiled

You can find out more at: https://www.rust-lang.org

Contributions are encouraged and welcomed.

This is the base for a number of larger Ceph-related projects. Updates to
the library will be frequent.

Also, there will be new Ceph tools coming soon and you can use the
following for RGW/S3 access from Rust: (Supports V2 and V4 signatures)
Crate: aws-sdk-rust - https://github.com/lambdastackio/aws-sdk-rust

Thanks,
Chris Jones
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
othing to send, going to standby
2016-09-21 15:13:26.067911 7fbb01196700  0 -- 192.168.1.196:6802/13581 >> 
192.168.1.149:0/821983967 pipe(0x1420b000 sd=104 :6802 s=2 pgs=2 cs=1 l=0 
c=0x2f92cf20).fault with nothing to send, going to standby
2016-09-21 15:13:26.068076 7fbafc64b700  0 -- 192.168.1.196:6802/13581 >> 
192.168.1.190:0/1817596579 pipe(0x36829000 sd=124 :6802 s=2 pgs=2 cs=1 l=0 
c=0x31f7a100).fault with nothing to send, going to standby
2016-09-21 15:13:26.068095 7fbafff84700  0 -- 192.168.1.196:6802/13581 >> 
192.168.1.140:0/1112150414 pipe(0x5679000 sd=125 :6802 s=2 pgs=2 cs=1 l=0 
c=0x41bc7e0).fault with nothing to send, going to standby
2016-09-21 15:13:26.068108 7fbb0de0e700  5 mds.0.953 handle_mds_map epoch 8471 
from mon.3
2016-09-21 15:13:26.068114 7fbaf890e700  0 -- 192.168.1.196:6802/13581 >> 
192.168.1.238:0/1422203298 pipe(0x2963 sd=44 :6802 s=2 pgs=2 cs=1 l=0 
c=0x3a740dc0).fault with nothing to send, going to standby
2016-09-21 15:13:26.068143 7fbae860c700  0 -- 192.168.1.196:6802/13581 >> 
192.168.1.217:0/1120082018 pipe(0x2a724000 sd=121 :6802 s=2 pgs=2 cs=1 l=0 
c=0x31f79e40).fault with nothing to send, going to standby
2016-09-21 15:13:26.068190 7fbb040c5700  0 -- 192.168.1.196:6802/13581 >> 
192.168.1.218:0/3945638891 pipe(0x50c sd=53 :6802 s=2 pgs=2 cs=1 l=0 
c=0x56f4420).fault with nothing to send, going to standby
2016-09-21 15:13:26.068200 7fbaf961b700  0 -- 192.168.1.196:6802/13581 >> 
192.168.1.144:0/2952053583 pipe(0x318dc000 sd=81 :6802 s=2 pgs=2 cs=1 l=0 
c=0x286fa840).fault with nothing to send, going to standby
2016-09-21 15:13:26.068232 7fbaf981d700  0 -- 192.168.1.196:6802/13581 >> 
192.168.1.159:0/1872775873 pipe(0x268d7000 sd=38 :6802 s=2 pgs=2 cs=1 l=0 
c=0x56f6940).fault with nothing to send, going to standby
2016-09-21 15:13:26.068253 7fbaeac32700  0 -- 192.168.1.196:6802/13581 >> 
192.168.1.186:0/4141441999 pipe(0x54e7000 sd=86 :6802 s=2 pgs=2 cs=1 l=0 
c=0x286fb760).fault with nothing to send, going to standby
2016-09-21 15:13:26.068275 7fbb0de0e700  1 mds.-1.-1 handle_mds_map i 
(192.168.1.196:6802/13581) dne in the mdsmap, respawning myself
2016-09-21 15:13:26.068289 7fbb0de0e700  1 mds.-1.-1 respawn
2016-09-21 15:13:26.068294 7fbb0de0e700  1 mds.-1.-1  e: 'ceph-mds'
2016-09-21 15:13:26.173095 7f689baa8780  0 ceph version 0.94.7 
(d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mds, pid 13581
2016-09-21 15:13:26.175664 7f689baa8780 -1 mds.-1.0 log_to_monitors 
{default=true}
2016-09-21 15:13:27.329181 7f68969e9700  1 mds.-1.0 handle_mds_map standby
2016-09-21 15:13:28.484148 7f68969e9700  1 mds.-1.0 handle_mds_map standby
2016-09-21 15:13:33.280376 7f68969e9700  1 mds.-1.0 handle_mds_map standby

On 9/21/16, 10:48 AM, "Heller, Chris" <chel...@akamai.com> wrote:

I’ll see if I can capture the output the next time this issue arises, but 
in general the output looks as if nothing is wrong. No OSDs are down, a ‘ceph 
health detail’ reports HEALTH_OK, and the mds server is in the up:active state; 
in general it’s as if nothing is wrong server side (at least from the summary).

-Chris

On 9/21/16, 10:46 AM, "Gregory Farnum" <gfar...@redhat.com> wrote:

On Wed, Sep 21, 2016 at 6:30 AM, Heller, Chris <chel...@akamai.com> 
wrote:
> I’m running a production 0.94.7 Ceph cluster, and have been seeing a
> periodic issue arise where in all my MDS clients will become stuck, 
and the
> fix so far has been to restart the active MDS (sometimes I need to 
restart
> the subsequent active MDS as well).
>
>
>
> These clients are using the cephfs-hadoop API, so there is no kernel 
client,
> or fuse api involved. When I see clients get stuck, there are messages
> printed to stderr like the following:
>
>
>
> 2016-09-21 10:31:12.285030 7fea4c7fb700  0 – 
192.168.1.241:0/1606648601 >>
> 192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 
l=0
> c=0x7feaa0a0c500).fault
>
>
>
> I’m at somewhat of a loss on where to begin debugging this issue, and 
wanted
> to ping the list for ideas.

What's the full output of "ceph -s" when this happens? Have you looked
at the MDS' admin socket's ops-in-flight, and that of the clients?

http://docs.ceph.com/docs/master/cephfs/troubleshooting/ may help some 
as well.

>
>
>
> I managed to dump the mds cache during one of the stalled moments, 
which
> hopefully is a useful starting point:
>
>
>
> e51bed37327a676e9974d740a13e173f11d1a11fdba5fbcf963b62023b06d7e8
> mdscachedump.txt.gz (https:

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
I’ll see if I can capture the output the next time this issue arises, but in 
general the output looks as if nothing is wrong. No OSDs are down, a ‘ceph 
health detail’ reports HEALTH_OK, and the mds server is in the up:active state; 
in general it’s as if nothing is wrong server side (at least from the summary).

-Chris

On 9/21/16, 10:46 AM, "Gregory Farnum" <gfar...@redhat.com> wrote:

On Wed, Sep 21, 2016 at 6:30 AM, Heller, Chris <chel...@akamai.com> wrote:
> I’m running a production 0.94.7 Ceph cluster, and have been seeing a
> periodic issue arise where in all my MDS clients will become stuck, and 
the
> fix so far has been to restart the active MDS (sometimes I need to restart
> the subsequent active MDS as well).
>
>
>
> These clients are using the cephfs-hadoop API, so there is no kernel 
client,
> or fuse api involved. When I see clients get stuck, there are messages
> printed to stderr like the following:
>
>
>
> 2016-09-21 10:31:12.285030 7fea4c7fb700  0 – 192.168.1.241:0/1606648601 >>
> 192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 l=0
> c=0x7feaa0a0c500).fault
>
>
>
> I’m at somewhat of a loss on where to begin debugging this issue, and 
wanted
> to ping the list for ideas.

What's the full output of "ceph -s" when this happens? Have you looked
at the MDS' admin socket's ops-in-flight, and that of the clients?

http://docs.ceph.com/docs/master/cephfs/troubleshooting/ may help some as 
well.

>
>
>
> I managed to dump the mds cache during one of the stalled moments, which
> hopefully is a useful starting point:
>
>
>
> e51bed37327a676e9974d740a13e173f11d1a11fdba5fbcf963b62023b06d7e8
> mdscachedump.txt.gz (https://filetea.me/t1sz3XPHxEVThOk8tvVTK5Bsg)
>
>
>
>
>
> -Chris
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
I’m running a production 0.94.7 Ceph cluster, and have been seeing a periodic 
issue arise wherein all my MDS clients will become stuck, and the fix so far 
has been to restart the active MDS (sometimes I need to restart the subsequent 
active MDS as well).

These clients are using the cephfs-hadoop API, so there is no kernel client, or 
fuse api involved. When I see clients get stuck, there are messages printed to 
stderr like the following:

2016-09-21 10:31:12.285030 7fea4c7fb700  0 – 192.168.1.241:0/1606648601 >> 
192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 l=0 
c=0x7feaa0a0c500).fault

I’m at somewhat of a loss on where to begin debugging this issue, and wanted to 
ping the list for ideas.

I managed to dump the mds cache during one of the stalled moments, which 
hopefully is a useful starting point:

e51bed37327a676e9974d740a13e173f11d1a11fdba5fbcf963b62023b06d7e8  
mdscachedump.txt.gz (https://filetea.me/t1sz3XPHxEVThOk8tvVTK5Bsg)


-Chris

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to associate a cephfs client id to its process

2016-09-14 Thread Heller, Chris
Ok. I’ll see about tracking down the logs (set to stderr for these tasks), and 
the metadata stuff looks interesting for future association.

Thanks,
Chris

On 9/14/16, 5:04 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:

On Wed, Sep 14, 2016 at 7:02 AM, Heller, Chris <chel...@akamai.com> wrote:
> I am making use of CephFS plus the cephfs-hadoop shim to replace HDFS in a
> system I’ve been experimenting with.
>
>
>
> I’ve noticed that a large number of my HDFS clients have a ‘num_caps’ 
value
> of 16385, as seen when running ‘session ls’ on the active mds. This 
appears
> to be one larger than the default value for ‘client_cache_size’ so I 
presume
> some relation, though I have not seen any documentation to corroborate 
this.
>
>
>
> What I was hoping to do is track down which ceph client is actually 
holding
> all these ‘caps’, but since my system can have work scheduled dynamically
> and multiple clients can be running on the same host, its not obvious how 
to
> associate the client ‘id’ as reported by ‘session ls’ with any one process
> on the give host.
>
>
>
> Is there steps I can follow to back track the client ‘id’ to a process id?

Hmm, it looks like we no longer directly associate the process ID with
the client session. There is a "client metadata" config option you can
fill in with arbitrary "key=value[,key2=value2]* strings if you can
persuade Hadoop to set that to something useful on each individual
process. If you have logging or admin sockets enabled then you should
also be able to find them named by client ID and trace those back to
the pid with standard linux tooling.
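
For example, something along these lines (the paths, socket names and client
id here are assumptions):

# admin sockets usually live under /var/run/ceph on the host running the client
ls /var/run/ceph/*client*.asok
# map a socket back to the process that owns it
sudo ss -xp | grep 507390467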

I've created a ticket to put this back in as part of the standard
metadata: http://tracker.ceph.com/issues/17276
-Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to associate a cephfs client id to its process

2016-09-14 Thread Heller, Chris
I am making use of CephFS plus the cephfs-hadoop shim to replace HDFS in a 
system I’ve been experimenting with.

I’ve noticed that a large number of my HDFS clients have a ‘num_caps’ value of 
16385, as seen when running ‘session ls’ on the active mds. This appears to be 
one larger than the default value for ‘client_cache_size’ so I presume some 
relation, though I have not seen any documentation to corroborate this.

What I was hoping to do is track down which ceph client is actually holding all 
these ‘caps’, but since my system can have work scheduled dynamically and 
multiple clients can be running on the same host, it’s not obvious how to 
associate the client ‘id’ as reported by ‘session ls’ with any one process on 
the given host.

Are there steps I can follow to trace the client ‘id’ back to a process id?

-Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pools per hypervisor?

2016-09-12 Thread Chris Taylor
We are using a single pool for all our RBD images.

You could create different pools based on performance and replication needs. 
Say one with all SSDs and one with SATA. Then put your RBD images in the 
appropriate pool.

Each host is also using the same user. You could use a different user for each 
hypervisor but that would be up to you.
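
If you did go the separate-user route, a minimal sketch (pool names, caps and
the user name are only examples, and the ssd/sata crush rules are left aside):

ceph osd pool create rbd-ssd 256
ceph osd pool create rbd-sata 256
ceph auth get-or-create client.hv01 mon 'allow r' \
  osd 'allow rwx pool=rbd-ssd, allow rwx pool=rbd-sata'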


Chris

 

> On Sep 11, 2016, at 9:04 PM, Thomas <tho...@tgmedia.co.nz> wrote:
> 
> Hi Guys,
> 
> Hoping to find help here as I can't seem to find anything on the net.
> 
> I have a ceph cluster and I'd want to use rbd as block storage on our 
> hypervisors (say 30) to mount drives to our guests. Would you create users 
> and pools per hypervisor?
> 
> As adding more pools to a cluster seems to be a problem if you're not sure 
> how many pools you'll end up using, e.g. I kept adding more pools with pg_num 
> 256 and now I'm at pool no. 4 and my cluster complains about 'too many PGs 
> per OSD (324 > max 300)' - any ideas ?
> 
> Cheers,
> Thomas
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph auth key generation algorithm documentation

2016-08-23 Thread Heller, Chris
I’d like to generate keys for ceph external to any system which would have 
ceph-authtool.
Looking over the ceph website and googling have turned up nothing.

Is the ceph auth key generation algorithm documented anywhere?

-Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Signature V2

2016-08-18 Thread Chris Jones
I believe RGW Hammer and below use V2 and Jewel and above use V4.
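
For reference, the s3cmd workaround being discussed is a one-line setting in
~/.s3cfg (a sketch):

# in ~/.s3cfg
signature_v2 = True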

Thanks

On Thu, Aug 18, 2016 at 7:32 AM, jan hugo prins <jpr...@betterbe.com> wrote:

> did some more searching and according to some info I found RGW should
> support V4 signatures.
>
> http://tracker.ceph.com/issues/10333
> http://tracker.ceph.com/issues/11858
>
> The fact that everyone still modifies s3cmd to use Version 2 Signatures
> suggests to me that we have a bug in this code.
>
> If I use V4 signatures most of my requests work fine, but some requests
> fail on a signature error.
>
> Thanks,
> Jan Hugo Prins
>
>
> On 08/18/2016 12:46 PM, jan hugo prins wrote:
> > Hi everyone.
> >
> > To connect to my S3 gateways using s3cmd I had to set the option
> > signature_v2 in my s3cfg to true.
> > If I didn't do that I would get Signature mismatch errors and this seems
> > to be because Amazon uses Signature version 4 while the S3 gateway of
> > Ceph only supports Signature Version 2.
> >
> > Now I see the following error in a Jave project we are building that
> > should talk to S3.
> >
> > Aug 18, 2016 12:12:38 PM org.apache.catalina.core.StandardWrapperValve
> > invoke
> > SEVERE: Servlet.service() for servlet [Default] in context with path
> > [/VehicleData] threw exception
> > com.betterbe.vd.web.servlet.LsExceptionWrapper: xxx
> > caused: com.amazonaws.services.s3.model.AmazonS3Exception: null
> > (Service: Amazon S3; Status Code: 400; Error Code:
> > XAmzContentSHA256Mismatch; Request ID:
> > tx02cc6-0057b58a15-25bba-default), S3 Extended Request
> > ID: 25bba-default-default
> > at
> > com.betterbe.vd.web.dataset.requesthandler.DatasetRequestHandler.handle(
> DatasetRequestHandler.java:262)
> > at com.betterbe.vd.web.servlet.Servlet.handler(Servlet.java:141)
> > at com.betterbe.vd.web.servlet.Servlet.doPost(Servlet.java:110)
> > at javax.servlet.http.HttpServlet.service(HttpServlet.java:646)
> >
> > To me this looks a bit the same, though I'm not a Java developer.
> > Am I correct, and if so, can I tell the Java S3 client to use Version 2
> > signatures?
> >
> >
>
> --
> Met vriendelijke groet / Best regards,
>
> Jan Hugo Prins
> Infra and Isilon storage consultant
>
> Better.be B.V.
> Auke Vleerstraat 140 E | 7547 AN Enschede | KvK 08097527
> T +31 (0) 53 48 00 694 | M +31 (0)6 26 358 951
> jpr...@betterbe.com | www.betterbe.com
>
> This e-mail is intended exclusively for the addressee(s), and may not
> be passed on to, or made available for use by any person other than
> the addressee(s). Better.be B.V. rules out any and every liability
> resulting from any electronic transmission.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Best Regards,
Chris Jones

cjo...@cloudm2.com
(p) 770.655.0770
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up

2016-08-16 Thread Heller, Chris
I’d like to understand better why the down OSD would cause the PG to get stuck 
after CRUSH was able to locate enough OSDs to map the PG.

Is this some form of safety catch that prevents it from recovering, even though 
OSD.116 is no longer important for data integrity?

Marking the OSD lost is an option here, but it’s not really lost … it just 
takes some time to get the machine rebooted.
I’m still working out my operational procedures for Ceph, and marking the OSD 
lost only to have it pop back up once the system reboots could be an issue that 
I’m not yet sure how to resolve.

Can an OSD be marked as ‘found’ once it returns to the network?

-Chris

From: Goncalo Borges <goncalo.bor...@sydney.edu.au>
Date: Monday, August 15, 2016 at 11:36 PM
To: "Heller, Chris" <chel...@akamai.com>, "ceph-users@lists.ceph.com" 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD 
are up


Hi Chris...

The precise osd set you see now [79,8,74] was obtained on epoch 104536, but this 
was after a lot of tries, as shown by the recovery section.

Actually, in the first try (on epoch 100767) osd 116 was selected somehow 
(maybe it was up at the time?), and the pg probably got stuck because it went 
down during the recovery process.

recovery_state": [
{
"name": "Started\/Primary\/Peering\/GetInfo",
"enter_time": "2016-08-11 11:45:06.052568",
"requested_info_from": []
},
{
"name": "Started\/Primary\/Peering",
"enter_time": "2016-08-11 11:45:06.052558",
"past_intervals": [
{
"first": 100767,
"last": 100777,
"maybe_went_rw": 1,
"up": [
79,
116,
74
],
"acting": [
79,
116,
74
],
"primary": 79,
"up_primary": 79
},

The pg query also shows

peering_blocked_by": [
{
"osd": 116,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us 
proceed"
}

Maybe, you can check the documentation in [1] and see if you think you could 
follow the suggestion inside the pg and mark osd 116 as lost. This should be 
done after proper evaluation from you.
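
If you do decide to go that way, the command itself is along these lines (only
once you are sure osd.116 holds nothing you still need):

ceph osd lost 116 --yes-i-really-mean-it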

Another thing I found strange is that in the recovery section, there are a lot 
of tries where you do not get a proper osd set. The very last recover try was 
on epoch 104540.

{
"first": 104536,
"last": 104540,
"maybe_went_rw": 1,
"up": [
2147483647,
8,
74
],
"acting": [
2147483647,
8,
74
],
"primary": 8,
"up_primary": 8
}

From [2], "When CRUSH fails to find enough OSDs to map to a PG, it will show as 
a 2147483647 which is ITEM_NONE or no OSD found.".

This could be an artifact of the peering being blocked by osd.116, or a genuine 
problem where you are not being able to get a proper osd set. That could be for 
a variety of reasons: from network issues, to osds being almost full or simply 
because the system can't get 3 osds in 3 different hosts.

Cheers

Goncalo


[1] 
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#placement-group-down-peering-failure

[2] 
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/

On 08/16/2016 11:42 AM, Heller, Chris wrote:
Output of `ceph pg dump_stuck`

# ceph pg dump_stuck
ok
pg_stat state   up  up_primary  

Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up

2016-08-15 Thread Heller, Chris
Output of `ceph pg dump_stuck`

# ceph pg dump_stuck
ok
pg_stat  state         up          up_primary  acting      acting_primary
4.2a8    down+peering  [79,8,74]   79          [79,8,74]   79
4.c3     down+peering  [56,79,67]  56          [56,79,67]  56

-Chris

From: Goncalo Borges <goncalo.bor...@sydney.edu.au>
Date: Monday, August 15, 2016 at 9:03 PM
To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>, "Heller, Chris" 
<chel...@akamai.com>
Subject: Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD 
are up


Hi Heller...

Can you actually post the result of

   ceph pg dump_stuck ?

Cheers

G.



On 08/15/2016 10:19 PM, Heller, Chris wrote:
I’d like to better understand the current state of my CEPH cluster.

I currently have 2 PG that are in the ‘stuck unclean’ state:

# ceph health detail
HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive; 2 pgs stuck unclean
pg 4.2a8 is stuck inactive for 124516.91, current state down+peering, last 
acting [79,8,74]
pg 4.c3 is stuck inactive since forever, current state down+peering, last 
acting [56,79,67]
pg 4.2a8 is stuck unclean for 124536.223284, current state down+peering, last 
acting [79,8,74]
pg 4.c3 is stuck unclean since forever, current state down+peering, last acting 
[56,79,67]
pg 4.2a8 is down+peering, acting [79,8,74]
pg 4.c3 is down+peering, acting [56,79,67]

While my cluster does currently have some down OSD, none are in the acting set 
for either PG:

ceph osd tree | grep down
 73   1.0  osd.73   down  0  1.0
 96   1.0  osd.96   down  0  1.0
110   1.0  osd.110  down  0  1.0
116   1.0  osd.116  down  0  1.0
120   1.0  osd.120  down  0  1.0
126   1.0  osd.126  down  0  1.0
124   1.0  osd.124  down  0  1.0
119   1.0  osd.119  down  0  1.0

I’ve queried one of the two PG, and see that recovery is currently blocked on 
OSD.116, which is indeed down, but is not part of the acting set of OSD for 
that PG:

http://pastebin.com/Rg2hK9GE

This is all with CEPH version 0.94.3:

# ceph version
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

Why does this PG remain ‘stuck unclean’?
Is there some steps I can take to unstick it, given that all the acting OSD are 
up and in?

(* Re-sent, now that I’m subscribed to list *)
-Chris




___

ceph-users mailing list

ceph-users@lists.ceph.com

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--

Goncalo Borges

Research Computing

ARC Centre of Excellence for Particle Physics at the Terascale

School of Physics A28 | University of Sydney, NSW  2006

T: +61 2 93511937
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up

2016-08-15 Thread Heller, Chris
I’d like to better understand the current state of my CEPH cluster.

I currently have 2 PG that are in the ‘stuck unclean’ state:

# ceph health detail
HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive; 2 pgs stuck unclean
pg 4.2a8 is stuck inactive for 124516.91, current state down+peering, last 
acting [79,8,74]
pg 4.c3 is stuck inactive since forever, current state down+peering, last 
acting [56,79,67]
pg 4.2a8 is stuck unclean for 124536.223284, current state down+peering, last 
acting [79,8,74]
pg 4.c3 is stuck unclean since forever, current state down+peering, last acting 
[56,79,67]
pg 4.2a8 is down+peering, acting [79,8,74]
pg 4.c3 is down+peering, acting [56,79,67]

While my cluster does currently have some down OSD, none are in the acting set 
for either PG:

ceph osd tree | grep down
 73   1.0  osd.73   down  0  1.0
 96   1.0  osd.96   down  0  1.0
110   1.0  osd.110  down  0  1.0
116   1.0  osd.116  down  0  1.0
120   1.0  osd.120  down  0  1.0
126   1.0  osd.126  down  0  1.0
124   1.0  osd.124  down  0  1.0
119   1.0  osd.119  down  0  1.0

I’ve queried one of the two PG, and see that recovery is currently blocked on 
OSD.116, which is indeed down, but is not part of the acting set of OSD for 
that PG:

http://pastebin.com/Rg2hK9GE

This is all with CEPH version 0.94.3:

# ceph version
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

Why does this PG remain ‘stuck unclean’?
Is there some steps I can take to unstick it, given that all the acting OSD are 
up and in?

(* Re-sent, now that I’m subscribed to list *)
-Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW pools type

2016-06-12 Thread Chris Jones
.rgw.buckets are all we have as EC. The remainder are replication.
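
For reference, a rough sketch of how the bucket data pool can be created as EC
up front (profile name, k/m and pg counts here are only examples):

ceph osd erasure-code-profile set rgw-ec k=4 m=2
ceph osd pool create .rgw.buckets 128 128 erasure rgw-ec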

Thanks,
CJ

On Sun, Jun 12, 2016 at 4:12 AM, Василий Ангапов <anga...@gmail.com> wrote:

> Hello!
>
> I have a question regarding RGW pools type: what pools can be Erasure
> Coded?
> More exactly, I have the following pools:
>
> .rgw.root (EC)
> ed-1.rgw.control (EC)
> ed-1.rgw.data.root (EC)
> ed-1.rgw.gc (EC)
> ed-1.rgw.intent-log (EC)
> ed-1.rgw.buckets.data (EC)
> ed-1.rgw.meta (EC)
> ed-1.rgw.users.keys (REPL)
> ed-1.rgw.users.email (REPL)
> ed-1.rgw.users.uid (REPL)
> ed-1.rgw.users.swift (REPL)
> ed-1.rgw.users (REPL)
> ed-1.rgw.log (REPL)
> ed-1.rgw.buckets.index (REPL)
> ed-1.rgw.buckets.non-ec (REPL)
> ed-1.rgw.usage (REPL)
>
> Is that ok?
>
> Regards, Vasily
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Best Regards,
Chris Jones

cjo...@cloudm2.com
(p) 770.655.0770
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Encryption for data at rest support

2016-06-02 Thread chris holcombe
Hi Swami,

Yes ceph supports encryption at rest using dmcrypt.  The docs are here:
http://docs.ceph.com/docs/jewel/rados/deployment/ceph-deploy-osd/
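
A minimal sketch of what that looks like with ceph-deploy (hostname and device
are just examples):

ceph-deploy osd create --dmcrypt osd-node1:/dev/sdb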

My team has also integrated this functionality into the ceph-osd charm,
if you'd like to try that out: https://jujucharms.com/ceph-osd/xenial/2
When combined with the ceph-mon charm you're up and running fast :)

-Chris

On 06/02/2016 03:57 AM, M Ranga Swami Reddy wrote:
> Hello,
> 
> Can you please share if the ceph supports the "data at rest" functionality?
> If yes, how can I achieve this? Please share any docs available.
> 
> Thanks
> Swami
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing pg_num

2016-05-16 Thread Chris Dunlop
Hi Christian,

On Tue, May 17, 2016 at 10:41:52AM +0900, Christian Balzer wrote:
> On Tue, 17 May 2016 10:47:15 +1000 Chris Dunlop wrote:
> Most your questions would be easily answered if you did spend a few
> minutes with even the crappiest test cluster and observing things (with
> atop and the likes). 

You're right of course. I'll set up a test cluster and start experimenting,
which I should have done before asking questions here.

> To wit, this is a test pool (12) created with 32 PGs and slightly filled
> with data via rados bench:
> ---
> # ls -la /var/lib/ceph/osd/ceph-8/current/ |grep "12\."
> drwxr-xr-x   2 root root  4096 May 17 10:04 12.13_head
> drwxr-xr-x   2 root root  4096 May 17 10:04 12.1e_head
> drwxr-xr-x   2 root root  4096 May 17 10:04 12.b_head
> # du -h /var/lib/ceph/osd/ceph-8/current/12.13_head/
> 121M/var/lib/ceph/osd/ceph-8/current/12.13_head/
> ---
> 
> After increasing that to 128 PGs we get this:
> ---
> # ls -la /var/lib/ceph/osd/ceph-8/current/ |grep "12\."
> drwxr-xr-x   2 root root  4096 May 17 10:18 12.13_head
> drwxr-xr-x   2 root root  4096 May 17 10:18 12.1e_head
> drwxr-xr-x   2 root root  4096 May 17 10:18 12.2b_head
> drwxr-xr-x   2 root root  4096 May 17 10:18 12.33_head
> drwxr-xr-x   2 root root  4096 May 17 10:18 12.3e_head
> drwxr-xr-x   2 root root  4096 May 17 10:18 12.4b_head
> drwxr-xr-x   2 root root  4096 May 17 10:18 12.53_head
> drwxr-xr-x   2 root root  4096 May 17 10:18 12.5e_head
> drwxr-xr-x   2 root root  4096 May 17 10:18 12.6b_head
> drwxr-xr-x   2 root root  4096 May 17 10:18 12.73_head
> drwxr-xr-x   2 root root  4096 May 17 10:18 12.7e_head
> drwxr-xr-x   2 root root  4096 May 17 10:18 12.b_head
> # du -h /var/lib/ceph/osd/ceph-8/current/12.13_head/
> 25M /var/lib/ceph/osd/ceph-8/current/12.13_head/
> ---
> 
> Now this was fairly uneventful even on my crappy test cluster, given the
> small amount of data (which was mostly cached) and the fact that it's idle.
> 
> However consider this with 100's of GB per PG and a busy cluster and you
> get the idea where massive and very disruptive I/O comes from.

Per above, I'll experiment with this, but my first thought is I suspect
that's moving object/data files around rather than copying data, so the
overheads are in directory operations rather than data copies - not that
directory operations are free either of course.

>> Hmmm, is there a generic command-line(ish) way of determining the number
>> of OSDs involved in a pool?
>> 
> Unless you have a pool with a very small pg_num and a very large cluster
> the answer usually tends to be "all of them".

Or, as in my case, several completely independent pools (i.e. different
OSDs) in the one cluster.

> And google ("ceph number of osds per pool") is your friend:
> 
> http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd

Crap. And I was just looking at that very page yesterday, in the context of
the distribution of the PGs, and completely forgot about the SUM part.
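
For the archives, here's a rough (untested) one-liner in the same spirit,
counting the distinct OSDs appearing in the up sets of one pool's PGs (pool
id 12 is just an example):

ceph pg dump pgs_brief 2>/dev/null \
  | awk '$1 ~ /^12\./ { gsub(/[][]/, "", $3); n = split($3, a, ","); for (i = 1; i <= n; i++) print a[i] }' \
  | sort -un | wc -l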

Thanks for taking the time to respond.

Chris.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing pg_num

2016-05-16 Thread Chris Dunlop
On Mon, May 16, 2016 at 10:40:47PM +0200, Wido den Hollander wrote:
> > Op 16 mei 2016 om 7:56 schreef Chris Dunlop <ch...@onthe.net.au>:
> > Why do we have both pg_num and pgp_num? Given the docs say "The pgp_num
> > should be equal to the pg_num": under what circumstances might you want
> > these different, apart from when actively increasing pg_num first then
> > increasing pgp_num to match? (If they're supposed to be always the same, why
> > not have a single parameter and do the "increase pg_num, then pgp_num"
> > within ceph's internals?)
> 
> pg_num is the actual amount of PGs. This you can increase without any actual 
> data moving.
> 
> pgp_num is the number CRUSH uses in the calculations. pgp_num can't be 
> greater than pg_num for that reason.

OK, I understand that from the docs. But why are they two separate
parameters? E.g., why might you increase pg_num and not pgp_num?  Or are the
two parameters purely to separate splitting the PGs (pg_num) from moving
data around (pgp_num)?

> You can slowly increase pgp_num to make sure not all your data moves at the 
> same time.

Why slowly increase pgp_num rather than rely on "osd max backfills"?  I.e.
what downsides are there to setting "osd max backfills" as appropriate,
increasing pg_num in small steps to the target, then increasing pgp_num to
the target in one step?
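
For concreteness, the two knobs in question are set like this (pool name and
numbers are only examples):

ceph osd pool set rbd pg_num 1126     # split step
ceph osd pool set rbd pgp_num 1126    # data movement step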

If you're slowly increasing pgp_num, is the recommendation to "increase
pg_num a bit, increase pgp_num a bit, repeat till target is reached" (and
thus potentially moving some data multiple times), or is the recommendation
to "increase pg_num a bit step by step to the target, then increase pgp_num
bit by bit to the target"?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing pg_num

2016-05-16 Thread Chris Dunlop
On Tue, May 17, 2016 at 08:21:48AM +0900, Christian Balzer wrote:
> On Mon, 16 May 2016 22:40:47 +0200 (CEST) Wido den Hollander wrote:
> > 
> > pg_num is the actual amount of PGs. This you can increase without any
> > actual data moving.
>
> Yes and no.
> 
> Increasing the pg_num will split PGs, which causes potentially massive I/O.
> Also AFAIK that I/O isn't regulated by the various recovery and backfill
> parameters.

Where is this potentially massive I/O coming from? I have this naive concept
that the PGs are mathematically-calculated buckets, so splitting them would
involve little or no I/O, although I can imagine there are management
overheads (cpu, memory) involved in correctly maintaining state during the
splitting process.

> That's probably why recent Ceph versions will only let you increase pg_num
> in smallish increments. 

Oh, I wasn't aware of that!

Ok, so it looks like it's mon_osd_max_split_count, introduced by commit
d8ccd73. Unfortunately it seems to be missing from the ceph docs. It's
mentioned in the Suse docs:

https://www.suse.com/documentation/ses-2/singlehtml/book_storage_admin/book_storage_admin.html#storage.bp.cluster_mntc.add_pgnum

...although, if I'm understanding "mon_osd_max_split_count" correctly, their
script for calculating the maximum to which you can increase pg_num is
incorrect in that it's calculating "current pg_num +
mon_osd_max_split_count" when it should be "current pg_num +
(mon_osd_max_split_count * number of pool OSDs)".
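
(If I'm reading that right, then with mon_osd_max_split_count at its default
of 32 and a pool spread across 30 OSDs, a pool at pg_num 1024 could be raised
to at most 1024 + 32 * 30 = 1984 in one step.)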

Hmmm, is there a generic command-line(ish) way of determining the number of
OSDs involved in a pool?

> Moving data (as in redistributing amongst the OSD based on CRUSH) will
> indeed not happen until pgp_num is also increased. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.7 Hammer released

2016-05-16 Thread Chris Dunlop
On Fri, May 13, 2016 at 10:21:51AM -0400, Sage Weil wrote:
> This Hammer point release fixes several minor bugs. It also includes a 
> backport of an improved ‘ceph osd reweight-by-utilization’ command for 
> handling OSDs with higher-than-average utilizations.
> 
> We recommend that all hammer v0.94.x users upgrade.

Per http://download.ceph.com/debian-hammer/pool/main/c/ceph/

ceph-common_0.94.7-1trusty_amd64.deb    11-May-2016 16:08  5959876
ceph-common_0.94.7-1xenial_amd64.deb    11-May-2016 15:54  6037236
ceph-common_0.94.7-1xenial_arm64.deb    11-May-2016 16:06  5843722
ceph-common_0.94.7-1~bpo80+1_amd64.deb  11-May-2016 16:08  6028036

Once again, no debian wheezy (~bpo70) version?

Ubuntu Precise missed out this time too.

Oddly, the date on the previously released wheezy version changed at the
same time as the 0.94.7 releases above; it was previously 15-Dec-2015 15:32:

ceph-common_0.94.5-1~bpo70+1_amd64.deb  11-May-2016 15:57  9868188


Cheers,

Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Increasing pg_num

2016-05-15 Thread Chris Dunlop
Hi,

I'm trying to understand the potential impact on an active cluster of
increasing pg_num/pgp_num.

The conventional wisdom, as gleaned from the mailing lists and general
google fu, seems to be to increase pg_num followed by pgp_num, both in
small increments, to the target size, using "osd max backfills" (and
perhaps "osd recovery max active"?) to control the rate and thus
performance impact of data movement.

I'd really like to understand what's going on rather than "cargo culting"
it.

I'm currently on Hammer, but I'm hoping the answers are broadly applicable
across all versions for others following the trail.

Why do we have both pg_num and pgp_num? Given the docs say "The pgp_num
should be equal to the pg_num": under what circumstances might you want
these different, apart from when actively increasing pg_num first then
increasing pgp_num to match? (If they're supposed to be always the same, why
not have a single parameter and do the "increase pg_num, then pgp_num"
within ceph's internals?)

What do "osd backfill scan min" and "osd backfill scan max" actually
control? The docs say "The minimum/maximum number of objects per backfill
scan" but what does this actually mean and how does it affect the impact (if
at all)?

Is "osd recovery max active" actually relevant to this situation? It's
mentioned in various places related to increasing pg_num/pgp_num but my
understanding is it's related to recovery (e.g. osd falls out and comes
back again and needs to catch up) rather than back filling (migrating
pgs misplaced due to increasing pg_num, crush map changes etc.)

Previously (back in Dumpling days):


http://article.gmane.org/gmane.comp.file-systems.ceph.user/11490

From: Gregory Farnum
Subject: Re: Throttle pool pg_num/pgp_num increase impact
Newsgroups: gmane.comp.file-systems.ceph.user
Date: 2014-07-08 17:01:30 GMT

On Tuesday, July 8, 2014, Kostis Fardelas wrote:
> Should we be worried that the pg/pgp num increase on the bigger pool will
> have a 300X larger impact?

The impact won't be 300 times bigger, but it will be bigger. There are two
things impacting your cluster here

1) the initial "split" of the affected PGs into multiple child PGs. You can
mitigate this by stepping through pg_num at small multiples.
2) the movement of data to its new location (when you adjust pgp_num). This
can be adjusted by setting the "OSD max backfills" and related parameters;
check the docs.
-Greg


Am I correct in thinking "small multiples" in this context is along the lines
of "1.1" rather than "2" or "4"?

Is there really much impact when increasing pg_num in a single large step
e.g. 1024 to 4096? If so, what causes this impact? An initial trial of
increasing pg_num by 10% (1024 to 1126) on one of my pools showed it
completed in a matter of tens of seconds, too short to really measure any
performance impact. But I'm concerned this could be exponential to the size
of the step such that increasing by a large step (e.g. the rest of the way
from 1126 to 4096) could cause problems.

Given the use of "osd max backfills" to limit the impact of the data
movement associated with increasing pgp_num, is there any advantage or
disadvantage to increasing pgp_num in small increments (e.g. 10% at a time)
vs "all at once", apart from small increments likely moving some data
multiple times? E.g. with a large step is there a higher potential for
problems if something else happens to the cluster the same time (e.g. an OSD
dies) because the current state of the system is further from the expected
state, or something like that?

If small increments of pgp_num are advisable, should the process be
"increase pg_num by a small increment, increase pgp_num to match, repeat
until target reached", or is that no advantage to increasing pg_num (in
multiple small increments or single large step) to the target, then
increasing pgp_num in small increments to the target - and why?

Given that increasing pg_num/pgp_num seem almost inevitable for a growing
cluster, and that increasing these can be one of the most
performance-impacting operations you can perform on a cluster, perhaps a
document going into these details would be appropriate?

Cheers,

Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Maximum MON Network Throughput Requirements

2016-05-02 Thread Chris Jones
Mons and RGWs only use the public network, but Mons can have a good deal of
traffic. I would not recommend 1Gb; if you are looking for lower bandwidth then
10Gb would be good for most. It all depends on the overall size of the
cluster. You mentioned 40Gb. If the nodes are high density then 40Gb, but if
they are lower density then 20Gb would be fine.

-CJ

On Mon, May 2, 2016 at 12:09 PM, Brady Deetz <bde...@gmail.com> wrote:

> I'm working on finalizing designs for my Ceph deployment. I'm currently
> leaning toward 40gbps ethernet for interconnect between OSD nodes and to my
> MDS servers. But, I don't really want to run 40 gig to my mon servers
> unless there is a reason. Would there be an issue with using 1 gig on my
> monitor servers?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Best Regards,
Chris Jones

cjo...@cloudm2.com
(p) 770.655.0770
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] adding cache tier in productive hammer environment

2016-04-07 Thread Chris Taylor
 

Hi Oliver, 

Have you tried tuning some of the cluster settings to fix the IO errors
in the VMs? 

We found some of the same issues when reweighting, backfilling and
removing large snapshots. By minimizing the number of concurrent
backfills and prioritizing client IO we can now add/remove OSDs without
the VMs throwing those nasty IO errors. 

We have been running a 3 node cluster for about a year now on Hammer
with 45 2TB SATA OSDs and no SSDs. It's backing KVM hosts and RBD
images. 

Here are the things we changed: 

ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-max-recovery-threads 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.1' 
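
If you want these to survive OSD restarts, the equivalent ceph.conf entries
should look something like this (a sketch; double-check the option names
against your version):

[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
osd client op priority = 63
osd snap trim sleep = 0.1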

Recovery may take a little longer while backfilling, but the cluster is
still responsive and we have happy VMs now. 

I've collected these from various posts from the ceph-users list. 

Maybe they will help you if you haven't tried them already. 

Chris 

On 2016-04-07 4:18 am, Oliver Dzombic wrote: 

> Hi Christian,
> 
> thank you for answering, i appriciate your time !
> 
> ---
> 
> Its used for RBD hosted vm's and also cephfs hosted vm's.
> 
> Well the basic problem is/was that single OSD's simply go out/down.
> Ending in SATA BUS error's for the VM's which have to be rebooted, if
> they anyway can, because as long as OSD's are missing in that szenario,
> the customer cant start their vm's.
> 
> Installing/checking munin discovered a very high drive utilization. And
> this way simply an overload of the cluster.
> 
> The initial setup was 4 nodes, with 4x mon and each 3x 6 TB HDD and 1x
> SSD for journal.
> 
> So i started to add more OSD's ( 2 nodes, with 3x 6 TB HDD and 1x SSD
> for journal ). And, as first aid, reducing the replication from 3 to 2
> to reduce the (write) load of the cluster.
> 
> I planed to wait until the new LTS is out, but i already added now
> another node with 10x 3 TB HDD and 2x SSD for journal and 2-3x SSD for
> tier cache ( changing strategy and increasing the number of drives while
> reducing the size - was an design mistake from me ).
> 
> osdmap e31602: 28 osds: 28 up, 28 in
> flags noscrub,nodeep-scrub
> pgmap v13849513: 1428 pgs, 5 pools, 19418 GB data, 4932 kobjects
> 39270 GB used, 88290 GB / 124 TB avail
> 1428 active+clean
> 
> The range goes from 200 op/s to around 5000 op/s.
> 
> The current avarage drive utilization is 20-30%.
> 
> If we have backfill ( osd out/down ) or reweight the utilization of HDD
> drives is streight 90-100%.
> 
> Munin shows on all drives ( except the SSD's ) a dislatency of avarage
> 170 ms. A minumum of 80-130 ms, and a maximum of 300-600ms.
> 
> Currently, the 4 initial nodes are in datacenter A and the 3 other nodes
> are, together with most of the VM's in datacenter B.
> 
> I am currently cleaning the 4 initial nodes by doing
> 
> ceph osd reweight to peut a peut reducing the usage, to remove the osd's
> completely from there and just keeping up the monitors.
> 
> The complete cluster have to move to one single datacenter together with
> all VM's.
> 
> ---
> 
> I am reducing the number of nodes because out of administrative view,
> its not very handy. I prefere extending the hardware power in terms of
> CPU, RAM and HDD.
> 
> So the endcluster will look like:
> 
> 3x OSD Nodes, each:
> 
> 2x E5-2620v3 CPU, 128 GB RAM, 2x 10 Gbit Network, Adaptec HBA 1000-16e
> to connect to external JBOD servers holding the cold storage HDD's.
> Maybe ~ 24 drives in 2 or 3 TB SAS or SATA 7200 RPM's.
> 
> I think SAS is, because of the reduces access times ( 4/5 ms vs. 10 ms )
> very useful in a ceph environment. But then again, maybe with a cache
> tier the impact/difference is not really that big.
> 
> That together with Samsung SM863 240 GB SSD's for journal and cache
> tier, connected to the board directly or to a seperated Adaptec HBA
> 1000-16i.
> 
> So far the current idea/theory/plan.
> 
> ---
> 
> But to that point, its a long road. Last night i was doing a reweight of
> 3 OSD's from 1.0 to 0.9 ending up in one hdd was going down/out, so i
> had to restart the osd. ( with again IO errors in some of the vm's ).
> 
> So based on your article, the cache tier solved your problem, and i
> think i have basically the same.
> 
> ---
> 
> So a very good hint is, to activate the whole tier cache in the night,
> when things are a bit more smooth.
> 
> Any suggestions / critics / advices are highly welcome :-)
> 
> Thank you!
> 
> -- 
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Inter

[ceph-users] OSD mounts without BTRFS compression

2016-03-26 Thread Chris Murray
Hello all,

Please can someone offer some advice. In ceph.conf, I use:

osd_mkfs_type = btrfs
osd_mount_options_btrfs = noatime,nodiratime,compress-force=lzo
filestore btrfs snap = false

However, some of my OSDs are becoming much more full than others, as not
all are being mounted with the compress-force option. Is this a CEPH
issue or a BTRFS issue? Or other?

 

Take one host. Note that sdc1 is mounted twice, neither of which has
the compress-force option.

/dev/sdc1 on /var/lib/ceph/tmp/mnt.AywYKY type btrfs
(rw,noatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
/dev/sdd1 on /var/lib/ceph/osd/ceph-16 type btrfs
(rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=/)
/dev/sdc1 on /var/lib/ceph/osd/ceph-15 type btrfs
(rw,noatime,nodiratime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
/dev/sdb1 on /var/lib/ceph/osd/ceph-20 type btrfs
(rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=/)

After a reboot, it's sdd1 this time.

/dev/sdd1 on /var/lib/ceph/tmp/mnt.kWh2NA type btrfs
(rw,noatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
/dev/sdd1 on /var/lib/ceph/osd/ceph-16 type btrfs
(rw,noatime,nodiratime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
/dev/sdc1 on /var/lib/ceph/osd/ceph-15 type btrfs
(rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=/)
/dev/sdb1 on /var/lib/ceph/osd/ceph-20 type btrfs
(rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=/)

 

Where should I look next?  I'm on 0.94.6
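
In the meantime, the stop-gap I'm considering (untested) is remounting the
affected OSD with the intended options:

mount -o remount,noatime,nodiratime,compress-force=lzo /var/lib/ceph/osd/ceph-15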

 

Thanks in advance,

Chris

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-03-22 Thread Chris Dunlop
On Wed, Mar 23, 2016 at 01:22:45AM +0100, Loic Dachary wrote:
> On 23/03/2016 01:12, Chris Dunlop wrote:
>> On Wed, Mar 23, 2016 at 01:03:06AM +0100, Loic Dachary wrote:
>>> On 23/03/2016 00:39, Chris Dunlop wrote:
>>>> "The old OS'es" that were being supported up to v0.94.5 includes debian
>>>> wheezy. It would be quite surprising and unexpected to drop support for an
>>>> OS in the middle of a stable series.
>>>
>>> I'm unsure if wheezy is among the old OS'es. It predates my involvement in 
>>> the stable releases effort. I know for sure el6 and 12.04 are supported for 
>>> 0.94.x. 
>> 
>> From http://download.ceph.com/debian-hammer/pool/main/c/ceph/
>> 
>> ceph-common_0.94.1-1~bpo70+1_i386.deb  15-Dec-2015 15:32  10217628
>> ceph-common_0.94.3-1~bpo70+1_amd64.deb 19-Oct-2015 18:54   9818964
>> ceph-common_0.94.4-1~bpo70+1_amd64.deb 26-Oct-2015 20:48   9868020
>> ceph-common_0.94.5-1~bpo70+1_amd64.deb 15-Dec-2015 15:32   9868188
>> 
>> That's all debian wheezy.
>> 
>> (Huh. I'd never noticed 0.94.1 was i386 only!)
>> 
> 
> Indeed. Were these packages created as a lucky side effect or because there 
> was a commitment at some point ? I'm curious to know the answer as well :-)

Who would know?  Sage?  (cc'ed)

Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-03-22 Thread Chris Dunlop
Hi Loïc,

On Wed, Mar 23, 2016 at 01:03:06AM +0100, Loic Dachary wrote:
> On 23/03/2016 00:39, Chris Dunlop wrote:
>> "The old OS'es" that were being supported up to v0.94.5 includes debian
>> wheezy. It would be quite surprising and unexpected to drop support for an
>> OS in the middle of a stable series.
> 
> I'm unsure if wheezy is among the old OS'es. It predates my involvement in 
> the stable releases effort. I know for sure el6 and 12.04 are supported for 
> 0.94.x. 

From http://download.ceph.com/debian-hammer/pool/main/c/ceph/

ceph-common_0.94.1-1~bpo70+1_i386.deb  15-Dec-2015 15:32  10217628
ceph-common_0.94.3-1~bpo70+1_amd64.deb 19-Oct-2015 18:54   9818964
ceph-common_0.94.4-1~bpo70+1_amd64.deb 26-Oct-2015 20:48   9868020
ceph-common_0.94.5-1~bpo70+1_amd64.deb 15-Dec-2015 15:32   9868188

That's all debian wheezy.

(Huh. I'd never noticed 0.94.1 was i386 only!)

Cheers,

Chris,
OnTheNet
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-03-22 Thread Chris Dunlop
Hi Loïc,

On Wed, Mar 23, 2016 at 12:14:27AM +0100, Loic Dachary wrote:
> On 22/03/2016 23:49, Chris Dunlop wrote:
>> Hi Stable Release Team for v0.94,
>> 
>> Let's try again... Any news on a release of v0.94.6 for debian wheezy 
>> (bpo70)?
> 
> I don't think publishing a debian wheezy backport for v0.94.6 is planned. 
> Maybe it's a good opportunity to initiate a community effort ? Would you like 
> to work with me on this ?

It's my understanding, from statements by both Sage and yourself, that
existing OS'es would continue to be supported in the stable series, e.g.:

 On Wed, Mar 02, 2016 at 06:32:18PM +0700, Loic Dachary wrote:
 > I think you misread what Sage wrote : "The intention was to continue
 > building stable releases (0.94.x) on the old list of supported platforms
 > (which inclues 12.04 and el6)". In other words, the old OS'es are still
 > supported. Their absence is a glitch in the release process that will be
 > fixed.

"The old OS'es" that were being supported up to v0.94.5 includes debian
wheezy. It would be quite surprising and unexpected to drop support for an
OS in the middle of a stable series.

If that is indeed what's happening, and it's not just an oversight, I'd
prefer to put my efforts into moving to a supported OS rather than keeping
the older OS on life support.

Just to be clear, I understand it is quite a burden maintaining releases for
old OSes; I'm only voicing mild surprise and a touch of regret: I'm very
happy with the Ceph project!

Cheers,

Chris,
OnTheNet
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-03-22 Thread Chris Dunlop
Hi Stable Release Team for v0.94,

Let's try again... Any news on a release of v0.94.6 for debian wheezy (bpo70)?

Cheers,

Chris

On Thu, Mar 17, 2016 at 12:43:15PM +1100, Chris Dunlop wrote:
> Hi Chen,
> 
> On Thu, Mar 17, 2016 at 12:40:28AM +, Chen, Xiaoxi wrote:
>> It’s already there, in 
>> http://download.ceph.com/debian-hammer/pool/main/c/ceph/.
> 
> I can only see ceph*_0.94.6-1~bpo80+1_amd64.deb there. Debian wheezy would
> be bpo70.
> 
> Cheers,
> 
> Chris
> 
>> On 3/17/16, 7:20 AM, "Chris Dunlop" <ch...@onthe.net.au> wrote:
>> 
>>> Hi Stable Release Team for v0.94,
>>>
>>> On Thu, Mar 10, 2016 at 11:00:06AM +1100, Chris Dunlop wrote:
>>>> On Wed, Mar 02, 2016 at 06:32:18PM +0700, Loic Dachary wrote:
>>>>> I think you misread what Sage wrote : "The intention was to
>>>>> continue building stable releases (0.94.x) on the old list of
>>>>> supported platforms (which inclues 12.04 and el6)". In other
>>>>> words, the old OS'es are still supported. Their absence is a
>>>>> glitch in the release process that will be fixed.
>>>> 
>>>> Any news on a release of v0.94.6 for debian wheezy?
>>>
>>> Any news on a release of v0.94.6 for debian wheezy?
>>>
>>> Cheers,
>>>
>>> Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-03-19 Thread Chris Dunlop
Hi Stable Release Team for v0.94,

On Thu, Mar 10, 2016 at 11:00:06AM +1100, Chris Dunlop wrote:
> On Wed, Mar 02, 2016 at 06:32:18PM +0700, Loic Dachary wrote:
>> I think you misread what Sage wrote : "The intention was to
>> continue building stable releases (0.94.x) on the old list of
>> supported platforms (which inclues 12.04 and el6)". In other
>> words, the old OS'es are still supported. Their absence is a
>> glitch in the release process that will be fixed.
> 
> Any news on a release of v0.94.6 for debian wheezy?

Any news on a release of v0.94.6 for debian wheezy?

Cheers,

Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-03-19 Thread Chris Dunlop
Hi Chen,

On Thu, Mar 17, 2016 at 12:40:28AM +, Chen, Xiaoxi wrote:
> It’s already there, in 
> http://download.ceph.com/debian-hammer/pool/main/c/ceph/.

I can only see ceph*_0.94.6-1~bpo80+1_amd64.deb there. Debian wheezy would
be bpo70.

Cheers,

Chris
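
For reference, a quick way to see exactly which 0.94.6 builds are published
is to list the pool directory above. This is only a sketch using curl and
grep, and the package-name pattern is an assumption about how the .deb files
are named:

  # list the v0.94.6 .deb files currently published for hammer
  curl -s http://download.ceph.com/debian-hammer/pool/main/c/ceph/ \
      | grep -o 'ceph[a-z0-9-]*_0\.94\.6[^"]*\.deb' | sort -u
  # a wheezy backport would carry a "~bpo70" suffix; only "~bpo80"
  # (jessie) builds show up at the moment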

> On 3/17/16, 7:20 AM, "Chris Dunlop" <ch...@onthe.net.au> wrote:
> 
>> Hi Stable Release Team for v0.94,
>>
>> On Thu, Mar 10, 2016 at 11:00:06AM +1100, Chris Dunlop wrote:
>>> On Wed, Mar 02, 2016 at 06:32:18PM +0700, Loic Dachary wrote:
>>>> I think you misread what Sage wrote : "The intention was to
>>>> continue building stable releases (0.94.x) on the old list of
>>>> supported platforms (which includes 12.04 and el6)". In other
>>>> words, the old OS'es are still supported. Their absence is a
>>>> glitch in the release process that will be fixed.
>>> 
>>> Any news on a release of v0.94.6 for debian wheezy?
>>
>> Any news on a release of v0.94.6 for debian wheezy?
>>
>> Cheers,
>>
>> Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-03-09 Thread Chris Dunlop
Hi Loic,

On Wed, Mar 02, 2016 at 06:32:18PM +0700, Loic Dachary wrote:
> I think you misread what Sage wrote : "The intention was to
> continue building stable releases (0.94.x) on the old list of
> supported platforms (which includes 12.04 and el6)". In other
> words, the old OS'es are still supported. Their absence is a
> glitch in the release process that will be fixed.

Any news on a release of v0.94.6 for debian wheezy?

Cheers,

Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Restrict cephx commands

2016-03-01 Thread chris holcombe
Hey Ceph Users!

I'm wondering if it's possible to restrict the ceph keyring to only
being able to run certain commands.  I think the answer to this is no
but I just wanted to ask.  I haven't seen any documentation indicating
whether or not this is possible.  Anyone know?

Thanks,
Chris
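
For what it's worth, the closest mechanism appears to be monitor caps that
match individual commands; below is only a hedged sketch (the exact cap
grammar may vary between releases, and OSD/MDS caps remain r/w/x or profile
based rather than per-command):

  # create a key that can only run "ceph pg dump" against the monitors
  ceph auth get-or-create client.restricted \
      mon 'allow command "pg dump"' \
      -o /etc/ceph/ceph.client.restricted.keyring
  # confirm what the key is allowed to do
  ceph auth get client.restricted
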
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-03-01 Thread Chris Dunlop
Hi,

The "old list of supported platforms" includes debian wheezy.
Will v0.94.6 be built for this?

Chris

On Mon, Feb 29, 2016 at 10:57:53AM -0500, Sage Weil wrote:
> The intention was to continue building stable releases (0.94.x) on the old 
> list of supported platforms (which includes 12.04 and el6).  I think it was 
> just an oversight that they weren't built this time around.  I think the 
> overhead of doing so is just keeping a 12.04 and el6 jenkins build slave 
> around.
> 
> Doing these builds in the existing environment sounds much better than 
> trying to pull in externally built binaries...
> 
> sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Another corruption detection/correction question - exposure between 'event' and 'repair'?

2015-12-23 Thread Chris Murray
After messing up some of my data in the past (my own doing, playing with
BTRFS in old kernels), I've been extra cautious and now run a ZFS mirror
across multiple RBD images. It's led me to believe that I have a faulty
SSD in one of my hosts:

sdb without a journal - fine (but slow)
sdc without a journal - fine (but slow)
sdd without a journal - fine (but slow)

sdb with sda4 as journal - checksum errors appear in ZFS
sdc with sda5 as journal - checksum errors appear in ZFS
sdd with sda6 as journal - checksum errors appear in ZFS

So, I believe the SSD in sda is in some way defective, but my question
is around the detection and correction of this 'corruption'.

"nodeep-scrub flag(s) set" currently, due to the performance impact.
But, when deep scrubbing is allowed to run, it seems to find problems, which
I can then repair. However ... is this a safe repair, using a good copy of
each object? Will it be with NewStore? I still seem to get errors regularly
bubbling their way up into ZFS, but I can't reliably ascertain whether
they're the result of a corruption which happened *before* the next Ceph
deep scrub (therefore still exposed anyway in this timeframe?), or *after*
a repair?

I'm obviously hoping for an eventual scenario where this is all
transparent to the ZFS layer and it stops detecting checksum errors :)

Thanks,
Chris
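
For anyone following along, the scrub/repair cycle being described boils
down to roughly this (a sketch only; the pg id is a placeholder):

  # re-enable deep scrubbing (it is currently disabled cluster-wide)
  ceph osd unset nodeep-scrub
  # list any pgs that scrubbing has flagged as inconsistent
  ceph health detail | grep inconsistent
  # deep-scrub and then repair one pg (placeholder id)
  ceph pg deep-scrub 2.5f
  ceph pg repair 2.5f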

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg stuck in peering state

2015-12-18 Thread Chris Dunlop
Hi Reno,

"Peering", as far as I understand it, is the osds trying to talk to each
other.

You have approximately 1 OSD worth of pgs stuck (i.e. 264 / 8), and osd.0
appears in each of the stuck pgs, alongside either osd.2 or osd.3.

I'd start by checking the comms between osd.0 and osds 2 and 3 (including
the MTU).

Cheers,

Chris
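
If the network checks out, asking one of the stuck pgs directly usually
shows what peering is waiting on; a rough example using pg 4.2d from the
dump_stuck output below:

  # dump the pg's peering state; in "recovery_state" look for
  # "peering_blocked_by" and the osds it is still trying to probe
  ceph pg 4.2d query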


On Fri, Dec 18, 2015 at 02:50:18PM +0100, Reno Rainz wrote:
> Hi all,
> 
> I reboot all my osd node after, I got some pg stuck in peering state.
> 
> root@ceph-osd-3:/var/log/ceph# ceph -s
> cluster 186717a6-bf80-4203-91ed-50d54fe8dec4
>  health HEALTH_WARN
> clock skew detected on mon.ceph-osd-2
> 33 pgs peering
> 33 pgs stuck inactive
> 33 pgs stuck unclean
> Monitor clock skew detected
>  monmap e1: 3 mons at {ceph-osd-1=
> 10.200.1.11:6789/0,ceph-osd-2=10.200.1.12:6789/0,ceph-osd-3=10.200.1.13:6789/0
> }
> election epoch 14, quorum 0,1,2 ceph-osd-1,ceph-osd-2,ceph-osd-3
>  osdmap e66: 8 osds: 8 up, 8 in
>   pgmap v1346: 264 pgs, 3 pools, 272 MB data, 653 objects
> 808 MB used, 31863 MB / 32672 MB avail
>  231 active+clean
>   33 peering
> root@ceph-osd-3:/var/log/ceph#
> 
> 
> root@ceph-osd-3:/var/log/ceph# ceph pg dump_stuck
> ok
> pg_stat state up up_primary acting acting_primary
> 4.2d peering [2,0] 2 [2,0] 2
> 1.57 peering [3,0] 3 [3,0] 3
> 1.24 peering [3,0] 3 [3,0] 3
> 1.52 peering [0,2] 0 [0,2] 0
> 1.50 peering [2,0] 2 [2,0] 2
> 1.23 peering [3,0] 3 [3,0] 3
> 4.54 peering [2,0] 2 [2,0] 2
> 4.19 peering [3,0] 3 [3,0] 3
> 1.4b peering [0,3] 0 [0,3] 0
> 1.49 peering [0,3] 0 [0,3] 0
> 0.17 peering [0,3] 0 [0,3] 0
> 4.17 peering [0,3] 0 [0,3] 0
> 4.16 peering [0,3] 0 [0,3] 0
> 0.10 peering [0,3] 0 [0,3] 0
> 1.11 peering [0,2] 0 [0,2] 0
> 4.b peering [0,2] 0 [0,2] 0
> 1.3c peering [0,3] 0 [0,3] 0
> 0.c peering [0,3] 0 [0,3] 0
> 1.3a peering [3,0] 3 [3,0] 3
> 0.38 peering [2,0] 2 [2,0] 2
> 1.39 peering [0,2] 0 [0,2] 0
> 4.33 peering [2,0] 2 [2,0] 2
> 4.62 peering [2,0] 2 [2,0] 2
> 4.3 peering [0,2] 0 [0,2] 0
> 0.6 peering [0,2] 0 [0,2] 0
> 0.4 peering [2,0] 2 [2,0] 2
> 0.3 peering [2,0] 2 [2,0] 2
> 1.60 peering [0,3] 0 [0,3] 0
> 0.2 peering [3,0] 3 [3,0] 3
> 4.6 peering [3,0] 3 [3,0] 3
> 1.30 peering [0,3] 0 [0,3] 0
> 1.2f peering [0,2] 0 [0,2] 0
> 1.2a peering [3,0] 3 [3,0] 3
> root@ceph-osd-3:/var/log/ceph#
> 
> 
> root@ceph-osd-3:/var/log/ceph# ceph osd tree
> ID WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -9 4.0 root default
> -8 4.0 region eu-west-1
> -6 2.0 datacenter eu-west-1a
> -2 2.0 host ceph-osd-1
>  0 1.0 osd.0  up  1.0  1.0
>  1 1.0 osd.1  up  1.0  1.0
> -4 2.0 host ceph-osd-3
>  4 1.0 osd.4  up  1.0  1.0
>  5 1.0 osd.5  up  1.0  1.0
> -7 2.0 datacenter eu-west-1b
> -3 2.0 host ceph-osd-2
>  2 1.0 osd.2  up  1.0  1.0
>  3 1.0 osd.3  up  1.0  1.0
> -5 2.0 host ceph-osd-4
>  6 1.0 osd.6  up  1.0  1.0
>  7 1.0 osd.7  up  1.0  1.0
> root@ceph-osd-3:/var/log/ceph#
> 
> Do you have guys any idea ? Why they stay in this state ?

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deploying a Ceph storage cluster using Warewulf on Centos-7

2015-12-17 Thread Chris Jones
Hi Chu,

If you can use Chef then:
https://github.com/ceph/ceph-chef

An example of an actual project can be found at:
https://github.com/bloomberg/chef-bcs

Chris

On Wed, Sep 23, 2015 at 4:11 PM, Chu Ruilin <ruilin...@gmail.com> wrote:

> Hi, all
>
> I don't know which automation tool is best for deploying Ceph and I'd like
> to find out. I'm comfortable with Warewulf since I've been using it for
> HPC clusters. I find it quite convenient for Ceph too. I wrote a set of
> scripts that can deploy a Ceph cluster quickly. Here is how I did it just
> using virtualbox:
>
>
> http://ruilinchu.blogspot.com/2015/09/deploying-ceph-storage-cluster-using.html
>
> comments are welcome!
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Best Regards,
Chris Jones

cjo...@cloudm2.com
(p) 770.655.0770
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs: large files hang

2015-12-17 Thread Chris Dunlop
Hi Bryan,

Have you checked your MTUs? I was recently bitten by large packets not
getting through where small packets would. (This list, Dec 14, "All pgs
stuck peering".) Small files working but big files not working smells 
like it could be a similar problem.

Cheers,

Chris
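
A quick way to test this, as a rough sketch (the hostname is a placeholder;
adjust the payload size to your configured MTU):

  # 1472 = 1500-byte MTU minus 28 bytes of IP/ICMP headers;
  # 8972 = 9000-byte (jumbo) MTU minus the same 28 bytes
  ping -c 3 -M do -s 1472 osd-host
  ping -c 3 -M do -s 8972 osd-host
  # "-M do" forbids fragmentation, so the large ping fails outright
  # if any hop has a smaller MTU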

On Thu, Dec 17, 2015 at 07:43:54PM +, Bryan Wright wrote:
> Hi folks,
> 
> This is driving me crazy.  I have a ceph filesystem that behaves normally
> when I "ls" files, and behaves normally when I copy smallish files on or off
> of the filesystem, but large files (~ GB size) hang after copying a few
> megabytes.
> 
> This is ceph 0.94.5 under Centos 6.7 under kernel 4.3.3-1.el6.elrepo.x86_64.
>  I've tried 64-bit and 32-bit clients with several different kernels, but
> all behave the same.
> 
> After copying the first few bytes I get a stream of "slow request" messages
> for the osds, like this:
> 
> 2015-12-17 14:20:40.458306 osd.208 [WRN] slow request 1922.166564 seconds
> old, received at 2015-12-17 13:48:38.291683: osd_op(mds.0.14956:851
> 100010a7b92.000d [stat] 0.5d427a9a RETRY=5
> ack+retry+read+rwordered+known_if_redirected e193868) currently reached_pg
> 
> It's not a single OSD misbehaving.  It seems to be any OSD.   The OSDs have
> plenty of disk space, and there's nothing in the osd logs that points to a
> problem.
> 
> How can I find out what's blocking these requests?
> 
> Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All pgs stuck peering

2015-12-14 Thread Chris Dunlop
On Mon, Dec 14, 2015 at 09:29:20PM +0800, Jaze Lee wrote:
> Should we add a big-packet test to the heartbeat? Right now the heartbeat
> only tests small packets. If the MTU is mismatched, the heartbeat
> cannot detect that.

It would certainly have saved me a great deal of stress!

I imagine you wouldn't want it doing a big packet test every
heartbeat, perhaps every 10th or some configurable number.

Something for the developers to consider? (cc'ed)

Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] All pgs stuck peering

2015-12-13 Thread Chris Dunlop
Hi,

ceph 0.94.5

After restarting one of our three osd hosts to increase the RAM and change
from linux 3.18.21 to 4.1., the cluster is stuck with all pgs peering:

# ceph -s
cluster c6618970-0ce0-4cb2-bc9a-dd5f29b62e24
 health HEALTH_WARN
3072 pgs peering
3072 pgs stuck inactive
3072 pgs stuck unclean
1450 requests are blocked > 32 sec
noout flag(s) set
 monmap e9: 3 mons at 
{b2=10.200.63.130:6789/0,b4=10.200.63.132:6789/0,b5=10.200.63.133:6789/0}
election epoch 74462, quorum 0,1,2 b2,b4,b5
 osdmap e356963: 59 osds: 59 up, 59 in
flags noout
  pgmap v69385733: 3072 pgs, 3 pools, 11973 GB data, 3340 kobjects
31768 GB used, 102 TB / 133 TB avail
3072 peering

What can I do to diagnose (or better yet, fix!) this?

Downgrading back to 3.18.21 hasn't helped.

Each host (now) has 192G RAM. One has 17 osds, the other two have 21 osds
each.

I can see there's traffic going between the osd ports on the various osd
hosts, but all small packets (122 or 131 bytes).

Just prior to upgrading this osd host another one had also been upgraded
(RAM + linux). The cluster had no trouble at that point and was healthy
within a few minutes of that server starting up.

The cluster has been working fine for years up to now, having had rolling
upgrades since dumpling.

Cheers,

Chris
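
One check that might narrow this down is whether any large packets make it
between the osd hosts at all; a rough sketch (interface and hostname are
placeholders):

  # show only packets bigger than 1000 bytes to/from another osd host;
  # silence here while small packets keep flowing points at an MTU mismatch
  tcpdump -ni eth0 'host other-osd-host and greater 1000'
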
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All pgs stuck peering

2015-12-13 Thread Chris Dunlop
  "up_primary": 0
},
{
"first": 356959,
"last": 356963,
"maybe_went_rw": 1,
"up": [
6,
0
],
"acting": [
6,
0
],
"primary": 6,
"up_primary": 6
},
{
"first": 356964,
"last": 357025,
"maybe_went_rw": 1,
"up": [
0
],
"acting": [
0
],
"primary": 0,
"up_primary": 0
},
{
"first": 357026,
"last": 357026,
"maybe_went_rw": 0,
"up": [],
"acting": [],
"primary": -1,
"up_primary": -1
},
{
"first": 357027,
"last": 357041,
"maybe_went_rw": 1,
"up": [
0
],
"acting": [
0
],
"primary": 0,
    "up_primary": 0
},
{
"first": 357042,
"last": 357081,
"maybe_went_rw": 1,
"up": [
6,
0
],
"acting": [
6,
0
],
"primary": 6,
"up_primary": 6
},
{
"first": 357082,
"last": 357082,
"maybe_went_rw": 0,
"up": [
6
],
"acting": [
6
],
"primary": 6,
"up_primary": 6
},
{
"first": 357083,
"last": 357088,
"maybe_went_rw": 0,
"up": [
6,
0
],
"acting": [
6,
0
],
"primary": 6,
"up_primary": 6
},
{
"first": 357089,
"last": 357089,
"maybe_went_rw": 0,
"up": [
0
],
"acting": [
0
],
"primary": 0,
"up_primary": 0
},
{
"first": 357090,
"last": 357167,
"maybe_went_rw": 1,
"up": [
6,
0
],
"acting": [
6,
0
],
"primary": 6,
"up_primary": 6
},
{
"first": 357168,
"last": 357217,
"maybe_went_rw": 1,
"up": [
0
],
"acting": [
0
],
"primary": 0,
"up_primary": 0
}
],
"probing_osds": [
"0",
"6"
],
"down_osds_we_would_probe": [],
"peering_blocked_by": []
},
{
"name": "Started",
"enter_time": "2015-12-14 12:54:41.084717"
}
],
"agent_state": {}
}


Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

