Re: [ceph-users] Ceph memory overhead when used with KVM

2017-05-16 Thread Jason Dillaman
Sorry, I haven't had a chance to attempt to reproduce.

I do know that the librbd in-memory cache does not restrict incoming
IO to the cache size while in-flight. Therefore, if you are performing
4MB writes with a queue depth of 256, you might see up to 1GB of
memory allocated from the heap for handling the cache.

QEMU would also duplicate the IO memory for a bounce buffer
(eliminated in the latest version of QEMU and librbd) and librbd
copies the IO memory again to ensure ownership (known issue we would
like to solve) -- that would account for an additional 2GB of memory
allocations under this scenario.

These would just be a transient spike of heap usage while the IO is
in-flight, but since I'm pretty sure the default behavior of the glibc
allocator does not return slabs to the OS, I would expect high memory
overhead to remain for the life of the process.
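
For reference, the cache itself is bounded by the usual client-side options (a
minimal sketch below, assuming a stock /etc/ceph/ceph.conf; the values shown are
just the Jewel defaults), but as noted above the in-flight IO is not limited by
these settings, so lowering the guest-side queue depth is what actually caps the
transient spike:

cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
rbd cache = true
rbd cache size = 33554432        # 32 MiB cache (the default)
rbd cache max dirty = 25165824   # 24 MiB dirty limit (the default)
EOF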

Please feel free to open a tracker ticket here [1] and I can look into
it when I get some time.

[1] http://tracker.ceph.com/projects/rbd/issues

On Tue, May 16, 2017 at 2:52 AM, nick  wrote:
> Hi Jason,
> did you have some time to check if you can reproduce the high memory usage? I
> am not sure if I should create a bug report for this or if this is expected
> behaviour.
>
> Cheers
> Nick
>
> On Monday, May 08, 2017 08:55:55 AM you wrote:
>> Thanks. One more question: was the image a clone or a stand-alone image?
>>
>> On Fri, May 5, 2017 at 2:42 AM, nick  wrote:
>> > Hi,
>> > I used one of the fio example files and changed it a bit:
>> >
>> > """
>> > # This job file tries to mimic the Intel IOMeter File Server Access Pattern
>> > [global]
>> > description=Emulation of Intel IOmeter File Server Access Pattern
>> > randrepeat=0
>> > filename=/root/test.dat
>> > # IOMeter defines the server loads as the following:
>> > # iodepth=1 Linear
>> > # iodepth=4 Very Light
>> > # iodepth=8 Light
>> > # iodepth=64    Moderate
>> > # iodepth=256   Heavy
>> > iodepth=8
>> > size=80g
>> > direct=0
>> > ioengine=libaio
>> >
>> > [iometer]
>> > stonewall
>> > bs=4M
>> > rw=randrw
>> >
>> > [iometer_just_write]
>> > stonewall
>> > bs=4M
>> > rw=write
>> >
>> > [iometer_just_read]
>> > stonewall
>> > bs=4M
>> > rw=read
>> > """
>> >
>> > Then let it run:
>> > $> while true; do fio stress.fio; rm /root/test.dat; done
>> >
>> > I had this running over a weekend.
>> >
>> > Cheers
>> > Sebastian
>> >
>> > On Tuesday, May 02, 2017 02:51:06 PM Jason Dillaman wrote:
>> >> Can you share the fio job file that you utilized so I can attempt to
>> >> repeat locally?
>> >>
>> >> On Tue, May 2, 2017 at 2:51 AM, nick  wrote:
>> >> > Hi Jason,
>> >> > thanks for your feedback. I did now some tests over the weekend to
>> >> > verify
>> >> > the memory overhead.
>> >> > I was using qemu 2.8 (taken from the Ubuntu Cloud Archive) with librbd
>> >> > 10.2.7 on Ubuntu 16.04 hosts. I suspected the ceph rbd cache to be the
>> >> > cause of the overhead so I just generated a lot of IO with the help of
>> >> > fio in the VMs (with a datasize of 80GB) . All VMs had 3GB of memory. I
>> >> > had to run fio multiple times, before reaching high RSS values.
>> >> > I also noticed that when using larger blocksizes during writes (like
>> >> > 4M)
>> >> > the memory overhead in the KVM process increased faster.
>> >> > I ran several fio tests (one after another) and the results are:
>> >> >
>> >> > KVM with writeback RBD cache: max. 85% memory overhead (2.5 GB
>> >> > overhead)
>> >> > KVM with writethrough RBD cache: max. 50% memory overhead
>> >> > KVM without RBD caching: less than 10% overhead all the time
>> >> > KVM with local storage (logical volume used): 8% overhead all the time
>> >> >
>> >> > I did not reach those >200% memory overhead results that we see on our
>> >> > live
>> >> > cluster, but those virtual machines have a way longer uptime as well.
>> >> >
>> >> > I also tried to reduce the RSS memory value with cache dropping on the
>> >> > physical host and in the VM. Both did not lead to any change. A reboot
>> >> > of
>> >> > the VM also does not change anything (reboot in the VM, not a new KVM
>> >> > process). The only way to reduce the RSS memory value is a live
>> >> > migration
>> >> > so far. Might this be a bug? The memory overhead sounds a bit too much
>> >> > for me.
>> >> >
>> >> > Best Regards
>> >> > Sebastian
>> >> >
>> >> > On Thursday, April 27, 2017 10:08:36 AM you wrote:
>> >> >> I know we noticed high memory usage due to librados in the Ceph
>> >> >> multipathd checker [1] -- the order of hundreds of megabytes. That
>> >> >> client was probably nearly as trivial as an application can get and I
>> >> >> just assumed it was due to large monitor maps being sent to the client
>> >> >> for whatever reason. Since we changed course on our RBD iSCSI
>> >> >> implementation, unfortunately the investigation into this high memory
>> >> >> usage fell by the wayside.
>> >> >>
>> >> >> [1]
>> >> >> http://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=blob;f

Re: [ceph-users] Ceph memory overhead when used with KVM

2017-05-16 Thread nick
Thanks for the explanation. I will create a ticket on the tracker then.

Cheers
Nick

On Tuesday, May 16, 2017 08:16:33 AM Jason Dillaman wrote:
> Sorry, I haven't had a chance to attempt to reproduce.
> 
> I do know that the librbd in-memory cache does not restrict incoming
> IO to the cache size while in-flight. Therefore, if you are performing
> 4MB writes with a queue depth of 256, you might see up to 1GB of
> memory allocated from the heap for handling the cache.
> 
> QEMU would also duplicate the IO memory for a bounce buffer
> (eliminated in the latest version of QEMU and librbd) and librbd
> copies the IO memory again to ensure ownership (known issue we would
> like to solve) -- that would account for an additional 2GB of memory
> allocations under this scenario.
> 
> These would just be a transient spike of heap usage while the IO is
> in-flight, but since I'm pretty sure the default behavior of the glibc
> allocator does not return slabs to the OS, I would expect high memory
> overhead to remain for the life of the process.
> 
> Please feel free to open a tracker ticket here [1] and I can look into
> it when I get some time.
> 
> [1] http://tracker.ceph.com/projects/rbd/issues
 
-- 
Sebastian Nickel
Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch



Re: [ceph-users] Cephalocon Cancelled

2017-05-16 Thread Lars Marowsky-Bree
On 2017-05-15T15:21:47, Danny Al-Gaaf  wrote:

> What about moving the event to the next OpenStack Summit in Sydney, let's
> say directly following the Summit. I'm not sure how many people from the Ceph
> community already plan to go to Sydney, but this way the 10-20 h flight
> (depending on your location) would probably make much more sense than it
> would for 3 days only.

I'd not be opposed to this, but I know that the travel budget to
Australia would severely impact the number of people a number of
organizations would be able to send. Alas.


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



[ceph-users] sortbitwise warning broken on Ceph Jewel?

2017-05-16 Thread Fabian Grünbichler
The Kraken release notes[1] contain the following note about the
sortbitwise flag and upgrading from <= Jewel to > Jewel:

The sortbitwise flag must be set on the Jewel cluster before upgrading
to Kraken. The latest Jewel (10.2.4+) releases issue a health warning if
the flag is not set, so this is probably already set. If it is not,
Kraken OSDs will refuse to start and will print an error message in
their log.

I think this refers to the warning introduced by d3dbd8581 [2], which
is triggered if
- a mon config key is set to true (default, not there in master anymore)
- the sortbitwise flag is not set (default for clusters upgrading from
  hammer, not default for new jewel clusters)
- the OSDs support sortbitwise (I assume this is the default for Jewel
  OSDs? I am not sure how to get this information from a running OSD?)

I have not been able to trigger this warning, either for an upgraded
Hammer cluster (all nodes upgraded from latest Hammer to latest Jewel
and rebooted) which does not have sortbitwise set, or for a freshly
installed Jewel cluster where I manually unset sortbitwise and rebooted
afterwards. Am I doing something wrong, or is the check somehow broken?
If the latter is the case, the release notes are very misleading (as
users will probably rely on "no health warning -> safe to upgrade").
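
For reference, this is roughly how I am checking the flag and looking for the
warning (a sketch; the exact output format may differ between releases):

ceph osd dump | grep ^flags    # 'sortbitwise' should appear in the flags line once set
ceph health detail             # where I would expect the warning on 10.2.4+ while it is unset
ceph osd set sortbitwise       # what the release notes ask for before the Kraken upgrade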

I also see one follow-up fix[3] which was only included in Kraken so
far, but AFAICT this should only possibly affect the second test with a
manually unset sortbitwise on Jewel, and not the Hammer -> Jewel ->
Kraken/Luminous upgrade path.

1: http://docs.ceph.com/docs/master/release-notes/#upgrading-from-jewel
2: https://github.com/ceph/ceph/commit/d3dbd8581bd39572dc55d4953b5d8c49255426d7
3: https://github.com/ceph/ceph/pull/12682



Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Jason Dillaman
On Tue, May 16, 2017 at 2:12 AM, Stefan Priebe - Profihost AG
 wrote:
> 3.) it still happens on pre jewel images even when they got restarted /
> killed and reinitialized. In that case they've the asok socket available
> for now. Should i issue any command to the socket to get log out of the
> hanging vm? Qemu is still responding just ceph / disk i/O gets stalled.

The best option would be to run "gcore" against the running VM whose
IO is stuck, compress the dump, and use the "ceph-post-file" to
provide the dump. I could then look at all the Ceph data structures to
hopefully find the issue.
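
Something along these lines should do it (just a sketch; the process-matching
pattern and the paths are assumptions you would adjust):

pid=$(pgrep -f 'qemu-system.*<vm-name>')   # find the stuck VM's QEMU process
gcore -o /tmp/vm-core "$pid"               # writes /tmp/vm-core.<pid>
gzip /tmp/vm-core."$pid"
ceph-post-file /tmp/vm-core."$pid".gz      # prints a tag you can share back here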

Enabling debug logs after the IO has stuck will most likely be of
little value since it won't include the details of which IOs are
outstanding. You could attempt to use "ceph --admin-daemon
/path/to/stuck/vm/asok objecter_requests" to see if any IOs are just
stuck waiting on an OSD to respond.

-- 
Jason


Re: [ceph-users] Odd cyclical cluster performance

2017-05-16 Thread Patrick Dinnen
Hi Greg,

It's definitely not scrub or deep-scrub, as those are switched off for
testing. Anything else you'd look at as a possible culprit here?
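
For reference, this is how I am double-checking that scrubbing really is off (a
quick sketch, nothing fancy):

ceph osd dump | grep ^flags    # expecting noscrub,nodeep-scrub in the flags line
ceph -s | grep -i scrub        # no PGs ever show up as scrubbing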

Thanks, Patrick

On Mon, May 15, 2017 at 5:51 PM, Gregory Farnum  wrote:
> Did you try correlating it with PG scrubbing or other maintenance behaviors?
> -Greg
>
> On Thu, May 11, 2017 at 12:47 PM, Patrick Dinnen  wrote:
>> Seeing some odd behaviour while testing using rados bench. This is on
>> a pre-split pool, two node cluster with 12 OSDs total.
>>
>> ceph osd pool create newerpoolofhopes 2048 2048 replicated ""
>> replicated_ruleset 5
>>
>> rados -p newerpoolofhopes bench -t 32 -b 2 3000 write --no-cleanup
>>
>> Using Prometheus/Grafana to watch what's going on, we see oddly
>> regular peaks and dips in writer performance. The frequency changes
>> gradually but it's on the order of hours (not the seconds that might
>> seem easier to explain by system phenomena). It starts off at roughly
>> one cycle per hour and we've seen it for multiple days of constant
>> bench running with nothing else happening on the cluster.
>>
>> A bunch of graphs showing the pattern:
>>
>> https://ibb.co/djXUVk
>> https://ibb.co/gMNk35
>> https://ibb.co/iKViqk
>> https://ibb.co/jOXJO5
>> https://ibb.co/isUMbQ
>>
>> sdg and sdi are SSD journal disks. The activity on the OSDs and SSDs
>> seems anti-correlated. SSDs peak in activity as OSDs reach the bottom
>> of the trough. Then the reverse. Repeat.
>>
>> Does anyone have any suggestions as to what could possibly be causing
>> a regular pattern like this at such a low frequency?
>>
>> Thanks, Patrick Dinnen


[ceph-users] S3 API with Keystone auth

2017-05-16 Thread Mārtiņš Jakubovičs

Hello all,

Just entered the object storage world and set up a working cluster with
RadosGW and authentication via OpenStack Keystone. The Swift API works
great, but how do I test the S3 API? I found a way to test with python
boto, but it looks like I am missing the aws_access_key_id. How do I get it?
Or should it be like "Project Name:Username"?
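
For context, the route I am guessing at looks roughly like this (a sketch only;
I have not confirmed that this is the right way with a Keystone-backed RGW):

openstack ec2 credentials create   # should return an 'access' and 'secret' pair
# then point boto at the RGW endpoint with that access/secret pair, assuming the
# gateway is configured with 'rgw s3 auth use keystone = true'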


CEPH 10.2.7

Thanks and best regards,

Martins



[ceph-users] Failed to start Ceph disk activation: /dev/dm-18

2017-05-16 Thread Kevin Olbrich
HI!

Currently I am deploying a small cluster with two nodes. I installed ceph
jewel on all nodes and made a basic deployment.
After "ceph osd create..." I am now getting "Failed to start Ceph disk
activation: /dev/dm-18" on boot. All 28 OSDs were never active.
This server has a 14 disk JBOD with 4x fiber using multipath (4x active
multibus). We have two servers.

OS: Latest CentOS 7

[root@osd01 ~]# ceph -v
> ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)


Command run:

> ceph-deploy osd create
> osd01.example.local:/dev/mapper/mpatha:/dev/disk/by-partlabel/journal01


There is no error in journalctl, just that the unit failed:

> May 16 16:47:33 osd01.example.local systemd[1]: Failed to start Ceph disk
> activation: /dev/dm-27.
> May 16 16:47:33 osd01.example.local systemd[1]: 
> ceph-disk@dev-dm\x2d27.service:
> main process exited, code=exited, status=124/n/a
> May 16 16:47:33 osd01.example.local systemd[1]: ceph-disk@dev-dm\x2d24.service
> failed.
> May 16 16:47:33 osd01.example.local systemd[1]: Unit 
> ceph-disk@dev-dm\x2d24.service
> entered failed state.


[root@osd01 ~]# gdisk -l /dev/mapper/mpatha
> GPT fdisk (gdisk) version 0.8.6
> Partition table scan:
>   MBR: protective
>   BSD: not present
>   APM: not present
>   GPT: present
> Found valid GPT with protective MBR; using GPT.
> Disk /dev/mapper/mpatha: 976642095 sectors, 465.7 GiB
> Logical sector size: 512 bytes
> Disk identifier (GUID): DEF0B782-3B7F-4AF5-A0CB-9E2B96C40B13
> Partition table holds up to 128 entries
> First usable sector is 34, last usable sector is 976642061
> Partitions will be aligned on 2048-sector boundaries
> Total free space is 2014 sectors (1007.0 KiB)
> Number  Start (sector)    End (sector)   Size        Code  Name
>      1            2048       976642061    465.7 GiB         ceph data


I had problems with multipath in the past when running ceph but this time I
was unable to solve the problem.
Any ideas?

Kind regards,
Kevin.


[ceph-users] Hammer to Jewel upgrade questions

2017-05-16 Thread Shain Miley

Hello,

I am going to be upgrading our production Ceph cluster from 
Hammer/Ubuntu 14.04 to Jewel/Ubuntu 16.04 and I wanted to ask a question 
and sanity check my upgrade plan.


Here are the steps I am planning to take during the upgrade:

1)Upgrade to latest hammer on current cluster
2)Remove or rename the existing ‘ceph’ user and ‘ceph’ group on each node
3)Upgrade the ceph packages to latest Jewel (mon, then osd, then rbd 
clients)

4)stop ceph daemons
5)change permissions on ceph directories and osd journals:

find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -type d|parallel chown -R 
64045:64045

chown 64045:64045 /var/lib/ceph
chown 64045:64045 /var/lib/ceph/*
chown 64045:64045 /var/lib/ceph/bootstrap-*/*

for ID in $(ls /var/lib/ceph/osd/|cut -d '-' -f 2); do
    JOURNAL=$(readlink -f /var/lib/ceph/osd/ceph-${ID}/journal)
    chown ceph ${JOURNAL}
done

6)restart ceph daemons

The two questions I have are:
1)Am I missing anything from the steps above...based on prior 
experiences performing upgrades of this kind?


2)Should I upgrade to Ubuntu 16.04 first and then upgrade Ceph...or vice 
versa?


Thanks in advance,
Shain

--
NPR | Shain Miley | Manager of Infrastructure, Digital Media | smi...@npr.org | 
202.513.3649



Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Stefan Priebe - Profihost AG
Hello Jason,

I'm happy to tell you that I currently have one VM where I can reproduce
the problem.

> The best option would be to run "gcore" against the running VM whose
> IO is stuck, compress the dump, and use the "ceph-post-file" to
> provide the dump. I could then look at all the Ceph data structures to
> hopefully find the issue.

I've saved the dump, but it will contain sensitive information, so I won't
upload it to a public server. I'll send you a private email with a
private server to download the core dump. Thanks!

> Enabling debug logs after the IO has stuck will most likely be of
> little value since it won't include the details of which IOs are
> outstanding. You could attempt to use "ceph --admin-daemon
> /path/to/stuck/vm/asok objecter_requests" to see if any IOs are just
> stuck waiting on an OSD to respond.

This is the output:
# ceph --admin-daemon
/var/run/ceph/ceph-client.admin.5295.140214539927552.asok objecter_requests
{
"ops": [
{
"tid": 384632,
"pg": "5.bd9616ad",
"osd": 46,
"object_id": "rbd_data.e10ca56b8b4567.311c",
"object_locator": "@5",
"target_object_id": "rbd_data.e10ca56b8b4567.311c",
"target_object_locator": "@5",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "2.28554e+06s",
"attempts": 1,
"snapid": "head",
"snap_context": "a07c2=[]",
"mtime": "2017-05-16 21:03:22.0.196102s",
"osd_ops": [
"delete"
]
}
],
"linger_ops": [
{
"linger_id": 1,
"pg": "5.5f3bd635",
"osd": 17,
"object_id": "rbd_header.e10ca56b8b4567",
"object_locator": "@5",
"target_object_id": "rbd_header.e10ca56b8b4567",
"target_object_locator": "@5",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
}
],
"pool_ops": [],
"pool_stat_ops": [],
"statfs_ops": [],
"command_ops": []
}

Greets,
Stefan

Am 16.05.2017 um 15:44 schrieb Jason Dillaman:
> On Tue, May 16, 2017 at 2:12 AM, Stefan Priebe - Profihost AG
>  wrote:
>> 3.) it still happens on pre jewel images even when they got restarted /
>> killed and reinitialized. In that case they've the asok socket available
>> for now. Should i issue any command to the socket to get log out of the
>> hanging vm? Qemu is still responding just ceph / disk i/O gets stalled.
> 
> The best option would be to run "gcore" against the running VM whose
> IO is stuck, compress the dump, and use the "ceph-post-file" to
> provide the dump. I could then look at all the Ceph data structures to
> hopefully find the issue.
> 
> Enabling debug logs after the IO has stuck will most likely be of
> little value since it won't include the details of which IOs are
> outstanding. You could attempt to use "ceph --admin-daemon
> /path/to/stuck/vm/asok objecter_requests" to see if any IOs are just
> stuck waiting on an OSD to respond.
> 


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Jason Dillaman
Thanks for the update. In the ops dump provided, the objecter is
saying that OSD 46 hasn't responded to the deletion request of object
rbd_data.e10ca56b8b4567.311c.

Perhaps run "ceph daemon osd.46 dump_ops_in_flight" or "...
dump_historic_ops" to see if that op is in the list? You can also run
"ceph osd map  rbd_data.e10ca56b8b4567.311c" to
verify that OSD 46 is the primary OSD for that object's PG.
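
In other words, something like this (the pool name is a placeholder to fill in,
and the ops dumps need to run on the host carrying osd.46):

ceph daemon osd.46 dump_ops_in_flight
ceph daemon osd.46 dump_historic_ops
ceph osd map <pool> rbd_data.e10ca56b8b4567.311c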

On Tue, May 16, 2017 at 3:14 PM, Stefan Priebe - Profihost AG
 wrote:
> Hello Jason,
>
> i'm happy to tell you that i've currently one VM where i can reproduce
> the problem.
>
>> The best option would be to run "gcore" against the running VM whose
>> IO is stuck, compress the dump, and use the "ceph-post-file" to
>> provide the dump. I could then look at all the Ceph data structures to
>> hopefully find the issue.
>
> I've saved the dump but it will contain sensitive informations. I won't
> upload it to a public server. I'll send you an private email with a
> private server to download the core dump. Thanks!
>
>> Enabling debug logs after the IO has stuck will most likely be of
>> little value since it won't include the details of which IOs are
>> outstanding. You could attempt to use "ceph --admin-daemon
>> /path/to/stuck/vm/asok objecter_requests" to see if any IOs are just
>> stuck waiting on an OSD to respond.
>
> This is the output:
> # ceph --admin-daemon
> /var/run/ceph/ceph-client.admin.5295.140214539927552.asok objecter_requests
> {
> "ops": [
> {
> "tid": 384632,
> "pg": "5.bd9616ad",
> "osd": 46,
> "object_id": "rbd_data.e10ca56b8b4567.311c",
> "object_locator": "@5",
> "target_object_id": "rbd_data.e10ca56b8b4567.311c",
> "target_object_locator": "@5",
> "paused": 0,
> "used_replica": 0,
> "precalc_pgid": 0,
> "last_sent": "2.28554e+06s",
> "attempts": 1,
> "snapid": "head",
> "snap_context": "a07c2=[]",
> "mtime": "2017-05-16 21:03:22.0.196102s",
> "osd_ops": [
> "delete"
> ]
> }
> ],
> "linger_ops": [
> {
> "linger_id": 1,
> "pg": "5.5f3bd635",
> "osd": 17,
> "object_id": "rbd_header.e10ca56b8b4567",
> "object_locator": "@5",
> "target_object_id": "rbd_header.e10ca56b8b4567",
> "target_object_locator": "@5",
> "paused": 0,
> "used_replica": 0,
> "precalc_pgid": 0,
> "snapid": "head",
> "registered": "1"
> }
> ],
> "pool_ops": [],
> "pool_stat_ops": [],
> "statfs_ops": [],
> "command_ops": []
> }
>
> Greets,
> Stefan
>
> Am 16.05.2017 um 15:44 schrieb Jason Dillaman:
>> On Tue, May 16, 2017 at 2:12 AM, Stefan Priebe - Profihost AG
>>  wrote:
>>> 3.) it still happens on pre jewel images even when they got restarted /
>>> killed and reinitialized. In that case they've the asok socket available
>>> for now. Should i issue any command to the socket to get log out of the
>>> hanging vm? Qemu is still responding just ceph / disk i/O gets stalled.
>>
>> The best option would be to run "gcore" against the running VM whose
>> IO is stuck, compress the dump, and use the "ceph-post-file" to
>> provide the dump. I could then look at all the Ceph data structures to
>> hopefully find the issue.
>>
>> Enabling debug logs after the IO has stuck will most likely be of
>> little value since it won't include the details of which IOs are
>> outstanding. You could attempt to use "ceph --admin-daemon
>> /path/to/stuck/vm/asok objecter_requests" to see if any IOs are just
>> stuck waiting on an OSD to respond.
>>



-- 
Jason


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Stefan Priebe - Profihost AG
Hello Jason,

Am 16.05.2017 um 21:32 schrieb Jason Dillaman:
> Thanks for the update. In the ops dump provided, the objecter is
> saying that OSD 46 hasn't responded to the deletion request of object
> rbd_data.e10ca56b8b4567.311c.
> 
> Perhaps run "ceph daemon osd.46 dump_ops_in_flight" or "...
> dump_historic_ops" to see if that op is in the list?

We've enabled the op tracker for performance reasons while using SSD
only storage ;-(

Can I enable the op tracker using ceph osd tell, then reproduce the
problem and check what got stuck again? Or should I generate an rbd log
from the client?

> You can also run
> "ceph osd map  rbd_data.e10ca56b8b4567.311c" to
> verify that OSD 46 is the primary PG for that object.

Yes it is:
osdmap e886758 pool 'cephstor1' (5) object
'rbd_data.e10ca56b8b4567.311c' -> pg 5.bd9616ad (5.6ad) ->
up ([46,29,30], p46) acting ([46,29,30], p46)

Greets,
Stefan

> On Tue, May 16, 2017 at 3:14 PM, Stefan Priebe - Profihost AG
>  wrote:
>> Hello Jason,
>>
>> i'm happy to tell you that i've currently one VM where i can reproduce
>> the problem.
>>
>>> The best option would be to run "gcore" against the running VM whose
>>> IO is stuck, compress the dump, and use the "ceph-post-file" to
>>> provide the dump. I could then look at all the Ceph data structures to
>>> hopefully find the issue.
>>
>> I've saved the dump but it will contain sensitive informations. I won't
>> upload it to a public server. I'll send you an private email with a
>> private server to download the core dump. Thanks!
>>
>>> Enabling debug logs after the IO has stuck will most likely be of
>>> little value since it won't include the details of which IOs are
>>> outstanding. You could attempt to use "ceph --admin-daemon
>>> /path/to/stuck/vm/asok objecter_requests" to see if any IOs are just
>>> stuck waiting on an OSD to respond.
>>
>> This is the output:
>> # ceph --admin-daemon
>> /var/run/ceph/ceph-client.admin.5295.140214539927552.asok objecter_requests
>> {
>> "ops": [
>> {
>> "tid": 384632,
>> "pg": "5.bd9616ad",
>> "osd": 46,
>> "object_id": "rbd_data.e10ca56b8b4567.311c",
>> "object_locator": "@5",
>> "target_object_id": "rbd_data.e10ca56b8b4567.311c",
>> "target_object_locator": "@5",
>> "paused": 0,
>> "used_replica": 0,
>> "precalc_pgid": 0,
>> "last_sent": "2.28554e+06s",
>> "attempts": 1,
>> "snapid": "head",
>> "snap_context": "a07c2=[]",
>> "mtime": "2017-05-16 21:03:22.0.196102s",
>> "osd_ops": [
>> "delete"
>> ]
>> }
>> ],
>> "linger_ops": [
>> {
>> "linger_id": 1,
>> "pg": "5.5f3bd635",
>> "osd": 17,
>> "object_id": "rbd_header.e10ca56b8b4567",
>> "object_locator": "@5",
>> "target_object_id": "rbd_header.e10ca56b8b4567",
>> "target_object_locator": "@5",
>> "paused": 0,
>> "used_replica": 0,
>> "precalc_pgid": 0,
>> "snapid": "head",
>> "registered": "1"
>> }
>> ],
>> "pool_ops": [],
>> "pool_stat_ops": [],
>> "statfs_ops": [],
>> "command_ops": []
>> }
>>
>> Greets,
>> Stefan
>>
>> Am 16.05.2017 um 15:44 schrieb Jason Dillaman:
>>> On Tue, May 16, 2017 at 2:12 AM, Stefan Priebe - Profihost AG
>>>  wrote:
 3.) it still happens on pre jewel images even when they got restarted /
 killed and reinitialized. In that case they've the asok socket available
 for now. Should i issue any command to the socket to get log out of the
 hanging vm? Qemu is still responding just ceph / disk i/O gets stalled.
>>>
>>> The best option would be to run "gcore" against the running VM whose
>>> IO is stuck, compress the dump, and use the "ceph-post-file" to
>>> provide the dump. I could then look at all the Ceph data structures to
>>> hopefully find the issue.
>>>
>>> Enabling debug logs after the IO has stuck will most likely be of
>>> little value since it won't include the details of which IOs are
>>> outstanding. You could attempt to use "ceph --admin-daemon
>>> /path/to/stuck/vm/asok objecter_requests" to see if any IOs are just
>>> stuck waiting on an OSD to respond.
>>>
> 
> 
> 


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Jason Dillaman
On Tue, May 16, 2017 at 3:37 PM, Stefan Priebe - Profihost AG
 wrote:
> We've enabled the op tracker for performance reasons while using SSD
> only storage ;-(

Disabled you mean?

> Can enable the op tracker using ceph osd tell? Than reproduce the
> problem. Check what has stucked again? Or should i generate an rbd log
> from the client?

From a super-quick glance at the code, it looks like that isn't a
dynamic setting. Of course, it's possible that if you restart OSD 46
to enable the op tracker, the stuck op will clear itself and the VM
will resume. You could attempt to generate a gcore of OSD 46 to see if
information on that op could be extracted via the debugger, but no
guarantees.

You might want to verify that the stuck client and OSD 46 have an
actual established TCP connection as well before doing any further
actions.
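
A quick way to check that (a sketch; substitute whatever address 'ceph osd find'
reports):

ceph osd find 46                   # shows osd.46's ip:port
ss -tnp | grep '<osd46-ip:port>'   # run on the hypervisor; the qemu process should own an ESTABLISHED socket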

-- 
Jason


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Stefan Priebe - Profihost AG
Am 16.05.2017 um 21:45 schrieb Jason Dillaman:
> On Tue, May 16, 2017 at 3:37 PM, Stefan Priebe - Profihost AG
>  wrote:
>> We've enabled the op tracker for performance reasons while using SSD
>> only storage ;-(
> 
> Disabled you mean?
Sorry yes.

>> Can enable the op tracker using ceph osd tell? Than reproduce the
>> problem. Check what has stucked again? Or should i generate an rbd log
>> from the client?
> 
> From a super-quick glance at the code, it looks like that isn't a
> dynamic setting. Of course, it's possible that if you restart OSD 46
> to enable the op tracker, the stuck op will clear itself and the VM
> will resume.
Yes, I already tested this some time ago. This will resume all I/O.

> You could attempt to generate a gcore of OSD 46 to see if
> information on that op could be extracted via the debugger, but no
> guarantees.

Sorry, no idea how to do that.

> You might want to verify that the stuck client and OSD 46 have an
> actual established TCP connection as well before doing any further
> actions.

I can check that when I reproduce the issue.

Greets,
Stefan




Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Stefan Priebe - Profihost AG
Hello,

while reproducing the problem, objecter_requests looks like this:

{
"ops": [
{
"tid": 42029,
"pg": "5.bd9616ad",
"osd": 46,
"object_id": "rbd_data.e10ca56b8b4567.311c",
"object_locator": "@5",
"target_object_id": "rbd_data.e10ca56b8b4567.311c",
"target_object_locator": "@5",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "2.28854e+06s",
"attempts": 1,
"snapid": "head",
"snap_context": "a07c2=[]",
"mtime": "2017-05-16 21:53:22.0.069541s",
"osd_ops": [
"delete"
]
}
],
"linger_ops": [
{
"linger_id": 1,
"pg": "5.5f3bd635",
"osd": 17,
"object_id": "rbd_header.e10ca56b8b4567",
"object_locator": "@5",
"target_object_id": "rbd_header.e10ca56b8b4567",
"target_object_locator": "@5",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
}
],
"pool_ops": [],
"pool_stat_ops": [],
"statfs_ops": [],
"command_ops": []
}

Yes, they have an established TCP connection (Qemu <=> osd.46). Attached is
a pcap file of the traffic between them when it got stuck.

Greets,
Stefan

Am 16.05.2017 um 21:45 schrieb Jason Dillaman:
> On Tue, May 16, 2017 at 3:37 PM, Stefan Priebe - Profihost AG
>  wrote:
>> We've enabled the op tracker for performance reasons while using SSD
>> only storage ;-(
> 
> Disabled you mean?
> 
>> Can enable the op tracker using ceph osd tell? Than reproduce the
>> problem. Check what has stucked again? Or should i generate an rbd log
>> from the client?
> 
> From a super-quick glance at the code, it looks like that isn't a
> dynamic setting. Of course, it's possible that if you restart OSD 46
> to enable the op tracker, the stuck op will clear itself and the VM
> will resume. You could attempt to generate a gcore of OSD 46 to see if
> information on that op could be extracted via the debugger, but no
> guarantees.
> 
> You might want to verify that the stuck client and OSD 46 have an
> actual established TCP connection as well before doing any further
> actions.
> 


osd.46_qemu_2.pcap.gz
Description: GNU Zip compressed data


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Jason Dillaman
It looks like it's just a ping message in that capture.

Are you saying that you restarted OSD 46 and the problem persisted?

On Tue, May 16, 2017 at 4:02 PM, Stefan Priebe - Profihost AG
 wrote:
> Hello,
>
> while reproducing the problem, objecter_requests looks like this:
>
> {
> "ops": [
> {
> "tid": 42029,
> "pg": "5.bd9616ad",
> "osd": 46,
> "object_id": "rbd_data.e10ca56b8b4567.311c",
> "object_locator": "@5",
> "target_object_id": "rbd_data.e10ca56b8b4567.311c",
> "target_object_locator": "@5",
> "paused": 0,
> "used_replica": 0,
> "precalc_pgid": 0,
> "last_sent": "2.28854e+06s",
> "attempts": 1,
> "snapid": "head",
> "snap_context": "a07c2=[]",
> "mtime": "2017-05-16 21:53:22.0.069541s",
> "osd_ops": [
> "delete"
> ]
> }
> ],
> "linger_ops": [
> {
> "linger_id": 1,
> "pg": "5.5f3bd635",
> "osd": 17,
> "object_id": "rbd_header.e10ca56b8b4567",
> "object_locator": "@5",
> "target_object_id": "rbd_header.e10ca56b8b4567",
> "target_object_locator": "@5",
> "paused": 0,
> "used_replica": 0,
> "precalc_pgid": 0,
> "snapid": "head",
> "registered": "1"
> }
> ],
> "pool_ops": [],
> "pool_stat_ops": [],
> "statfs_ops": [],
> "command_ops": []
> }
>
> Yes they've an established TCP connection. Qemu <=> osd.46. Attached is
> a pcap file of the traffic between them when it got stuck.
>
> Greets,
> Stefan
>
> Am 16.05.2017 um 21:45 schrieb Jason Dillaman:
>> On Tue, May 16, 2017 at 3:37 PM, Stefan Priebe - Profihost AG
>>  wrote:
>>> We've enabled the op tracker for performance reasons while using SSD
>>> only storage ;-(
>>
>> Disabled you mean?
>>
>>> Can enable the op tracker using ceph osd tell? Than reproduce the
>>> problem. Check what has stucked again? Or should i generate an rbd log
>>> from the client?
>>
>> From a super-quick glance at the code, it looks like that isn't a
>> dynamic setting. Of course, it's possible that if you restart OSD 46
>> to enable the op tracker, the stuck op will clear itself and the VM
>> will resume. You could attempt to generate a gcore of OSD 46 to see if
>> information on that op could be extracted via the debugger, but no
>> guarantees.
>>
>> You might want to verify that the stuck client and OSD 46 have an
>> actual established TCP connection as well before doing any further
>> actions.
>>



-- 
Jason


Re: [ceph-users] Failed to start Ceph disk activation: /dev/dm-18

2017-05-16 Thread Kevin Olbrich
Hi,

it seems I found the cause: the disk array was used for ZFS before and
was not wiped.
I zapped the disks with sgdisk and via ceph, but a "zfs_member" signature was
still somewhere on the disk.
Wiping the disk (wipefs -a -f /dev/mapper/mpatha), running "ceph osd create
--zap-disk" twice until an entry showed up in "df", and a reboot fixed it.

Then the OSDs were failing again. The cause: IPv6 DAD on the bond interface,
which I disabled via sysctl.
After a reboot, voila, the cluster came online immediately.
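
In case it helps anyone else, roughly the sequence that worked for me (a
sketch; the device and the bond interface name are from my setup and are only
examples):

wipefs -a -f /dev/mapper/mpatha              # clears the leftover zfs_member signature
# then re-create the OSD with the zap option as described above, and:
sysctl -w net.ipv6.conf.bond0.accept_dad=0   # stop IPv6 DAD on the bond from delaying activation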

Kind regards,
Kevin.

2017-05-16 16:59 GMT+02:00 Kevin Olbrich :

> HI!
>
> Currently I am deploying a small cluster with two nodes. I installed ceph
> jewel on all nodes and made a basic deployment.
> After "ceph osd create..." I am now getting "Failed to start Ceph disk
> activation: /dev/dm-18" on boot. All 28 OSDs were never active.
> This server has a 14 disk JBOD with 4x fiber using multipath (4x active
> multibus). We have two servers.
>
> OS: Latest CentOS 7
>
> [root@osd01 ~]# ceph -v
>> ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>
>
> Command run:
>
>> ceph-deploy osd create osd01.example.local:/dev/mapper/mpatha:/dev/disk/by-partlabel/journal01
>
>
> There is no error in journalctl, just that the unit failed:
>
>> May 16 16:47:33 osd01.example.local systemd[1]: Failed to start Ceph disk
>> activation: /dev/dm-27.
>> May 16 16:47:33 osd01.example.local systemd[1]: ceph-disk@dev-dm\x2d27.service:
>> main process exited, code=exited, status=124/n/a
>> May 16 16:47:33 osd01.example.local systemd[1]: 
>> ceph-disk@dev-dm\x2d24.service
>> failed.
>> May 16 16:47:33 osd01.example.local systemd[1]: Unit 
>> ceph-disk@dev-dm\x2d24.service
>> entered failed state.
>
>
> [root@osd01 ~]# gdisk -l /dev/mapper/mpatha
>> GPT fdisk (gdisk) version 0.8.6
>> Partition table scan:
>>   MBR: protective
>>   BSD: not present
>>   APM: not present
>>   GPT: present
>> Found valid GPT with protective MBR; using GPT.
>> Disk /dev/mapper/mpatha: 976642095 sectors, 465.7 GiB
>> Logical sector size: 512 bytes
>> Disk identifier (GUID): DEF0B782-3B7F-4AF5-A0CB-9E2B96C40B13
>> Partition table holds up to 128 entries
>> First usable sector is 34, last usable sector is 976642061
>> Partitions will be aligned on 2048-sector boundaries
>> Total free space is 2014 sectors (1007.0 KiB)
>> Number  Start (sector)    End (sector)   Size        Code  Name
>>      1            2048       976642061    465.7 GiB         ceph data
>
>
> I had problems with multipath in the past when running ceph but this time
> I was unable to solve the problem.
> Any ideas?
>
> Kind regards,
> Kevin.
>


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Stefan Priebe - Profihost AG
No, I did not. I don't want to restart it, or I might not be able to
reproduce the problem any longer.

Stefan

Excuse my typo sent from my mobile phone.

> Am 16.05.2017 um 22:54 schrieb Jason Dillaman :
> 
> It looks like it's just a ping message in that capture.
> 
> Are you saying that you restarted OSD 46 and the problem persisted?
> 
> On Tue, May 16, 2017 at 4:02 PM, Stefan Priebe - Profihost AG
>  wrote:
>> Hello,
>> 
>> while reproducing the problem, objecter_requests looks like this:
>> 
>> {
>>"ops": [
>>{
>>"tid": 42029,
>>"pg": "5.bd9616ad",
>>"osd": 46,
>>"object_id": "rbd_data.e10ca56b8b4567.311c",
>>"object_locator": "@5",
>>"target_object_id": "rbd_data.e10ca56b8b4567.311c",
>>"target_object_locator": "@5",
>>"paused": 0,
>>"used_replica": 0,
>>"precalc_pgid": 0,
>>"last_sent": "2.28854e+06s",
>>"attempts": 1,
>>"snapid": "head",
>>"snap_context": "a07c2=[]",
>>"mtime": "2017-05-16 21:53:22.0.069541s",
>>"osd_ops": [
>>"delete"
>>]
>>}
>>],
>>"linger_ops": [
>>{
>>"linger_id": 1,
>>"pg": "5.5f3bd635",
>>"osd": 17,
>>"object_id": "rbd_header.e10ca56b8b4567",
>>"object_locator": "@5",
>>"target_object_id": "rbd_header.e10ca56b8b4567",
>>"target_object_locator": "@5",
>>"paused": 0,
>>"used_replica": 0,
>>"precalc_pgid": 0,
>>"snapid": "head",
>>"registered": "1"
>>}
>>],
>>"pool_ops": [],
>>"pool_stat_ops": [],
>>"statfs_ops": [],
>>"command_ops": []
>> }
>> 
>> Yes they've an established TCP connection. Qemu <=> osd.46. Attached is
>> a pcap file of the traffic between them when it got stuck.
>> 
>> Greets,
>> Stefan
>> 
>>> Am 16.05.2017 um 21:45 schrieb Jason Dillaman:
>>> On Tue, May 16, 2017 at 3:37 PM, Stefan Priebe - Profihost AG
>>>  wrote:
 We've enabled the op tracker for performance reasons while using SSD
 only storage ;-(
>>> 
>>> Disabled you mean?
>>> 
 Can enable the op tracker using ceph osd tell? Than reproduce the
 problem. Check what has stucked again? Or should i generate an rbd log
 from the client?
>>> 
>>> From a super-quick glance at the code, it looks like that isn't a
>>> dynamic setting. Of course, it's possible that if you restart OSD 46
>>> to enable the op tracker, the stuck op will clear itself and the VM
>>> will resume. You could attempt to generate a gcore of OSD 46 to see if
>>> information on that op could be extracted via the debugger, but no
>>> guarantees.
>>> 
>>> You might want to verify that the stuck client and OSD 46 have an
>>> actual established TCP connection as well before doing any further
>>> actions.
>>> 
> 
> 
> 
> -- 
> Jason


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Stefan Priebe - Profihost AG
Is there something I could do or test to find the bug?

Stefan

Excuse my typo sent from my mobile phone.

> Am 16.05.2017 um 22:54 schrieb Jason Dillaman :
> 
> It looks like it's just a ping message in that capture.
> 
> Are you saying that you restarted OSD 46 and the problem persisted?
> 
> On Tue, May 16, 2017 at 4:02 PM, Stefan Priebe - Profihost AG
>  wrote:
>> Hello,
>> 
>> while reproducing the problem, objecter_requests looks like this:
>> 
>> {
>>"ops": [
>>{
>>"tid": 42029,
>>"pg": "5.bd9616ad",
>>"osd": 46,
>>"object_id": "rbd_data.e10ca56b8b4567.311c",
>>"object_locator": "@5",
>>"target_object_id": "rbd_data.e10ca56b8b4567.311c",
>>"target_object_locator": "@5",
>>"paused": 0,
>>"used_replica": 0,
>>"precalc_pgid": 0,
>>"last_sent": "2.28854e+06s",
>>"attempts": 1,
>>"snapid": "head",
>>"snap_context": "a07c2=[]",
>>"mtime": "2017-05-16 21:53:22.0.069541s",
>>"osd_ops": [
>>"delete"
>>]
>>}
>>],
>>"linger_ops": [
>>{
>>"linger_id": 1,
>>"pg": "5.5f3bd635",
>>"osd": 17,
>>"object_id": "rbd_header.e10ca56b8b4567",
>>"object_locator": "@5",
>>"target_object_id": "rbd_header.e10ca56b8b4567",
>>"target_object_locator": "@5",
>>"paused": 0,
>>"used_replica": 0,
>>"precalc_pgid": 0,
>>"snapid": "head",
>>"registered": "1"
>>}
>>],
>>"pool_ops": [],
>>"pool_stat_ops": [],
>>"statfs_ops": [],
>>"command_ops": []
>> }
>> 
>> Yes they've an established TCP connection. Qemu <=> osd.46. Attached is
>> a pcap file of the traffic between them when it got stuck.
>> 
>> Greets,
>> Stefan
>> 
>>> Am 16.05.2017 um 21:45 schrieb Jason Dillaman:
>>> On Tue, May 16, 2017 at 3:37 PM, Stefan Priebe - Profihost AG
>>>  wrote:
 We've enabled the op tracker for performance reasons while using SSD
 only storage ;-(
>>> 
>>> Disabled you mean?
>>> 
 Can enable the op tracker using ceph osd tell? Than reproduce the
 problem. Check what has stucked again? Or should i generate an rbd log
 from the client?
>>> 
>>> From a super-quick glance at the code, it looks like that isn't a
>>> dynamic setting. Of course, it's possible that if you restart OSD 46
>>> to enable the op tracker, the stuck op will clear itself and the VM
>>> will resume. You could attempt to generate a gcore of OSD 46 to see if
>>> information on that op could be extracted via the debugger, but no
>>> guarantees.
>>> 
>>> You might want to verify that the stuck client and OSD 46 have an
>>> actual established TCP connection as well before doing any further
>>> actions.
>>> 
> 
> 
> 
> -- 
> Jason


Re: [ceph-users] Hammer to Jewel upgrade questions

2017-05-16 Thread Frédéric Nass
- On May 16, 2017, at 20:43, Shain Miley  wrote:

> Hello,

> I am going to be upgrading our production Ceph cluster from
> Hammer/Ubuntu 14.04 to Jewel/Ubuntu 16.04 and I wanted to ask a question
> and sanity check my upgrade plan.

> Here are the steps I am planning to take during the upgrade:

Hi Shain, 

0) upgrade operating system packages first and reboot on new kernel if needed. 

> 1)Upgrade to latest hammer on current cluster
> 2)Remove or rename the existing ‘ceph’ user and ‘ceph’ group on each node
> 3)Upgrade the ceph packages to latest Jewel (mon, then osd, then rbd
> clients)
You might want to upgrade the RBD clients first. This may not be a mandatory
step, but it is the cautious one.

> 4)stop ceph daemons
> 5)change permissions on ceph directories and osd journals:

> find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -type d|parallel chown -R
> 64045:64045
> chown 64045:64045 /var/lib/ceph
> chown 64045:64045 /var/lib/ceph/*
> chown 64045:64045 /var/lib/ceph/bootstrap-*/*

> for ID in $(ls /var/lib/ceph/osd/|cut -d '-' -f 2); do
> JOURNAL=$(readlink -f /var/lib/ceph/osd/ceph-${ID}/journal)
> chown ceph ${JOURNAL}

You can avoid this step by adding setuser_match_path = 
/var/lib/ceph/$type/$cluster-$id to the [osd] section. This will make the Ceph 
daemons run as root if the daemon’s data directory is still owned by root. 
Newly deployed daemons will be created with data owned by user ceph and will 
run with reduced privileges, but upgraded daemons will continue to run as root. 
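
A minimal sketch of that option (assuming the default cluster name and paths):

cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
setuser_match_path = /var/lib/ceph/$type/$cluster-$id
EOF
# restart the OSDs afterwards; they keep running as root until their data
# directories are chowned to ceph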

Or you can still change the ownership of the files to ceph, but it might take a
long time depending on the number of objects and PGs you have in your cluster,
for essentially zero benefit, especially since you'll recreate all this data
anyway once bluestore comes out.

> 6)restart ceph daemons

> The two questions I have are:
> 1)Am I missing anything from the steps above...based on prior
> experiences performing upgrades of this kind?

> 2)Should I upgrade to Ubuntu 16.04 first and then upgrade Ceph...or vice
> versa?
This documentation (http://docs.ceph.com/docs/master/start/os-recommendations/)
suggests sticking with Ubuntu 14.04, but the RHCS KB shows that RHCS 2.x (Jewel
10.2.x) is only supported on Ubuntu 16.04.
When upgrading from Hammer to Jewel, we upgraded the OS first (from RHEL 7 to
7.1), then RHCS. I'm not sure whether you should temporarily run Hammer on
Ubuntu 16.04 or Jewel on Ubuntu 14.04.
I would upgrade the lowest layer (the OS) on a single OSD node first and see how
it goes.

Regards, 

Frederic. 

> Thanks in advance,
> Shain

> --
> NPR | Shain Miley | Manager of Infrastructure, Digital Media | smi...@npr.org 
> |
> 202.513.3649
