Re: [ceph-users] PG::peek_map_epoch assertion fail

2017-12-06 Thread Gonzalo Aguilar Delgado

Hi,

Since my email server went down because of the error, I have to reply 
this way.


I added more logs:


  int r = store->omap_get_values(coll, pgmeta_oid, keys, &values);
  if (r == 0) {
    assert(values.size() == 2);   // <-- this is the assert that fails

 0> 2017-12-03 13:39:29.497091 7f467ba0b8c0 -1 osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, ceph::bufferlist*)' thread 7f467ba0b8c0 time 2017-12-03 13:39:29.495311

osd/PG.cc: 3025: FAILED assert(values.size() == 2)

And this is what happens:

https://pastebin.com/9fdeUxri


What I see is that omap_get_values for PG 9.b9 is causing the trouble, so I 
suppose that removing this PG entirely might let the OSD run a little longer, 
or perhaps indefinitely.
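
If it comes to that, this is roughly how I would export and then remove the PG copy on the affected OSD (a sketch only; assumes a FileStore OSD, and the OSD id and paths are just examples):

# stop the OSD first so the store can be opened offline (systemd or sysvinit as appropriate)
systemctl stop ceph-osd@6

# keep a backup of the PG before touching anything
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 \
    --journal-path /var/lib/ceph/osd/ceph-6/journal \
    --pgid 9.b9 --op export --file /root/pg-9.b9.export

# then remove the local copy of the PG from this OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 \
    --journal-path /var/lib/ceph/osd/ceph-6/journal \
    --pgid 9.b9 --op remove

systemctl start ceph-osd@6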



What do you think?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph luminous + multi mds: slow request. behind on trimming, failed to authpin local pins

2017-12-06 Thread Burkhard Linke

Hi,


we have upgraded our cluster to luminous 12.2.2 and wanted to use a 
second MDS for HA purposes. Upgrade itself went well, setting up the 
second MDS from the former standby-replay configuration worked, too.



But under load both MDS got stuck and needed to be restarted. It starts 
with slow requests:



2017-12-06 20:26:25.756475 7fddc4424700  0 log_channel(cluster) log [WRN] : slow request 122.370227 seconds old, received at 2017-12-06 20:24:23.386136: client_request(client.15057265:2898 getattr pAsLsXsFs #0x19de0f2 2017-12-06 20:24:23.244096 caller_uid=0, caller_gid=0{}) currently failed to rdlock, waiting


0x19de0f2 is the inode id of the directory we mount as root on most 
clients. Running daemonperf for both MDS shows a rising number of 
journal segments, accompanied by the corresponding warnings in the 
ceph log. We also see other slow requests:


2017-12-06 20:26:25.756488 7fddc4424700  0 log_channel(cluster) log [WRN] : slow request 180.346068 seconds old, received at 2017-12-06 20:23:25.410295: client_request(client.15163105:549847914 getattr pAs #0x19de0f2/sge-tmp 2017-12-06 20:23:25.406481 caller_uid=1426, caller_gid=1008{}) currently failed to authpin local pins

This is a client accessing a sub directory of the mount point.


On the client side (various Ubuntu kernels using the kernel-based CephFS 
client) this leads to CPU lockups if the problem is not fixed fast enough. 
The clients need a hard reboot to recover.



We have mitigated the problem by disabling the second MDS. The MDS 
related configuration is:



[mds.ceph-storage-04]
mds_replay_interval = 10
mds_cache_memory_limit = 10737418240

[mds]
mds_beacon_grace = 60
mds_beacon_interval = 4
mds_session_timeout = 120
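
For reference, a sketch of how the second active MDS can be taken out of service again (the filesystem name 'cephfs' and rank 1 are examples, adjust to your setup):

ceph fs set cephfs max_mds 1      # allow only a single active MDS again
ceph mds deactivate cephfs:1      # Luminous syntax; rank 1 drops back to standby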


The data pool is on replicated HDD storage, the metadata pool on replicated 
NVMe storage. The MDS are colocated with the OSDs (12 HDD OSDs + 2 NVMe OSDs, 
128 GB RAM).



The questions are:

- what is the minimum kernel version on clients required for multi mds 
setups?


- is the problem described above a known problem, e.g. a result of 
http://tracker.ceph.com/issues/21975 ?



Regards,

Burkhard Linke


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] HEALTH_ERR : PG_DEGRADED_FULL

2017-12-06 Thread Karun Josy
Hello,

I am seeing health error in our production cluster.

 health: HEALTH_ERR
1105420/11038158 objects misplaced (10.015%)
Degraded data redundancy: 2046/11038158 objects degraded (0.019%), 102 pgs unclean, 2 pgs degraded
Degraded data redundancy (low space): 4 pgs backfill_toofull

The cluster was running out of space, so I was in the process of adding a disk.
Since I got this error, we deleted some of the data to create more space.


This is the current usage after clearing some space; earlier, 3 disks were
at 85%.


$ ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE   AVAIL %USE  VAR  PGS
 0   ssd 1.86469  1.0  1909G  851G 1058G 44.59 0.78 265
16   ssd 0.87320  1.0   894G  361G  532G 40.43 0.71 112
 1   ssd 0.87320  1.0   894G  586G  307G 65.57 1.15 163
 2   ssd 0.87320  1.0   894G  490G  403G 54.84 0.96 145
17   ssd 0.87320  1.0   894G  163G  731G 18.24 0.32  58
 3   ssd 0.87320  1.0   894G  616G  277G 68.98 1.21 176
 4   ssd 0.87320  1.0   894G  593G  300G 66.42 1.17 179
 5   ssd 0.87320  1.0   894G  419G  474G 46.89 0.82 130
 6   ssd 0.87320  1.0   894G  422G  472G 47.21 0.83 129
 7   ssd 0.87320  1.0   894G  397G  496G 44.50 0.78 115
 8   ssd 0.87320  1.0   894G  656G  237G 73.44 1.29 184
 9   ssd 0.87320  1.0   894G  560G  333G 62.72 1.10 170
10   ssd 0.87320  1.0   894G  623G  270G 69.78 1.22 183
11   ssd 0.87320  1.0   894G  586G  307G 65.57 1.15 172
12   ssd 0.87320  1.0   894G  610G  283G 68.29 1.20 172
13   ssd 0.87320  1.0   894G  597G  296G 66.87 1.17 180
14   ssd 0.87320  1.0   894G  597G  296G 66.79 1.17 168
15   ssd 0.87320  1.0   894G  610G  283G 68.32 1.20 179
TOTAL 17110G 9746G 7363G 56.97

How to fix this? Please help!
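
Would something along these lines be the right approach? (Just a sketch with example thresholds, nothing has been run yet.)

ceph osd test-reweight-by-utilization 120   # dry run: show what would be changed
ceph osd reweight-by-utilization 120        # only touch OSDs above 120% of average utilization
ceph osd reweight 8 0.95                    # or nudge a single full OSD (e.g. osd.8) down manually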

Karun
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any way to get around selinux-policy-base dependency

2017-12-06 Thread Brad Hubbard
On Thu, Dec 7, 2017 at 4:23 AM, Bryan Banister
 wrote:
> Thanks Ken, that's understandable,
> -Bryan
>
> -Original Message-
> From: Ken Dreyer [mailto:kdre...@redhat.com]
> Sent: Wednesday, December 06, 2017 12:03 PM
> To: Bryan Banister 
> Cc: Ceph Users ; Rafael Suarez 
> 
> Subject: Re: [ceph-users] Any way to get around selinux-policy-base dependency
>
> Note: External Email
> -
>
> Hi Bryan,
>
> Why not upgrade to RHEL 7.4? We don't really build Ceph to run on
> older RHEL releases.
>
> - Ken
>
> On Mon, Dec 4, 2017 at 11:26 AM, Bryan Banister
>  wrote:
>> Hi all,
>>
>>
>>
>> I would like to upgrade to the latest Luminous release but found that it
>> requires the absolute latest selinux-policy-base.  We aren’t using selinux,
>> so was wondering if there is a way around this dependency requirement?
>>
>>
>>
>> [carf-ceph-osd15][WARNIN] Error: Package: 2:ceph-selinux-12.2.2-0.el7.x86_64
>> (ceph)
>>
>> [carf-ceph-osd15][WARNIN]Requires: selinux-policy-base >=
>> 3.13.1-166.el7_4.5
>>
>> [carf-ceph-osd15][WARNIN]Installed:
>> selinux-policy-targeted-3.13.1-102.el7_3.13.noarch
>> (@rhel7.3-rhn-server-production/7.3)

If you really want to get around this and are into (re)building RPMs,
you could try this patch:

https://github.com/ceph/ceph/commit/ee4f172f9837f2c1b674084e0f12591bc0ea.patch
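
Roughly, that rebuild would look something like this (a sketch from memory; package and file names are examples, adjust to your environment):

# grab the matching source RPM and install it into ~/rpmbuild
yumdownloader --source ceph          # or fetch ceph-12.2.2-0.el7.src.rpm from download.ceph.com
rpm -ivh ceph-12.2.2-0.el7.src.rpm

# fetch the patch, wire it into the spec file (PatchNNN / %patchNNN) and relax the
# selinux-policy-base requirement, then rebuild
cd ~/rpmbuild
curl -LO https://github.com/ceph/ceph/commit/ee4f172f9837f2c1b674084e0f12591bc0ea.patch
rpmbuild -ba SPECS/ceph.spec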

>>
>>
>>
>> Thanks for any help!
>>
>> -Bryan
>>
>>
>> 
>>
>> Note: This email is for the confidential use of the named addressee(s) only
>> and may contain proprietary, confidential or privileged information. If you
>> are not the intended recipient, you are hereby notified that any review,
>> dissemination or copying of this email is strictly prohibited, and to please
>> notify the sender immediately and destroy this email and any attachments.
>> Email transmission cannot be guaranteed to be secure or error-free. The
>> Company, therefore, does not make any guarantees as to the completeness or
>> accuracy of this email or any attachments. This email is for informational
>> purposes only and does not constitute a recommendation, offer, request or
>> solicitation of any kind to buy, sell, subscribe, redeem or perform any type
>> of transaction of a financial product.
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> 
>
> Note: This email is for the confidential use of the named addressee(s) only 
> and may contain proprietary, confidential or privileged information. If you 
> are not the intended recipient, you are hereby notified that any review, 
> dissemination or copying of this email is strictly prohibited, and to please 
> notify the sender immediately and destroy this email and any attachments. 
> Email transmission cannot be guaranteed to be secure or error-free. The 
> Company, therefore, does not make any guarantees as to the completeness or 
> accuracy of this email or any attachments. This email is for informational 
> purposes only and does not constitute a recommendation, offer, request or 
> solicitation of any kind to buy, sell, subscribe, redeem or perform any type 
> of transaction of a financial product.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd-nbd timeout and crash

2017-12-06 Thread David Turner
Do you have the FS mounted with trim/discard enabled?  What are your mount
options?
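
For example, something like this shows it (a sketch; substitute your nbd device or mount point):

findmnt -no SOURCE,FSTYPE,OPTIONS /dev/nbd0   # look for "discard" in the options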

On Wed, Dec 6, 2017 at 5:30 PM Jan Pekař - Imatic 
wrote:

> Hi,
>
> On 6.12.2017 15:24, Jason Dillaman wrote:
> > On Wed, Dec 6, 2017 at 3:46 AM, Jan Pekař - Imatic 
> wrote:
> >> Hi,
> >> I ran into an overloaded cluster (deep-scrub running) for a few seconds, and the
> >> rbd-nbd client timed out and the device became unavailable.
> >>
> >> block nbd0: Connection timed out
> >> block nbd0: shutting down sockets
> >> block nbd0: Connection timed out
> >> print_req_error: I/O error, dev nbd0, sector 2131833856
> >> print_req_error: I/O error, dev nbd0, sector 2131834112
> >>
> >> Is there any way how to extend rbd-nbd timeout?
> >
> > Support for changing the default timeout of 30 seconds is supported by
> > the kernel [1], but it's not currently implemented in rbd-nbd.  I
> > opened a new feature ticket for adding this option [2] but it may be
> > more constructive to figure out how to address a >30 second IO stall
> > on your cluster during deep-scrub.
>
> The kernel client does not support the new image features, so I decided to use
> rbd-nbd.
> Now I tried to rm a 300 GB folder, mounted with rbd-nbd from a COW
> snapshot, on my healthy and almost idle cluster with only 1 deep-scrub
> running, and I also hit the 30 s timeout and device disconnect. I'm mapping it
> from a virtual server, so there can be some performance issue, but I'm not
> chasing performance here, I'm after stability.
>
> Thank you
> With regards
> Jan Pekar
>
> >
> >> Also, getting the list of mapped devices fails -
> >>
> >> rbd-nbd list-mapped
> >>
> >> /build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: In function 'int
> >> get_mapped_info(int, Config*)' thread 7f069d41ec40 time 2017-12-06
> >> 09:40:33.541426
> >> /build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: 841: FAILED
> >> assert(ifs.is_open())
> >>   ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba)
> luminous
> >> (stable)
> >>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> const*)+0x102) [0x7f0693f567c2]
> >>   2: (()+0x14165) [0x559a8783d165]
> >>   3: (main()+0x9) [0x559a87838e59]
> >>   4: (__libc_start_main()+0xf1) [0x7f0691178561]
> >>   5: (()+0xff80) [0x559a87838f80]
> >>   NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to
> >> interpret this.
> >> Aborted
> >
> > It's been fixed in the master branch and is awaiting backport to
> > Luminous [1] -- I'd expect it to be available in v12.2.3.
> >
> >>
> >> Thank you
> >> With regards
> >> Jan Pekar
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > [1]
> https://github.com/torvalds/linux/blob/master/drivers/block/nbd.c#L1166
> > [2] http://tracker.ceph.com/issues/22333
> > [3] http://tracker.ceph.com/issues/22185
> >
> >
>
> --
> 
> Ing. Jan Pekař
> jan.pe...@imatic.cz | +420603811737 <+420%20603%20811%20737>
> 
> Imatic | Jagellonská 14 | Praha 3 | 130 00
> http://www.imatic.cz
> 
> --
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sudden omap growth on some OSDs

2017-12-06 Thread Gregory Farnum
On Wed, Dec 6, 2017 at 2:35 PM David Turner  wrote:

> I have no proof or anything other than a hunch, but OSDs don't trim omaps
> unless all PGs are healthy.  If this PG is actually not healthy, but the
> cluster doesn't realize it while these 11 involved OSDs do realize that the
> PG is unhealthy... You would see this exact problem.  The OSDs think a PG
> is unhealthy so they aren't trimming their omaps while the cluster doesn't
> seem to be aware of it and everything else is trimming their omaps properly.
>

I think you're confusing omaps and OSDMaps here. OSDMaps, like omap, are
stored in leveldb, but they have different trimming rules.


>
> I don't know what to do about it, but I hope it helps get you (or someone
> else on the ML) towards a resolution.
>
> On Wed, Dec 6, 2017 at 1:59 PM  wrote:
>
>> Hi ceph-users,
>>
>> We have a Ceph cluster (running Kraken) that is exhibiting some odd
>> behaviour.
>> A couple of weeks ago, the LevelDBs on some of our OSDs started growing large
>> (now at around 20G in size).
>>
>> The one thing they have in common is the 11 disks with inflating LevelDBs
>> are all in the set for one PG in one of our pools (EC 8+3). This pool
>> started to see use around the time the LevelDBs started inflating.
>> Compactions are running and they do go down in size a bit but the overall
>> trend is one of rapid growth. The other 2000+ OSDs in the cluster have
>> LevelDBs between 650M and 1.2G.
>> This PG has nothing to separate it from the others in its pool, within 5%
>> of average number of objects per PG, no hot-spotting in terms of load, no
>> weird states reported by ceph status.
>>
>> The one odd thing about it is the pg query output mentions it is
>> active+clean, but it has a recovery state, which it enters every morning
>> between 9 and 10am, where it mentions a "might_have_unfound" situation and
>> having probed all other set members. A deep scrub of the PG didn't turn up
>> anything.
>>
>
You need to be more specific here. What do you mean it "enters into" the
recovery state every morning?

How many PGs are in your 8+3 pool, and are all your OSDs hosting EC pools?
What are you using the cluster for?
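
In the meantime, it would help to quantify the difference. A quick way to compare LevelDB sizes across the OSDs in that PG's acting set (a sketch; assumes FileStore with the default omap location):

# run on each host that carries one of the 11 suspect OSDs
for osd in /var/lib/ceph/osd/ceph-*; do
    printf '%s: ' "$osd"
    du -sh "$osd/current/omap" | cut -f1    # FileStore keeps its LevelDB here
done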


>> The cluster is now starting to manifest slow requests on the OSDs with
>> the large LevelDBs, although not in the particular PG.
>>
>> What can I do to diagnose and resolve this?
>>
>> Thanks,
>>
>> George
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sudden omap growth on some OSDs

2017-12-06 Thread David Turner
I have no proof or anything other than a hunch, but OSDs don't trim omaps
unless all PGs are healthy.  If this PG is actually not healthy, but the
cluster doesn't realize it while these 11 involved OSDs do realize that the
PG is unhealthy... You would see this exact problem.  The OSDs think a PG
is unhealthy so they aren't trimming their omaps while the cluster doesn't
seem to be aware of it and everything else is trimming their omaps properly.

I don't know what to do about it, but I hope it helps get you (or someone
else on the ML) towards a resolution.

On Wed, Dec 6, 2017 at 1:59 PM  wrote:

> Hi ceph-users,
>
> We have a Ceph cluster (running Kraken) that is exhibiting some odd
> behaviour.
> A couple of weeks ago, the LevelDBs on some of our OSDs started growing large
> (now at around 20G in size).
>
> The one thing they have in common is the 11 disks with inflating LevelDBs
> are all in the set for one PG in one of our pools (EC 8+3). This pool
> started to see use around the time the LevelDBs started inflating.
> Compactions are running and they do go down in size a bit but the overall
> trend is one of rapid growth. The other 2000+ OSDs in the cluster have
> LevelDBs between 650M and 1.2G.
> This PG has nothing to separate it from the others in its pool, within 5%
> of average number of objects per PG, no hot-spotting in terms of load, no
> weird states reported by ceph status.
>
> The one odd thing about it is the pg query output mentions it is
> active+clean, but it has a recovery state, which it enters every morning
> between 9 and 10am, where it mentions a "might_have_unfound" situation and
> having probed all other set members. A deep scrub of the PG didn't turn up
> anything.
>
> The cluster is now starting to manifest slow requests on the OSDs with the
> large LevelDBs, although not in the particular PG.
>
> What can I do to diagnose and resolve this?
>
> Thanks,
>
> George
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd-nbd timeout and crash

2017-12-06 Thread Jan Pekař - Imatic

Hi,

On 6.12.2017 15:24, Jason Dillaman wrote:

On Wed, Dec 6, 2017 at 3:46 AM, Jan Pekař - Imatic  wrote:

Hi,
I ran into an overloaded cluster (deep-scrub running) for a few seconds, and the rbd-nbd 
client timed out and the device became unavailable.

block nbd0: Connection timed out
block nbd0: shutting down sockets
block nbd0: Connection timed out
print_req_error: I/O error, dev nbd0, sector 2131833856
print_req_error: I/O error, dev nbd0, sector 2131834112

Is there any way how to extend rbd-nbd timeout?


Support for changing the default timeout of 30 seconds is supported by
the kernel [1], but it's not currently implemented in rbd-nbd.  I
opened a new feature ticket for adding this option [2] but it may be
more constructive to figure out how to address a >30 second IO stall
on your cluster during deep-scrub.


The kernel client does not support the new image features, so I decided to use 
rbd-nbd.
Now I tried to rm a 300 GB folder, mounted with rbd-nbd from a COW 
snapshot, on my healthy and almost idle cluster with only 1 deep-scrub 
running, and I also hit the 30 s timeout and device disconnect. I'm mapping it 
from a virtual server, so there can be some performance issue, but I'm not 
chasing performance here, I'm after stability.


Thank you
With regards
Jan Pekar




Also, getting the list of mapped devices fails -

rbd-nbd list-mapped

/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: In function 'int
get_mapped_info(int, Config*)' thread 7f069d41ec40 time 2017-12-06
09:40:33.541426
/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: 841: FAILED
assert(ifs.is_open())
  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
(stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x7f0693f567c2]
  2: (()+0x14165) [0x559a8783d165]
  3: (main()+0x9) [0x559a87838e59]
  4: (__libc_start_main()+0xf1) [0x7f0691178561]
  5: (()+0xff80) [0x559a87838f80]
  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
interpret this.
Aborted


It's been fixed in the master branch and is awaiting backport to
Luminous [1] -- I'd expect it to be available in v12.2.3.



Thank you
With regards
Jan Pekar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[1] https://github.com/torvalds/linux/blob/master/drivers/block/nbd.c#L1166
[2] http://tracker.ceph.com/issues/22333
[3] http://tracker.ceph.com/issues/22185




--

Ing. Jan Pekař
jan.pe...@imatic.cz | +420603811737

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Sudden omap growth on some OSDs

2017-12-06 Thread george.vasilakakos
Hi ceph-users,

We have a Ceph cluster (running Kraken) that is exhibiting some odd behaviour.
A couple of weeks ago, the LevelDBs on some of our OSDs started growing large (now at 
around 20G in size).

The one thing they have in common is the 11 disks with inflating LevelDBs are 
all in the set for one PG in one of our pools (EC 8+3). This pool started to 
see use around the time the LevelDBs started inflating. Compactions are running 
and they do go down in size a bit but the overall trend is one of rapid growth. 
The other 2000+ OSDs in the cluster have LevelDBs between 650M and 1.2G.
This PG has nothing to separate it from the others in its pool, within 5% of 
average number of objects per PG, no hot-spotting in terms of load, no weird 
states reported by ceph status.

The one odd thing about it is the pg query output mentions it is active+clean, 
but it has a recovery state, which it enters every morning between 9 and 10am, 
where it mentions a "might_have_unfound" situation and having probed all other 
set members. A deep scrub of the PG didn't turn up anything.

The cluster is now starting to manifest slow requests on the OSDs with the 
large LevelDBs, although not in the particular PG.

What can I do to diagnose and resolve this?

Thanks,

George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any way to get around selinux-policy-base dependency

2017-12-06 Thread Bryan Banister
Thanks Ken, that's understandable,
-Bryan

-Original Message-
From: Ken Dreyer [mailto:kdre...@redhat.com]
Sent: Wednesday, December 06, 2017 12:03 PM
To: Bryan Banister 
Cc: Ceph Users ; Rafael Suarez 

Subject: Re: [ceph-users] Any way to get around selinux-policy-base dependency

Note: External Email
-

Hi Bryan,

Why not upgrade to RHEL 7.4? We don't really build Ceph to run on
older RHEL releases.

- Ken

On Mon, Dec 4, 2017 at 11:26 AM, Bryan Banister
 wrote:
> Hi all,
>
>
>
> I would like to upgrade to the latest Luminous release but found that it
> requires the absolute latest selinux-policy-base.  We aren’t using selinux,
> so was wondering if there is a way around this dependency requirement?
>
>
>
> [carf-ceph-osd15][WARNIN] Error: Package: 2:ceph-selinux-12.2.2-0.el7.x86_64
> (ceph)
>
> [carf-ceph-osd15][WARNIN]Requires: selinux-policy-base >=
> 3.13.1-166.el7_4.5
>
> [carf-ceph-osd15][WARNIN]Installed:
> selinux-policy-targeted-3.13.1-102.el7_3.13.noarch
> (@rhel7.3-rhn-server-production/7.3)
>
>
>
> Thanks for any help!
>
> -Bryan
>
>
> 
>
> Note: This email is for the confidential use of the named addressee(s) only
> and may contain proprietary, confidential or privileged information. If you
> are not the intended recipient, you are hereby notified that any review,
> dissemination or copying of this email is strictly prohibited, and to please
> notify the sender immediately and destroy this email and any attachments.
> Email transmission cannot be guaranteed to be secure or error-free. The
> Company, therefore, does not make any guarantees as to the completeness or
> accuracy of this email or any attachments. This email is for informational
> purposes only and does not constitute a recommendation, offer, request or
> solicitation of any kind to buy, sell, subscribe, redeem or perform any type
> of transaction of a financial product.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



Note: This email is for the confidential use of the named addressee(s) only and 
may contain proprietary, confidential or privileged information. If you are not 
the intended recipient, you are hereby notified that any review, dissemination 
or copying of this email is strictly prohibited, and to please notify the 
sender immediately and destroy this email and any attachments. Email 
transmission cannot be guaranteed to be secure or error-free. The Company, 
therefore, does not make any guarantees as to the completeness or accuracy of 
this email or any attachments. This email is for informational purposes only 
and does not constitute a recommendation, offer, request or solicitation of any 
kind to buy, sell, subscribe, redeem or perform any type of transaction of a 
financial product.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any way to get around selinux-policy-base dependency

2017-12-06 Thread Ken Dreyer
Hi Bryan,

Why not upgrade to RHEL 7.4? We don't really build Ceph to run on
older RHEL releases.

- Ken

On Mon, Dec 4, 2017 at 11:26 AM, Bryan Banister
 wrote:
> Hi all,
>
>
>
> I would like to upgrade to the latest Luminous release but found that it
> requires the absolute latest selinux-policy-base.  We aren’t using selinux,
> so was wondering if there is a way around this dependency requirement?
>
>
>
> [carf-ceph-osd15][WARNIN] Error: Package: 2:ceph-selinux-12.2.2-0.el7.x86_64
> (ceph)
>
> [carf-ceph-osd15][WARNIN]Requires: selinux-policy-base >=
> 3.13.1-166.el7_4.5
>
> [carf-ceph-osd15][WARNIN]Installed:
> selinux-policy-targeted-3.13.1-102.el7_3.13.noarch
> (@rhel7.3-rhn-server-production/7.3)
>
>
>
> Thanks for any help!
>
> -Bryan
>
>
> 
>
> Note: This email is for the confidential use of the named addressee(s) only
> and may contain proprietary, confidential or privileged information. If you
> are not the intended recipient, you are hereby notified that any review,
> dissemination or copying of this email is strictly prohibited, and to please
> notify the sender immediately and destroy this email and any attachments.
> Email transmission cannot be guaranteed to be secure or error-free. The
> Company, therefore, does not make any guarantees as to the completeness or
> accuracy of this email or any attachments. This email is for informational
> purposes only and does not constitute a recommendation, offer, request or
> solicitation of any kind to buy, sell, subscribe, redeem or perform any type
> of transaction of a financial product.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I cannot make the OSD to work, Journal always breaks 100% time

2017-12-06 Thread David Turner
Why are you flushing the journal after you zero it instead of before? That
does nothing. You want to flush the journal while it still holds entries that
might not yet be on the OSD, and only then zero it.
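
In other words, roughly this order (a sketch; assumes osd.6 with a FileStore journal, adjust the id and the init system to your setup):

systemctl stop ceph-osd@6        # make sure the OSD is down before touching the journal
ceph-osd -i 6 --flush-journal    # flush outstanding journal entries into the object store first
# ...only now replace or zero the journal device/partition...
ceph-osd -i 6 --mkjournal        # recreate the journal on the new/zeroed device
systemctl start ceph-osd@6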

On Wed, Dec 6, 2017, 6:02 AM Ronny Aasen  wrote:

> On 06. des. 2017 10:01, Gonzalo Aguilar Delgado wrote:
> > Hi,
> >
> > Another OSD fell down. And it's pretty scary how easy it is to break the
> > cluster. This time it is something related to the journal.
> >
> >
> > /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
> > starting osd.6 at :/0 osd_data /var/lib/ceph/osd/ceph-6
> > /var/lib/ceph/osd/ceph-6/journal
> > 2017-12-05 13:19:03.473082 7f24515148c0 -1 osd.6 10538 log_to_monitors
> > {default=true}
> > os/filestore/FileStore.cc: In function 'void
> > FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
> > ThreadPool::TPHandle*)' thread 7f243d1a0700 time 2017-12-05
> 13:19:04.433036
> > os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")
> >   ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
> >   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x80) [0x55569c1ff790]
> >   2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned
> > long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
> >   3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long,
> > ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
> >   4: (FileStore::_do_op(FileStore::OpSequencer*,
> > ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]
> >   5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
> >   6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
> >   7: (()+0x76ba) [0x7f24503e36ba]
> >   8: (clone()+0x6d) [0x7f244e45b3dd]
> >   NOTE: a copy of the executable, or `objdump -rdS ` is
> > needed to interpret this.
> > 2017-12-05 13:19:04.437968 7f243d1a0700 -1 os/filestore/FileStore.cc: In
> > function 'void FileStore::_do_transaction(ObjectStore::Transaction&,
> > uint64_t, int, ThreadPool::TPHandle*)' thread 7f243d1a0700 time
> > 2017-12-05 13:19:04.433036
> > os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")
> >
> >   ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
> >   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x80) [0x55569c1ff790]
> >   2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned
> > long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
> >   3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long,
> > ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
> >   4: (FileStore::_do_op(FileStore::OpSequencer*,
> > ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]
> >   5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
> >   6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
> >   7: (()+0x76ba) [0x7f24503e36ba]
> >   8: (clone()+0x6d) [0x7f244e45b3dd]
> >   NOTE: a copy of the executable, or `objdump -rdS ` is
> > needed to interpret this.
> >
> > os/filestore/FileStore.cc: In function 'void
> > FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
> > ThreadPool::TPHandle*)' thread 7f243d9a1700 time 2017-12-05
> 13:19:04.435362
> > os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")
> >   ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
> >   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x80) [0x55569c1ff790]
> >   2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned
> > long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
> >   3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long,
> > ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
> >   4: (FileStore::_do_op(FileStore::OpSequencer*,
> > ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]
> >   5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
> >   6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
> >   7: (()+0x76ba) [0x7f24503e36ba]
> >   8: (clone()+0x6d) [0x7f244e45b3dd]
> >   NOTE: a copy of the executable, or `objdump -rdS ` is
> > needed to interpret this.
> >-405> 2017-12-05 13:19:03.473082 7f24515148c0 -1 osd.6 10538
> > log_to_monitors {default=true}
> >   0> 2017-12-05 13:19:04.437968 7f243d1a0700 -1
> > os/filestore/FileStore.cc: In function 'void
> > FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
> > ThreadPool::TPHandle*)' thread 7f243d1a0700 time 2017-12-05
> 13:19:04.433036
> > os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")
> >
> >   ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
> >   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x80) [0x55569c1ff790]
> >   2: (FileStore::_do_transaction(ObjectStore::Transaction&, 

Re: [ceph-users] rbd-nbd timeout and crash

2017-12-06 Thread Jason Dillaman
On Wed, Dec 6, 2017 at 3:46 AM, Jan Pekař - Imatic  wrote:
> Hi,
> I ran into an overloaded cluster (deep-scrub running) for a few seconds, and the rbd-nbd
> client timed out and the device became unavailable.
>
> block nbd0: Connection timed out
> block nbd0: shutting down sockets
> block nbd0: Connection timed out
> print_req_error: I/O error, dev nbd0, sector 2131833856
> print_req_error: I/O error, dev nbd0, sector 2131834112
>
> Is there any way how to extend rbd-nbd timeout?

Support for changing the default timeout of 30 seconds is supported by
the kernel [1], but it's not currently implemented in rbd-nbd.  I
opened a new feature ticket for adding this option [2] but it may be
more constructive to figure out how to address a >30 second IO stall
on your cluster during deep-scrub.

> Also, getting the list of mapped devices fails -
>
> rbd-nbd list-mapped
>
> /build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: In function 'int
> get_mapped_info(int, Config*)' thread 7f069d41ec40 time 2017-12-06
> 09:40:33.541426
> /build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: 841: FAILED
> assert(ifs.is_open())
>  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x102) [0x7f0693f567c2]
>  2: (()+0x14165) [0x559a8783d165]
>  3: (main()+0x9) [0x559a87838e59]
>  4: (__libc_start_main()+0xf1) [0x7f0691178561]
>  5: (()+0xff80) [0x559a87838f80]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
> Aborted

It's been fixed in the master branch and is awaiting backport to
Luminous [1] -- I'd expect it to be available in v12.2.3.

>
> Thank you
> With regards
> Jan Pekar
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[1] https://github.com/torvalds/linux/blob/master/drivers/block/nbd.c#L1166
[2] http://tracker.ceph.com/issues/22333
[3] http://tracker.ceph.com/issues/22185


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running Jewel and Luminous mixed for a longer period

2017-12-06 Thread Yehuda Sadeh-Weinraub
It's hard to say; we don't really test your specific scenario, so use
it at your own risk. There was a change in cls_refcount that we had
issues with in the upgrade suite, but looking at it I'm not sure it'll
actually be a problem for you (you'll still hit the original problem
though).
Another problematic area is the OSD limit on large omap operations, for
which we added a 'truncated' flag to the relevant objclass operations.
Running older rgw against newer osds might cause issues when listing
omaps. You should make sure that bucket listing works correctly, but
there may be other issues (garbage collector, listing of user's
buckets, multipart upload completion). It could be that you can
configure that osd limit to be a higher number, so that you won't hit
that issue (RGW probably never requests more than 1000 entries off
omap, so setting it to 1k should be fine).
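
A few quick sanity checks along those lines (a sketch; the bucket and user names are placeholders):

ceph versions                                    # with Luminous mons: confirm which release each daemon runs
radosgw-admin bucket list --uid=someuser         # listing of a user's buckets
radosgw-admin bucket list --bucket=somebucket    # object listing in a large bucket (>1000 objects)
radosgw-admin gc list                            # garbage collector listing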

Yehuda

On Wed, Dec 6, 2017 at 2:09 PM, Wido den Hollander  wrote:
>
>> On 6 December 2017 at 10:25, Yehuda Sadeh-Weinraub wrote:
>>
>>
>> Are you using rgw? There are certain compatibility issues that you
>> might hit if you run mixed versions.
>>
>
> Yes, it is. So would it hurt if OSDs are running Luminous but the RGW is 
> still Jewel?
>
> Multisite isn't used, it's just a local RGW.
>
> Wido
>
>> Yehuda
>>
>> On Tue, Dec 5, 2017 at 3:20 PM, Wido den Hollander  wrote:
>> > Hi,
>> >
>> > I haven't tried this before but I expect it to work, but I wanted to check 
>> > before proceeding.
>> >
>> > I have a Ceph cluster which is running with manually formatted FileStore 
>> > XFS disks, Jewel, sysvinit and Ubuntu 14.04.
>> >
>> > I would like to upgrade this system to Luminous, but since I have to 
>> > re-install all servers and re-format all disks I'd like to move it to 
>> > BlueStore at the same time.
>> >
>> > This system however has 768 3TB disks and has a utilization of about 60%. 
>> > You can guess, it will take a long time before all the backfills complete.
>> >
>> > The idea is to take a machine down, wipe all disks, re-install it with 
>> > Ubuntu 16.04 and Luminous and re-format the disks with BlueStore.
>> >
>> > The OSDs get back, start to backfill and we wait.
>> >
>> > My estimation is that we can do one machine per day, but we have 48 
>> > machines to do. Realistically this will take ~60 days to complete.
>> >
>> > Afaik running Jewel (10.2.10) mixed with Luminous (12.2.2) should work 
>> > just fine I wanted to check if there are any caveats I don't know about.
>> >
>> > I'll upgrade the MONs to Luminous first before starting to upgrade the 
>> > OSDs. Between each machine I'll wait for a HEALTH_OK before proceeding 
>> > allowing the MONs to trim their datastore.
>> >
>> > The question is: Does it hurt to run Jewel and Luminous mixed for ~60 days?
>> >
>> > I think it won't, but I wanted to double-check.
>> >
>> > Wido
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph.conf tuning ... please comment

2017-12-06 Thread Van Leeuwen, Robert
Hi,

Let's start with a disclaimer: I'm not an expert on any of these Ceph tuning 
settings :)

However, in general with cluster intervals/timings you are trading quick 
failover detection for:
1) Processing power: 
You might starve yourself of resources when expanding the cluster.
If you multiply all the changes, you might actually create a lot of network 
chatter and need a lot of extra processing power.
With this in mind I would be very careful changing any setting by an order of 
magnitude unless you know exactly what the impact is.
2) A very “nervous” cluster: e.g. a short network hiccup and OSDs get 
marked out.
That creates a large amount of data shuffling, which could flood a network link, 
which could cause more short network hiccups, which would cause flapping OSDs, 
and so on.

IMHO I would change the mindset on tuning from: 
- what is the fastest possible failure detection time for a broken datacenter?
to:
- in case of a catastrophic DC failure (which happens maybe once every few years), 
are the defaults really so bad that you must deviate from the widely-tested 
deployment?

Of course it might be that the defaults are just a random number a dev put in 
and this is exactly what should be done in each deployment ;)
I am sure some other people have better insights into these specific settings.

Cheers,
Robert van Leeuwen



On 12/6/17, 7:01 AM, "ceph-users on behalf of Stefan Kooman" 
 wrote:

Dear list,

In a ceph blog post about the new Luminous release there is a paragraph
on the need for ceph tuning [1]:

"If you are a Ceph power user and believe there is some setting that you
need to change for your environment to get the best performance, please
tell us; we'd like to either adjust our defaults so that your change isn't
necessary or have a go at convincing you that you shouldn't be tuning
that option."

We have been tuning several ceph.conf parameters in order to allow for
"fast failure" when an entire datacenter goes offline. We now have
continued operation (no pending IO) after ~ 7 seconds. We have changed
the following parameters:

[global]
# http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
osd heartbeat grace = 4  # default 6
# Do _NOT_  scale based on laggy estimations
mon osd adjust heartbeat grace = false

^^ without this setting it could take up to two minutes before ceph
flagged a whole datacenter down (after we cut connectivity to the DC).
Not sure how the estimation is done, but not good enough for us.

[mon]
# http://docs.ceph.com/docs/master/rados/configuration/mon-config-ref/
# TUNING #
mon lease = 1.0# default 5
mon election timeout = 2   # default 5 
mon lease renew interval factor = 0.4  # default 0.6
mon lease ack timeout factor = 1.5 # default 2.0
mon timecheck interval = 60# default 300

Above checks are there to make the whole process faster. After a DC
failure the monitors will need a re-election (depending on what DC and
who was a leader and who were peon). While going through mon
debug logging we have observed that this whole process is really fast
(things happen to be done in milliseconds). We have a quite low latency
network, so I guess we can cut some slack here. Ceph won't make any
decisions while there is no consensus, so better get that consensus as
soon as possible.

# http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/#monitor-settings
mon osd reporter subtree level = datacenter

^^ We do want to make sure at least two datacenters are seeing a
datacenter go down, not individual hosts.

[osd]
# http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
osd crush update on start = false

Re: [ceph-users] ceph.conf tuning ... please comment

2017-12-06 Thread Piotr Dałek

On 17-12-06 07:01 AM, Stefan Kooman wrote:

[osd]
# http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
osd crush update on start = false
osd heartbeat interval = 1 # default 6
osd mon heartbeat interval = 10# default 30
osd mon report interval min = 1# default 5
osd mon report interval max = 15   # default 120

The osd would almost immediately see a "cut off" to their partner OSD's
in the placement group. By default they wait 6 seconds before sending
their report to the monitors. During our analysis this is exactly the
time the monitors were keeping an election. By tuning all of the above
we could get them to send their reports faster, and by the time the
election process was finished the monitors would handle the reports from
the OSDs and come to the conclusion that a DC is down, flag it down
and allow for normal client IO again.

Of course, stability and data safety is most important to us. So if any
of these settings make you worry please let us know.


Heartbeats, especially in Luminous, are quite heavy bandwidth-wise if you 
have a lot of OSDs in the cluster. You may want to keep osd heartbeat interval 
at 3 at the lowest, or if that's not acceptable then at least set "osd heartbeat 
min size" to 0.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running Jewel and Luminous mixed for a longer period

2017-12-06 Thread Wido den Hollander

> On 6 December 2017 at 10:25, Yehuda Sadeh-Weinraub wrote:
> 
> 
> Are you using rgw? There are certain compatibility issues that you
> might hit if you run mixed versions.
> 

Yes, it is. So would it hurt if OSDs are running Luminous but the RGW is still 
Jewel?

Multisite isn't used, it's just a local RGW.

Wido

> Yehuda
> 
> On Tue, Dec 5, 2017 at 3:20 PM, Wido den Hollander  wrote:
> > Hi,
> >
> > I haven't tried this before but I expect it to work, but I wanted to check 
> > before proceeding.
> >
> > I have a Ceph cluster which is running with manually formatted FileStore 
> > XFS disks, Jewel, sysvinit and Ubuntu 14.04.
> >
> > I would like to upgrade this system to Luminous, but since I have to 
> > re-install all servers and re-format all disks I'd like to move it to 
> > BlueStore at the same time.
> >
> > This system however has 768 3TB disks and has a utilization of about 60%. 
> > You can guess, it will take a long time before all the backfills complete.
> >
> > The idea is to take a machine down, wipe all disks, re-install it with 
> > Ubuntu 16.04 and Luminous and re-format the disks with BlueStore.
> >
> > The OSDs get back, start to backfill and we wait.
> >
> > My estimation is that we can do one machine per day, but we have 48 
> > machines to do. Realistically this will take ~60 days to complete.
> >
> > Afaik running Jewel (10.2.10) mixed with Luminous (12.2.2) should work just 
> > fine I wanted to check if there are any caveats I don't know about.
> >
> > I'll upgrade the MONs to Luminous first before starting to upgrade the 
> > OSDs. Between each machine I'll wait for a HEALTH_OK before proceeding 
> > allowing the MONs to trim their datastore.
> >
> > The question is: Does it hurt to run Jewel and Luminous mixed for ~60 days?
> >
> > I think it won't, but I wanted to double-check.
> >
> > Wido
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I cannot make the OSD to work, Journal always breaks 100% time

2017-12-06 Thread Ronny Aasen

On 06. des. 2017 10:01, Gonzalo Aguilar Delgado wrote:

Hi,

Another OSD fell down. And it's pretty scary how easy it is to break the 
cluster. This time it is something related to the journal.



/usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
starting osd.6 at :/0 osd_data /var/lib/ceph/osd/ceph-6 
/var/lib/ceph/osd/ceph-6/journal
2017-12-05 13:19:03.473082 7f24515148c0 -1 osd.6 10538 log_to_monitors 
{default=true}
os/filestore/FileStore.cc: In function 'void 
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, 
ThreadPool::TPHandle*)' thread 7f243d1a0700 time 2017-12-05 13:19:04.433036

os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")
  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x80) [0x55569c1ff790]
  2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
  3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, 
ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
  4: (FileStore::_do_op(FileStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]

  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
  6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
  7: (()+0x76ba) [0x7f24503e36ba]
  8: (clone()+0x6d) [0x7f244e45b3dd]
  NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.
2017-12-05 13:19:04.437968 7f243d1a0700 -1 os/filestore/FileStore.cc: In 
function 'void FileStore::_do_transaction(ObjectStore::Transaction&, 
uint64_t, int, ThreadPool::TPHandle*)' thread 7f243d1a0700 time 
2017-12-05 13:19:04.433036

os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")

  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x80) [0x55569c1ff790]
  2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
  3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, 
ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
  4: (FileStore::_do_op(FileStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]

  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
  6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
  7: (()+0x76ba) [0x7f24503e36ba]
  8: (clone()+0x6d) [0x7f244e45b3dd]
  NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.


os/filestore/FileStore.cc: In function 'void 
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, 
ThreadPool::TPHandle*)' thread 7f243d9a1700 time 2017-12-05 13:19:04.435362

os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")
  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x80) [0x55569c1ff790]
  2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
  3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, 
ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
  4: (FileStore::_do_op(FileStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]

  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
  6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
  7: (()+0x76ba) [0x7f24503e36ba]
  8: (clone()+0x6d) [0x7f244e45b3dd]
  NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.
   -405> 2017-12-05 13:19:03.473082 7f24515148c0 -1 osd.6 10538 
log_to_monitors {default=true}
  0> 2017-12-05 13:19:04.437968 7f243d1a0700 -1 
os/filestore/FileStore.cc: In function 'void 
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, 
ThreadPool::TPHandle*)' thread 7f243d1a0700 time 2017-12-05 13:19:04.433036

os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")

  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x80) [0x55569c1ff790]
  2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
  3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, 
ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
  4: (FileStore::_do_op(FileStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]

  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
  6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
  7: (()+0x76ba) [0x7f24503e36ba]
  8: (clone()+0x6d) [0x7f244e45b3dd]
  NOTE: a copy of the executable, or 

Re: [ceph-users] Running Jewel and Luminous mixed for a longer period

2017-12-06 Thread Richard Hesketh
On 06/12/17 09:17, Caspar Smit wrote:
> 
> 2017-12-05 18:39 GMT+01:00 Richard Hesketh  >:
> 
> On 05/12/17 17:10, Graham Allan wrote:
> > On 12/05/2017 07:20 AM, Wido den Hollander wrote:
> >> Hi,
> >>
> >> I haven't tried this before but I expect it to work, but I wanted to
> >> check before proceeding.
> >>
> >> I have a Ceph cluster which is running with manually formatted
> >> FileStore XFS disks, Jewel, sysvinit and Ubuntu 14.04.
> >>
> >> I would like to upgrade this system to Luminous, but since I have to
> >> re-install all servers and re-format all disks I'd like to move it to
> >> BlueStore at the same time.
> >
> > You don't *have* to update the OS in order to update to Luminous, do 
> you? Luminous is still supported on Ubuntu 14.04 AFAIK.
> >
> > Though obviously I understand your desire to upgrade; I only ask 
> because I am in the same position (Ubuntu 14.04, xfs, sysvinit), though 
> happily with a smaller cluster. Personally I was planning to upgrade ours 
> entirely to Luminous while still on Ubuntu 14.04, before later going through 
> the same process of decommissioning one machine at a time to reinstall with 
> CentOS 7 and Bluestore. I too don't see any reason the mixed Jewel/Luminous 
> cluster wouldn't work, but still felt less comfortable with extending the 
> upgrade duration.
> >
> > Graham
> 
> Yes, you can run luminous on Trusty; one of my clusters is currently 
> Luminous/Bluestore/Trusty as I've not had time to sort out doing OS upgrades 
> on it. I second the suggestion that it would be better to do the luminous 
> upgrade first, retaining existing filestore OSDs, and then do the OS 
> upgrade/OSD recreation on each node in sequence. I don't think there should 
> realistically be any problems with running a mixed cluster for a while but 
> doing the jewel->luminous upgrade on the existing installs first shouldn't be 
> significant extra effort/time as you're already predicting at least two 
> months to upgrade everything, and it does minimise the amount of change at 
> any one time in case things do start going horribly wrong.
> 
> Also, at 48 nodes, I would've thought you could get away with cycling 
> more than one of them at once. Assuming they're homogenous taking out even 4 
> at a time should only raise utilisation on the rest of the cluster to a 
> little over 65%, which still seems safe to me, and you'd waste way less time 
> waiting for recovery. (I recognise that depending on the nature of your 
> employment situation this may not actually be desirable...)
> 
>  
> Assuming size=3 and min_size=2 and failure-domain=host:
> 
> I always thought that bringing down more than 1 host causes data 
> inaccessibility right away, because there is a chance that a PG has OSDs on 
> both of those hosts. Only if the failure domain is higher than host (rack 
> or something) can you safely bring more than 1 host down (within the same failure 
> domain, of course).
> 
> Am I right? 
> 
> Kind regards,
> Caspar

Oh, yeah, if you just bring them down immediately without rebalancing first, 
you'll have problems. But the intention is that rather than just killing the 
nodes, you first weight them to 0 and then wait for the cluster to rebalance 
the data off them so they are empty and harmless when you do shut them down. 
You minimise time spent waiting and overall data movement if you do this sort 
of replacement in larger batches. Others have correctly pointed out though that 
the larger the change you make at any one time, the more likely something might 
go wrong overall... I suspect a good rule of thumb is that you should try to 
add/replace/remove nodes/OSDs in batches of as many as you can get away with at 
once without stretching outside the failure domain.
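
As a sketch (assumes Luminous's ceph osd ls-tree; ceph-node01 is a made-up host bucket name):

# drain every OSD on the host by dropping its CRUSH weight to 0, then wait for the rebalance
for id in $(ceph osd ls-tree ceph-node01); do
    ceph osd crush reweight osd.$id 0
done
ceph -s    # wait until all PGs are active+clean again before stopping/reinstalling the node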

Rich



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running Jewel and Luminous mixed for a longer period

2017-12-06 Thread Yehuda Sadeh-Weinraub
Are you using rgw? There are certain compatibility issues that you
might hit if you run mixed versions.

Yehuda

On Tue, Dec 5, 2017 at 3:20 PM, Wido den Hollander  wrote:
> Hi,
>
> I haven't tried this before but I expect it to work, but I wanted to check 
> before proceeding.
>
> I have a Ceph cluster which is running with manually formatted FileStore XFS 
> disks, Jewel, sysvinit and Ubuntu 14.04.
>
> I would like to upgrade this system to Luminous, but since I have to 
> re-install all servers and re-format all disks I'd like to move it to 
> BlueStore at the same time.
>
> This system however has 768 3TB disks and has a utilization of about 60%. You 
> can guess, it will take a long time before all the backfills complete.
>
> The idea is to take a machine down, wipe all disks, re-install it with Ubuntu 
> 16.04 and Luminous and re-format the disks with BlueStore.
>
> The OSDs get back, start to backfill and we wait.
>
> My estimation is that we can do one machine per day, but we have 48 machines 
> to do. Realistically this will take ~60 days to complete.
>
> Afaik running Jewel (10.2.10) mixed with Luminous (12.2.2) should work just 
> fine I wanted to check if there are any caveats I don't know about.
>
> I'll upgrade the MONs to Luminous first before starting to upgrade the OSDs. 
> Between each machine I'll wait for a HEALTH_OK before proceeding allowing the 
> MONs to trim their datastore.
>
> The question is: Does it hurt to run Jewel and Luminous mixed for ~60 days?
>
> I think it won't, but I wanted to double-check.
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running Jewel and Luminous mixed for a longer period

2017-12-06 Thread Wido den Hollander

> On 6 December 2017 at 10:17, Caspar Smit wrote:
> 
> 
> 2017-12-05 18:39 GMT+01:00 Richard Hesketh :
> 
> > On 05/12/17 17:10, Graham Allan wrote:
> > > On 12/05/2017 07:20 AM, Wido den Hollander wrote:
> > >> Hi,
> > >>
> > >> I haven't tried this before but I expect it to work, but I wanted to
> > >> check before proceeding.
> > >>
> > >> I have a Ceph cluster which is running with manually formatted
> > >> FileStore XFS disks, Jewel, sysvinit and Ubuntu 14.04.
> > >>
> > >> I would like to upgrade this system to Luminous, but since I have to
> > >> re-install all servers and re-format all disks I'd like to move it to
> > >> BlueStore at the same time.
> > >
> > > You don't *have* to update the OS in order to update to Luminous, do
> > you? Luminous is still supported on Ubuntu 14.04 AFAIK.
> > >
> > > Though obviously I understand your desire to upgrade; I only ask because
> > I am in the same position (Ubuntu 14.04, xfs, sysvinit), though happily
> > with a smaller cluster. Personally I was planning to upgrade ours entirely
> > to Luminous while still on Ubuntu 14.04, before later going through the
> > same process of decommissioning one machine at a time to reinstall with
> > CentOS 7 and Bluestore. I too don't see any reason the mixed Jewel/Luminous
> > cluster wouldn't work, but still felt less comfortable with extending the
> > upgrade duration.
> > >
> > > Graham
> >
> > Yes, you can run luminous on Trusty; one of my clusters is currently
> > Luminous/Bluestore/Trusty as I've not had time to sort out doing OS
> > upgrades on it. I second the suggestion that it would be better to do the
> > luminous upgrade first, retaining existing filestore OSDs, and then do the
> > OS upgrade/OSD recreation on each node in sequence. I don't think there
> > should realistically be any problems with running a mixed cluster for a
> > while but doing the jewel->luminous upgrade on the existing installs first
> > shouldn't be significant extra effort/time as you're already predicting at
> > least two months to upgrade everything, and it does minimise the amount of
> > change at any one time in case things do start going horribly wrong.
> >
> > Also, at 48 nodes, I would've thought you could get away with cycling more
> > than one of them at once. Assuming they're homogenous taking out even 4 at
> > a time should only raise utilisation on the rest of the cluster to a little
> > over 65%, which still seems safe to me, and you'd waste way less time
> > waiting for recovery. (I recognise that depending on the nature of your
> > employment situation this may not actually be desirable...)
> >
> >
> Assuming size=3 and min_size=2 and failure-domain=host:
> 
> I always thought that bringing down more than 1 host causes data
> inaccessibility right away, because the chance that a pg will have osd's on
> these 2 hosts is there. Only if the failure-domain is higher than host
> (rack or something) you can safely bring more than 1 host down (in the same
> failure domain of course).
> 
> Am I right?

Yes, you are right. In this case the cluster has its failure domain set to 'rack', 
which allows multiple machines in the same rack to go down without impacting 
availability.

> 
> Kind regards,
> Caspar
> 
> 
> > Rich
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running Jewel and Luminous mixed for a longer period

2017-12-06 Thread Wido den Hollander

> On 5 December 2017 at 18:39, Richard Hesketh
> wrote:
> 
> 
> On 05/12/17 17:10, Graham Allan wrote:
> > On 12/05/2017 07:20 AM, Wido den Hollander wrote:
> >> Hi,
> >>
> >> I haven't tried this before but I expect it to work, but I wanted to
> >> check before proceeding.
> >>
> >> I have a Ceph cluster which is running with manually formatted
> >> FileStore XFS disks, Jewel, sysvinit and Ubuntu 14.04.
> >>
> >> I would like to upgrade this system to Luminous, but since I have to
> >> re-install all servers and re-format all disks I'd like to move it to
> >> BlueStore at the same time.
> > 
> > You don't *have* to update the OS in order to update to Luminous, do you? 
> > Luminous is still supported on Ubuntu 14.04 AFAIK.
> > 
> > Though obviously I understand your desire to upgrade; I only ask because I 
> > am in the same position (Ubuntu 14.04, xfs, sysvinit), though happily with 
> > a smaller cluster. Personally I was planning to upgrade ours entirely to 
> > Luminous while still on Ubuntu 14.04, before later going through the same 
> > process of decommissioning one machine at a time to reinstall with CentOS 7 
> > and Bluestore. I too don't see any reason the mixed Jewel/Luminous cluster 
> > wouldn't work, but still felt less comfortable with extending the upgrade 
> > duration.
> > 

Well, the sysvinit part bothers me. This setup uses the 'devs' part in 
ceph.conf and such. It's all a kind of hacky system.

Most of these systems have run Dumpling on Ubuntu 12.04 and have been upgraded 
ever since. They are messy.

We'd like to reprovision all disks with ceph-volume while we are at it. Doing the 
OS and Ceph at the same time would make it a single step.
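
The per-disk step would then roughly look like this (just a sketch, the exact 
subcommands/flags may differ per release):

ceph-volume lvm zap /dev/sdX                           # wipe the old FileStore disk
ceph-volume lvm create --bluestore --data /dev/sdX    # create a new BlueStore OSD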

I've never tried to run Luminous under 14.04. Looking at the DEB packages there 
doesn't seem to be sysvinit support anymore in Luminous either.

> > Graham
> 
> Yes, you can run luminous on Trusty; one of my clusters is currently 
> Luminous/Bluestore/Trusty as I've not had time to sort out doing OS upgrades 
> on it. I second the suggestion that it would be better to do the luminous 
> upgrade first, retaining existing filestore OSDs, and then do the OS 
> upgrade/OSD recreation on each node in sequence. I don't think there should 
> realistically be any problems with running a mixed cluster for a while but 
> doing the jewel->luminous upgrade on the existing installs first shouldn't be 
> significant extra effort/time as you're already predicting at least two 
> months to upgrade everything, and it does minimise the amount of change at 
> any one time in case things do start going horribly wrong.
> 

I agree that fewer things at once is best. But we will at least automate the 
whole install/config using Salt, so that part is covered.

The Luminous on Trusty, does that run with sysvinit or with Upstart?

> Also, at 48 nodes, I would've thought you could get away with cycling more 
> than one of them at once. Assuming they're homogenous taking out even 4 at a 
> time should only raise utilisation on the rest of the cluster to a little 
> over 65%, which still seems safe to me, and you'd waste way less time waiting 
> for recovery. (I recognise that depending on the nature of your employment 
> situation this may not actually be desirable...)
> 

We can probably do more than one node at the same time, however, I'm setting up 
a plan which the admins will execute and we want to take the safe route. Uptime 
is important as well.

If we screw up a node the damage isn't that big.

But the main question remains: can you run a mix of Jewel and Luminous for a 
longer period?

If so, what are the caveats?

As clusters keep growing they will need to run a mix of versions. I have 
other clusters which are running Jewel and have 400 nodes. Upgrading them all 
will take a lot of time as well.

Thanks,

Wido

> Rich
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running Jewel and Luminous mixed for a longer period

2017-12-06 Thread Caspar Smit
2017-12-05 18:39 GMT+01:00 Richard Hesketh :

> On 05/12/17 17:10, Graham Allan wrote:
> > On 12/05/2017 07:20 AM, Wido den Hollander wrote:
> >> Hi,
> >>
> >> I haven't tried this before but I expect it to work, but I wanted to
> >> check before proceeding.
> >>
> >> I have a Ceph cluster which is running with manually formatted
> >> FileStore XFS disks, Jewel, sysvinit and Ubuntu 14.04.
> >>
> >> I would like to upgrade this system to Luminous, but since I have to
> >> re-install all servers and re-format all disks I'd like to move it to
> >> BlueStore at the same time.
> >
> > You don't *have* to update the OS in order to update to Luminous, do
> you? Luminous is still supported on Ubuntu 14.04 AFAIK.
> >
> > Though obviously I understand your desire to upgrade; I only ask because
> I am in the same position (Ubuntu 14.04, xfs, sysvinit), though happily
> with a smaller cluster. Personally I was planning to upgrade ours entirely
> to Luminous while still on Ubuntu 14.04, before later going through the
> same process of decommissioning one machine at a time to reinstall with
> CentOS 7 and Bluestore. I too don't see any reason the mixed Jewel/Luminous
> cluster wouldn't work, but still felt less comfortable with extending the
> upgrade duration.
> >
> > Graham
>
> Yes, you can run luminous on Trusty; one of my clusters is currently
> Luminous/Bluestore/Trusty as I've not had time to sort out doing OS
> upgrades on it. I second the suggestion that it would be better to do the
> luminous upgrade first, retaining existing filestore OSDs, and then do the
> OS upgrade/OSD recreation on each node in sequence. I don't think there
> should realistically be any problems with running a mixed cluster for a
> while but doing the jewel->luminous upgrade on the existing installs first
> shouldn't be significant extra effort/time as you're already predicting at
> least two months to upgrade everything, and it does minimise the amount of
> change at any one time in case things do start going horribly wrong.
>
> Also, at 48 nodes, I would've thought you could get away with cycling more
> than one of them at once. Assuming they're homogenous taking out even 4 at
> a time should only raise utilisation on the rest of the cluster to a little
> over 65%, which still seems safe to me, and you'd waste way less time
> waiting for recovery. (I recognise that depending on the nature of your
> employment situation this may not actually be desirable...)
>
>
Assuming size=3 and min_size=2 and failure-domain=host:

I always thought that bringing down more than 1 host causes data
inaccessibility right away, because the chance that a pg will have osd's on
these 2 hosts is there. Only if the failure-domain is higher than host
(rack or something) you can safely bring more than 1 host down (in the same
failure domain of course).

Am I right?
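
(To make it concrete: with size=3 / min_size=2 one could at least check up front 
whether any pg has two of its copies on the hosts to be taken down, for example:

ceph osd pool ls detail     # shows size and min_size per pool
ceph pg dump pgs_brief      # acting set per pg; any pg with 2 of its OSDs on those hosts would drop below min_size

That is only a sanity check, of course, not a guarantee.)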

Kind regards,
Caspar


> Rich
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] I cannot make the OSD to work, Journal always breaks 100% time

2017-12-06 Thread Gonzalo Aguilar Delgado

Hi,

Another OSD fell down. And it's pretty scary how easy it is to break the 
cluster. This time it is something related to the journal.



/usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
starting osd.6 at :/0 osd_data /var/lib/ceph/osd/ceph-6 
/var/lib/ceph/osd/ceph-6/journal
2017-12-05 13:19:03.473082 7f24515148c0 -1 osd.6 10538 log_to_monitors 
{default=true}
os/filestore/FileStore.cc: In function 'void 
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, 
ThreadPool::TPHandle*)' thread 7f243d1a0700 time 2017-12-05 13:19:04.433036

os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")
 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x80) [0x55569c1ff790]
 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
 3: (FileStore::_do_transactions(std::vector&, unsigned long, 
ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
 4: (FileStore::_do_op(FileStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]

 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
 6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
 7: (()+0x76ba) [0x7f24503e36ba]
 8: (clone()+0x6d) [0x7f244e45b3dd]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.
2017-12-05 13:19:04.437968 7f243d1a0700 -1 os/filestore/FileStore.cc: In 
function 'void FileStore::_do_transaction(ObjectStore::Transaction&, 
uint64_t, int, ThreadPool::TPHandle*)' thread 7f243d1a0700 time 
2017-12-05 13:19:04.433036

os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x80) [0x55569c1ff790]
 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
 3: (FileStore::_do_transactions(std::vector&, unsigned long, 
ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
 4: (FileStore::_do_op(FileStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]

 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
 6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
 7: (()+0x76ba) [0x7f24503e36ba]
 8: (clone()+0x6d) [0x7f244e45b3dd]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.


os/filestore/FileStore.cc: In function 'void 
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, 
ThreadPool::TPHandle*)' thread 7f243d9a1700 time 2017-12-05 13:19:04.435362

os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")
 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x80) [0x55569c1ff790]
 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
 3: (FileStore::_do_transactions(std::vector&, unsigned long, 
ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
 4: (FileStore::_do_op(FileStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]

 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
 6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
 7: (()+0x76ba) [0x7f24503e36ba]
 8: (clone()+0x6d) [0x7f244e45b3dd]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.
  -405> 2017-12-05 13:19:03.473082 7f24515148c0 -1 osd.6 10538 
log_to_monitors {default=true}
 0> 2017-12-05 13:19:04.437968 7f243d1a0700 -1 
os/filestore/FileStore.cc: In function 'void 
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, 
ThreadPool::TPHandle*)' thread 7f243d1a0700 time 2017-12-05 13:19:04.433036

os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x80) [0x55569c1ff790]
 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
 3: (FileStore::_do_transactions(std::vector&, unsigned long, 
ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
 4: (FileStore::_do_op(FileStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]

 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
 6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
 7: (()+0x76ba) [0x7f24503e36ba]
 8: (clone()+0x6d) [0x7f244e45b3dd]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.


2017-12-05 13:19:04.442866 7f243d9a1700 -1 
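
(Re-running the OSD with filestore debugging turned up should show which 
operation and errno is actually behind the "unexpected error"; assuming the 
usual debug switches apply here, something like:

/usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph \
    --debug-filestore 20 --debug-osd 20
)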

[ceph-users] rbd-nbd timeout and crash

2017-12-06 Thread Jan Pekař - Imatic

Hi,
I ran into an overloaded cluster (deep-scrub running) for a few seconds, the 
rbd-nbd client timed out, and the device became unavailable.


block nbd0: Connection timed out
block nbd0: shutting down sockets
block nbd0: Connection timed out
print_req_error: I/O error, dev nbd0, sector 2131833856
print_req_error: I/O error, dev nbd0, sector 2131834112

Is there any way to extend the rbd-nbd timeout?
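
(If newer builds have something like the --timeout option for rbd-nbd map, that 
is probably what I am after, e.g.

rbd-nbd map --timeout 120 rbd/myimage    # pool/image name made up for the example

but I could not confirm whether 12.2.2 already supports it.)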

Also, getting the mapped devices failed:

rbd-nbd list-mapped

/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: In function 'int 
get_mapped_info(int, Config*)' thread 7f069d41ec40 time 2017-12-06 
09:40:33.541426
/build/ceph-12.2.2/src/tools/rbd_nbd/rbd-nbd.cc: 841: FAILED 
assert(ifs.is_open())
 ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) 
luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x102) [0x7f0693f567c2]

 2: (()+0x14165) [0x559a8783d165]
 3: (main()+0x9) [0x559a87838e59]
 4: (__libc_start_main()+0xf1) [0x7f0691178561]
 5: (()+0xff80) [0x559a87838f80]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.

Aborted


Thank you
With regards
Jan Pekar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hangs with qemu/libvirt/rbd when one host disappears

2017-12-06 Thread Alwin Antreich
Hello Marcus,
On Tue, Dec 05, 2017 at 07:09:35PM +0100, Marcus Priesch wrote:
> Dear Ceph Users,
>
> first of all, big thanks to all the devs and people who made all this
> possible, ceph is amazing !!!
>
> ok, so let me get to the point where i need your help:
>
> i have a cluster of 6 hosts, mixed with ssd's and hdd's.
>
> on 4 of the 6 hosts are 21 vm's running in total with less to no
> workload (web, mail, elasticsearch) for a couple of users.
>
> 4 nodes are running ubuntu server and 2 of them are running proxmox
> (because we are now in the process of migrating towards proxmox).
>
> i am running ceph luminous (have upgraded two weeks ago)
I guess you are running ceph 12.2.1 (12.2.2 is out)? What does 'ceph 
versions' say?

>
> ceph communication is carried out on a separate 1Gbit Network where we
> plan to upgrade to bonded 2x10Gbit during the next couple of weeks.
With 6 hosts you will need 10GbE, if only for lower latency. Also, a ceph
recovery/rebalance might max out the bandwidth of your link.

>
> i have two pools defined where i only use disk images via libvirt/rbd.
>
> the hdd pool has two replicas and is for large (~4TB) backup images and
> the ssd pool has three replicas (two on ssd osd's and one on hdd osd's)
> for improved fail safety and faster access for "live data" and OS
> images.
Mixing of spinners with SSDs is not recommended, as spinners will slow
down the pools residing on that root.

>
> in the crush map i have two different rules for the two pools so that
> replicas always are stored on different hosts - i have verified this and
> it works. it is coded via the "host" attribute (host node1-hdd and host
> node1 are both actually on the same host)
>
> so, now comes the interesting part:
>
> when i turn off one of the hosts (lets say node7) that do only ceph,
> after some time the vm's stall and hang until the host comes up again.
A stall of I/O shouldn't happen. What is the min_size of your pools? What
does your 'ceph osd tree' look like?
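
For example (nothing special assumed, just the standard commands):

ceph osd pool ls detail     # shows size and min_size for every pool
ceph osd tree               # shows how the OSDs are organised under hosts/racks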
>
> when i dont turn on the host again, after some time the cluster starts
> rebalancing ...
Expected.

>
> yesterday i experienced that after a couple of hours of rebalancing the
> vm's continue working again - i think thats when the cluster has
> finished rebalancing ? havent really digged into this.
See above.

>
> well, today we turned off the same host (node7) again and i got stuck
> pg's again.
>
> this time i did some investigation and to my surprise i found the
> following in the output of ceph health detail:
>
> REQUEST_SLOW 17 slow requests are blocked > 32 sec
> 3 ops are blocked > 2097.15 sec
> 14 ops are blocked > 1048.58 sec
> osds 9,10 have blocked requests > 1048.58 sec
> osd.5 has blocked requests > 2097.15 sec
>
> i think the blocked requests are my problem, do they ?
That is a symptom of the problem, see above.

>
> but neither osd's 9, 10 or 5 are located on host7 - so can anyone of you
> tell me why the requests to this nodes got stuck ?
Those OSDs are waiting on other OSDs on host7. You can see that in the
ceph logs, and 'ceph pg dump' shows which pgs are located on which
OSDs.
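
For a single pg (take the pg id from 'ceph health detail'; <pgid> is just a 
placeholder here), something like:

ceph pg map <pgid>       # up/acting OSD set of that pg
ceph pg <pgid> query     # full detail, including its recovery state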

>
> i have one pg in state "stuck unclean" which has its replicas on osd's
> 2, 3 and 15. 3 is on node7, but the first in the active set is 2 - i
> thought the "write op" should have gone there ... so why unclean ? the
> manual states "For stuck unclean placement groups, there is usually
> something preventing recovery from completing, like unfound objects" but
> there arent ...
unclean - The placement group has not been clean for too long (i.e., it
hasn’t been able to completely recover from a previous failure).
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#stuck-placement-groups
How is your 1GbE utilized? I guess, with 6 nodes (3-4 OSDs) your link
might be maxed out. But you should get something in the ceph
logs.
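
Something like iftop or sar on the cluster interface will show the utilization 
(the interface name below is just an example):

iftop -i eth0        # live bandwidth per connection
sar -n DEV 1         # per-interface throughput every second (needs sysstat)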

>
> do i have a configuration issue here (amount of replicas?) or is this
> behavior simply just because my cluster network is too slow ?
>
> you can find detailed outputs here :
>
>   https://owncloud.priesch.co.at/index.php/s/toYdGekchqpbydY
>
> i hope any of you can help me shed any light on this ...
>
> at least the point of all is that a single host should be allowed to
> fail and the vm's continue running ... ;)
To get a better look at your setup, a crush map, ceph osd dump, ceph -s
and some log output would be nice.
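
For example (the file names are arbitrary):

ceph osd getcrushmap -o crushmap.bin && crushtool -d crushmap.bin -o crushmap.txt
ceph osd dump > osd-dump.txt
ceph -s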

Also, since you are moving to Proxmox, you might want to have a look at the docs
& the forum.

Docs: https://pve.proxmox.com/pve-docs/
Forum: https://forum.proxmox.com
Some more useful information on PVE + Ceph: 
https://forum.proxmox.com/threads/ceph-raw-usage-grows-by-itself.38395/#post-189842

>
> regards and thanks in advance,
> marcus.
>
> --
> Marcus Priesch
> open source consultant - solution provider
> www.priesch.co.at / off...@priesch.co.at
> A-2122 Riedenthal, In Prandnern 31 / +43 650 62 72 870
> ___
> ceph-users mailing