[ceph-users] Ceph PVE cluster help

2019-08-26 Thread Daniel K
I set up a Ceph cluster with PVE for some friends a few years ago.
It wasn't maintained and is now in bad shape.

They've reached out to me for help, but I do not have the time to assist
right now.

Is there anyone on the list who would be willing to help? As a
professional service, of course.

Please reach out to me off-list.

Thanks,
Daniel


Re: [ceph-users] How to just delete PGs stuck incomplete on EC pool

2019-03-04 Thread Daniel K
Thanks for the suggestions.

I've tried both -- setting osd_find_best_info_ignore_history_les = true and
restarting all OSDs, as well as 'ceph osd force-create-pg' -- but both PGs
still show incomplete:

PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
pg 18.c is incomplete, acting [32,48,58,40,13,44,61,59,30,27,43,37]
(reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs
for 'incomplete')
pg 18.1e is incomplete, acting [50,49,41,58,60,46,52,37,34,63,57,16]
(reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs
for 'incomplete')


The OSDs in down_osds_we_would_probe have already been marked lost

When I ran the force-create-pg command, they went to peering for a few
seconds, but then went back to incomplete.
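
For reference, this is roughly what I did (paraphrased; <id> stands for each
OSD in down_osds_we_would_probe):

# in ceph.conf on every OSD node, followed by an OSD restart:
#   osd find best info ignore history les = true
ceph osd lost <id> --yes-i-really-mean-it
ceph osd force-create-pg 18.c
ceph osd force-create-pg 18.1e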

Updated ceph pg 18.1e query https://pastebin.com/XgZHvJXu
Updated ceph pg 18.c query https://pastebin.com/N7xdQnhX

Any other suggestions?



Thanks again,

Daniel



On Sat, Mar 2, 2019 at 3:44 PM Paul Emmerich  wrote:

> On Sat, Mar 2, 2019 at 5:49 PM Alexandre Marangone
>  wrote:
> >
> > If you have no way to recover the drives, you can try to reboot the OSDs
> with `osd_find_best_info_ignore_history_les = true` (revert it afterwards);
> you'll lose data. If the PGs are still down after this, you can mark the OSDs
> blocking the PGs from becoming active as lost.
>
> this should work for PG 18.1e, but not for 18.c. Try running "ceph osd
> force-create-pg " to reset the PGs instead.
> Data will obviously be lost afterwards.
>
> Paul
>
> >
> > On Sat, Mar 2, 2019 at 6:08 AM Daniel K  wrote:
> >>
> >> They all just started having read errors. Bus resets. Slow reads. Which
> is one of the reasons the cluster didn't recover fast enough to compensate.
> >>
> >> I tried to be mindful of the drive type and specifically avoided the
> larger capacity Seagates that are SMR. Used 1 SM863 for every 6 drives for
> the WAL.
> >>
> >> Not sure why they failed. The data isn't critical at this point, just
> need to get the cluster back to normal.
> >>
> >> On Sat, Mar 2, 2019, 9:00 AM  wrote:
> >>>
> >>> Did they break, or did something go wrong trying to replace them?
> >>>
> >>> Jespe
> >>>
> >>>
> >>>
> >>> Sent from myMail for iOS
> >>>
> >>>
> >>> Saturday, 2 March 2019, 14.34 +0100 from Daniel K  >:
> >>>
> >>> I bought the wrong drives trying to be cheap. They were 2TB WD Blue
> 5400rpm 2.5 inch laptop drives.
> >>>
> >>> They've been replaced now with HGST 10K 1.8TB SAS drives.
> >>>
> >>>
> >>>
> >>> On Sat, Mar 2, 2019, 12:04 AM  wrote:
> >>>
> >>>
> >>>
> >>> Saturday, 2 March 2019, 04.20 +0100 from satha...@gmail.com <
> satha...@gmail.com>:
> >>>
> >>> 56 OSD, 6-node 12.2.5 cluster on Proxmox
> >>>
> >>> We had multiple drives fail(about 30%) within a few days of each
> other, likely faster than the cluster could recover.
> >>>
> >>>
> >>> How did so many drives break?
> >>>
> >>> Jesper
> >>


Re: [ceph-users] How to just delete PGs stuck incomplete on EC pool

2019-03-02 Thread Daniel K
They all just started having read errors, bus resets, and slow reads, which is
one of the reasons the cluster didn't recover fast enough to compensate.

I tried to be mindful of the drive type and specifically avoided the larger
capacity Seagates that are SMR. Used 1 SM863 for every 6 drives for the WAL.

Not sure why they failed. The data isn't critical at this point, just need
to get the cluster back to normal.

On Sat, Mar 2, 2019, 9:00 AM  wrote:

> Did they break, or did something go wrong trying to replace them?
>
> Jespe
>
>
>
> Sent from myMail for iOS
>
>
> Saturday, 2 March 2019, 14.34 +0100 from Daniel K :
>
> I bought the wrong drives trying to be cheap. They were 2TB WD Blue
> 5400rpm 2.5 inch laptop drives.
>
> They've been replaced now with HGST 10K 1.8TB SAS drives.
>
>
>
> On Sat, Mar 2, 2019, 12:04 AM  wrote:
>
>
>
> Saturday, 2 March 2019, 04.20 +0100 from satha...@gmail.com <
> satha...@gmail.com>:
>
> 56 OSD, 6-node 12.2.5 cluster on Proxmox
>
> We had multiple drives fail(about 30%) within a few days of each other,
> likely faster than the cluster could recover.
>
>
> How did so many drives break?
>
> Jesper
>
>


Re: [ceph-users] How to just delete PGs stuck incomplete on EC pool

2019-03-02 Thread Daniel K
I bought the wrong drives trying to be cheap. They were 2TB WD Blue 5400rpm
2.5 inch laptop drives.

They've been replaced now with HGST 10K 1.8TB SAS drives.



On Sat, Mar 2, 2019, 12:04 AM  wrote:

>
>
> Saturday, 2 March 2019, 04.20 +0100 from satha...@gmail.com <
> satha...@gmail.com>:
>
> 56 OSD, 6-node 12.2.5 cluster on Proxmox
>
> We had multiple drives fail(about 30%) within a few days of each other,
> likely faster than the cluster could recover.
>
>
> How did so many drives break?
>
> Jesper
>
>


[ceph-users] How to just delete PGs stuck incomplete on EC pool

2019-03-01 Thread Daniel K
56 OSD, 6-node 12.2.5 cluster on Proxmox

We had multiple drives fail(about 30%) within a few days of each other,
likely faster than the cluster could recover.

After the dust settled, we have 2 out of 896 pgs stuck inactive. The failed
drives are completely inaccessible, so I can't mount them and attempt to
export the PGs.

Do I have any options besides just considering them lost -- and how do I
tell Ceph they are lost so that I can get my cluster back to normal? I
already reduced min_size from 9 to 8 and can't reduce it any more. The OSDs
in "down_osds_we_would_probe" have already all been marked as lost (ceph osd
lost xx).
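
For completeness, what has already been done, roughly (paraphrased; <id> stands
for each OSD in down_osds_we_would_probe):

ceph osd pool set ec84-hdd-zm min_size 8     # was 9; k=8, so it cannot go lower
ceph osd lost <id> --yes-i-really-mean-it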

ceph health detail:

PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
pg 18.c is incomplete, acting [32,48,58,40,13,44,61,59,30,27,43,37]
(reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs
for 'incomplete')
pg 18.1e is incomplete, acting [50,49,41,58,60,46,52,37,34,63,57,16]
(reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs
for 'incomplete')


root@pve4:~# ceph osd erasure-code-profile get ec-84-hdd
crush-device-class=
crush-failure-domain=host
crush-root=default
k=8
m=4
plugin=isa
technique=reed_sol_van

Results of ceph pg 18.c query https://pastebin.com/V8nByRF6
Results of ceph pg 18.1e query https://pastebin.com/rBWwPYUn

Thanks

Dan


Re: [ceph-users] Multiple OSD crashing on 12.2.0. Bluestore / EC pool / rbd

2018-12-20 Thread Daniel K
I'm hitting this same issue on 12.2.5. Upgraded one node to 12.2.10 and it
didn't clear.

6 OSDs are flapping with this error. I know this is an older issue, but are
traces still needed? I don't see a resolution available.

Thanks,

Dan

On Wed, Sep 6, 2017 at 10:30 PM Brad Hubbard  wrote:

> These error logs look like they are being generated here,
>
> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8987-L8993
> or possibly here,
>
> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L9230-L9236
> .
>
> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]:
> 2017-09-05 17:02:58.686723 7fe1871ac700 -1
> bluestore(/var/lib/ceph/osd/ceph-12) _txc_add_transaction error (2) No
> such file or directory not handled on operation 15 (op 0, counting
> from 0)
>
> The table of operations is here,
> https://github.com/ceph/ceph/blob/master/src/os/ObjectStore.h#L370
>
> Operation 15 is OP_SETATTRS so it appears to be some extended
> attribute operation that is failing.
>
> Can someone run the ceph-osd under strace and find the last system
> call (probably a call that manipulates xattrs) that returns -2 in the
> thread that crashes (or that outputs the above messages)?
>
> strace -fvttyyTo /tmp/strace.out -s 1024 ceph-osd [system specific
> arguments]
>
> Capturing logs with "debug_bluestore = 20" may tell us more as well.
>
> We need to work out what resource it is trying to access when it
> receives the error '2' (No such file or directory).
>
>
> On Thu, Sep 7, 2017 at 12:13 AM, Thomas Coelho
>  wrote:
> > Hi,
> >
> > I have the same problem. A bug [1] is reported since months, but
> > unfortunately this is not fixed yet. I hope, if more people are having
> > this problem the developers can reproduce and fix it.
> >
> > I was using Kernel-RBD with a Cache Tier.
> >
> > so long
> > Thomas Coelho
> >
> > [1] http://tracker.ceph.com/issues/20222
> >
> >
> > Am 06.09.2017 um 15:41 schrieb Henrik Korkuc:
> >> On 17-09-06 16:24, Jean-Francois Nadeau wrote:
> >>> Hi,
> >>>
> >>> On a 4 node / 48 OSDs Luminous cluster Im giving a try at RBD on EC
> >>> pools + Bluestore.
> >>>
> >>> Setup went fine but after a few bench runs several OSD are failing and
> >>> many wont even restart.
> >>>
> >>> ceph osd erasure-code-profile set myprofile \
> >>>k=2\
> >>>m=1 \
> >>>crush-failure-domain=host
> >>> ceph osd pool create mypool 1024 1024 erasure myprofile
> >>> ceph osd pool set mypool allow_ec_overwrites true
> >>> rbd pool init mypool
> >>> ceph -s
> >>> ceph health detail
> >>> ceph osd pool create metapool 1024 1024 replicated
> >>> rbd create --size 1024G --data-pool mypool --image metapool/test1
> >>> rbd bench -p metapool test1 --io-type write --io-size 8192
> >>> --io-pattern rand --io-total 10G
> >>> ...
> >>>
> >>>
> >>> One of many OSD failing logs
> >>>
> >>> Sep 05 17:02:54 r72-k7-06-01.k8s.ash1.cloudsys.tmcs systemd[1]:
> >>> Started Ceph object storage daemon osd.12.
> >>> Sep 05 17:02:54 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]:
> >>> starting osd.12 at - osd_data /var/lib/ceph/osd/ceph-12
> >>> /var/lib/ceph/osd/ceph-12/journal
> >>> Sep 05 17:02:56 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]:
> >>> 2017-09-05 17:02:56.627301 7fe1a2e42d00 -1 osd.12 2219 log_to_monitors
> >>> {default=true}
> >>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]:
> >>> 2017-09-05 17:02:58.686723 7fe1871ac700 -1
> >>> bluestore(/var/lib/ceph/osd/ceph-12) _txc_add_transac
> >>> tion error (2) No such file or directory not handled on operation 15
> >>> (op 0, counting from 0)
> >>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]:
> >>> 2017-09-05 17:02:58.686742 7fe1871ac700 -1
> >>> bluestore(/var/lib/ceph/osd/ceph-12) unexpected error
> >>>  code
> >>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]:
> >>>
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/
> >>>
> centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc:
> >>> In function 'void BlueStore::_txc_add_transaction(Blu
> >>> eStore::TransContext*, ObjectStore::Transaction*)' thread 7fe1871ac700
> >>> time 2017-09-05 17:02:58.686821
> >>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]:
> >>>
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/
> >>>
> centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc:
> >>> 9282: FAILED assert(0 == "unexpected error")
> >>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]:
> >>> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
> >>> luminous (rc)
> >>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: 1:
> >>> (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >>> const*)+0x110) [0x7fe1a38bf510]
> >>> Sep 05 

Re: [ceph-users] Ceph health error (was: Prioritize recovery over backfilling)

2018-12-20 Thread Daniel K
Did you ever get anywhere with this?

I have 6 OSDs out of 36 continuously flapping with this error in the logs.

Thanks,

Dan

On Fri, Jun 8, 2018 at 11:10 AM Caspar Smit  wrote:

> Hi all,
>
> Maybe this will help:
>
> The issue is with shards 3,4 and 5 of PG 6.3f:
>
> LOG's of OSD's 16, 17 & 36 (the ones crashing on startup).
>
>
> *Log OSD.16 (shard 4):*
>
> 2018-06-08 08:35:01.727261 7f4c585e3700 -1
> bluestore(/var/lib/ceph/osd/ceph-16) _txc_add_transaction error (2) No such
> file or directory not handled on operation 30 (op 0, counting from 0)
> 2018-06-08 08:35:01.727273 7f4c585e3700 -1
> bluestore(/var/lib/ceph/osd/ceph-16) ENOENT on clone suggests osd bug
> 2018-06-08 08:35:01.727274 7f4c585e3700  0
> bluestore(/var/lib/ceph/osd/ceph-16)  transaction dump:
> {
> "ops": [
> {
> "op_num": 0,
> "op_name": "clonerange2",
> "collection": "6.3fs4_head",
> "src_oid":
> "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#903d0",
> "dst_oid":
> "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#",
> "src_offset": 950272,
> "len": 98304,
> "dst_offset": 950272
> },
> {
> "op_num": 1,
> "op_name": "remove",
> "collection": "6.3fs4_head",
> "oid":
> "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#903d0"
> },
> {
> "op_num": 2,
> "op_name": "setattrs",
> "collection": "6.3fs4_head",
> "oid":
> "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#",
> "attr_lens": {
> "_": 297,
> "hinfo_key": 18,
> "snapset": 35
> }
> },
> {
> "op_num": 3,
> "op_name": "clonerange2",
> "collection": "6.3fs4_head",
> "src_oid":
> "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#903cf",
> "dst_oid":
> "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#",
> "src_offset": 679936,
> "len": 274432,
> "dst_offset": 679936
> },
> {
> "op_num": 4,
> "op_name": "remove",
> "collection": "6.3fs4_head",
> "oid":
> "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#903cf"
> },
> {
> "op_num": 5,
> "op_name": "setattrs",
> "collection": "6.3fs4_head",
> "oid":
> "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.000312db:head#",
> "attr_lens": {
> "_": 297,
> "hinfo_key": 18,
> "snapset": 35
> }
> },
> {
> "op_num": 6,
> "op_name": "nop"
> },
> {
> "op_num": 7,
> "op_name": "op_omap_rmkeyrange",
> "collection": "6.3fs4_head",
> "oid": "4#6:fc00head#",
> "first": "011124.00590799",
> "last": "4294967295.18446744073709551615"
> },
> {
> "op_num": 8,
> "op_name": "omap_setkeys",
> "collection": "6.3fs4_head",
> "oid": "4#6:fc00head#",
> "attr_lens": {
> "_biginfo": 597,
> "_epoch": 4,
> "_info": 953,
> "can_rollback_to": 12,
> "rollback_info_trimmed_to": 12
> }
> }
> ]
> }
>
> 2018-06-08 08:35:01.730584 7f4c585e3700 -1
> /home/builder/source/ceph-12.2.2/src/os/bluestore/BlueStore.cc: In function
> 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> ObjectStore::Transaction*)' thread 7f4c585e3700 time 2018-06-08
> 08:35:01.727379
> /home/builder/source/ceph-12.2.2/src/os/bluestore/BlueStore.cc: 9363:
> FAILED assert(0 == "unexpected error")
>
>  ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x102) [0x558e08ba4202]
>  2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> ObjectStore::Transaction*)+0x15fa) [0x558e08a55c3a]
>  3: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> std::vector std::allocator >&,
> boost::intrusive_ptr, ThreadPool::TPHandle*)+0x546)
> [0x558e08a572a6]
>  4: (ObjectStore::queue_transaction(ObjectStore::Sequencer*,
> ObjectStore::Transaction&&, Context*, Context*, Context*,
> boost::intrusive_ptr, ThreadPool::TPHandle*)+0x14f)
> [0x558e085fa37f]
>  5: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*,
> ThreadPool::TPHandle*)+0x6c) [0x558e0857db5c]
>  6: (OSD::process_peering_events(std::__cxx11::list std::allocator > const&, ThreadPool::TPHandle&)+0x442) [0x558e085abec2]
>  7: (ThreadPool::BatchWorkQueue::_void_process(void*,
> 

[ceph-users] 12.2.5 multiple OSDs crashing

2018-12-19 Thread Daniel K
12.2.5 on Proxmox cluster.

6 nodes, about 50 OSDs, bluestore and cache tiering on an EC pool. Mostly
spinners with an SSD OSD drive and an SSD WAL/DB drive on each node. PM863
SSDs with ~75%+ endurance remaining.

Has been running relatively okay besides some spinner failures until I
checked today and found 5-6 OSDs flapping. I remember reading about some
issues with 12.2.5, so I upgraded one node to 12.2.10 but no change.

Seeing:
2018-12-20 00:27:42.754485 7f578f68a700 -1
bluestore(/var/lib/ceph/osd/ceph-33) _txc_add_transaction error (2) No such
file or directory not handled on operation 30 (op 0, counting from 0)
-3> 2018-12-20 00:27:42.754503 7f578f68a700 -1
bluestore(/var/lib/ceph/osd/ceph-33) ENOENT on clone suggests osd bug

in the logs for each of them.

I've found several bugs in the tracker related to these, but nothing with a
resolution I could apply besides upgrading, which doesn't appear to have
helped.
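
If it helps anyone suggest something, this is the kind of diagnostics I can
gather (a sketch; osd.33 is one of the flapping OSDs, paths are the defaults):

# with the OSD stopped, list the PGs it holds
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-33 --op list-pgs
# raise bluestore logging before the next start attempt
# (debug bluestore = 20 under [osd] in ceph.conf, then restart)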

Suggestions welcome.


Snippet of the last few lines:

rbd_data.17.afb3726b8b4567.000db0d8:head (bitwise)
local-lis/les=10138/10139 n=163802 ec=408/408 lis/c 15778/4121 les/c/f
15781/4127/0 59341/59343/15778)
[28,27,17,19,32,33,14,22,7,9,25,23]/[28,27,17,19,2147483647,23,2147483647,13,2147483647,9,32,33]p28(0)
r=-1 lpr=59343 pi=[4121,59343)/12 crt=15786'3211273 lcod 0'0 remapped
NOTIFY mbc={}] enter Started/ReplicaActive
-6> 2018-12-20 00:27:42.753106 7f578f68a700  5 osd.33 pg_epoch: 59344
pg[18.10s5( v 15786'3211273 (15781'3201224,15786'3211273] lb
18:0af1fd67:::rbd_data.17.afb3726b8b4567.000db0d8:head (bitwise)
local-lis/les=10138/10139 n=163802 ec=408/408 lis/c 15778/4121 les/c/f
15781/4127/0 59341/59343/15778)
[28,27,17,19,32,33,14,22,7,9,25,23]/[28,27,17,19,2147483647,23,2147483647,13,2147483647,9,32,33]p28(0)
r=-1 lpr=59343 pi=[4121,59343)/12 crt=15786'3211273 lcod 0'0 remapped
NOTIFY mbc={}] enter Started/ReplicaActive/RepNotRecovering
-5> 2018-12-20 00:27:42.753186 7f578f68a700  5 write_log_and_missing
with: dirty_to: 0'0, dirty_from: 15786'3211274, writeout_from:
4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
clear_divergent_priors: 0
-4> 2018-12-20 00:27:42.754485 7f578f68a700 -1
bluestore(/var/lib/ceph/osd/ceph-33) _txc_add_transaction error (2) No such
file or directory not handled on operation 30 (op 0, counting from 0)
-3> 2018-12-20 00:27:42.754503 7f578f68a700 -1
bluestore(/var/lib/ceph/osd/ceph-33) ENOENT on clone suggests osd bug
-2> 2018-12-20 00:27:42.754507 7f578f68a700  0
bluestore(/var/lib/ceph/osd/ceph-33)  transaction dump:
{
"ops": [
{
"op_num": 0,
"op_name": "clonerange2",
"collection": "18.10s5_head",
"src_oid":
"5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.002ad142:head#310014",
"dst_oid":
"5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.002ad142:head#",
"src_offset": 512000,
"len": 8192,
"dst_offset": 512000
},
{
"op_num": 1,
"op_name": "remove",
"collection": "18.10s5_head",
"oid":
"5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.002ad142:head#310014"
},
{
"op_num": 2,
"op_name": "setattrs",
"collection": "18.10s5_head",
"oid":
"5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.002ad142:head#",
"attr_lens": {
"_": 298,
"hinfo_key": 18,
"snapset": 35
}
},
{
"op_num": 3,
"op_name": "nop"
},
{
"op_num": 4,
"op_name": "op_omap_rmkeyrange",
"collection": "18.10s5_head",
"oid": "5#18:0800head#",
"first": "015786.03211274",
"last": "4294967295.18446744073709551615"
},
{
"op_num": 5,
"op_name": "omap_setkeys",
"collection": "18.10s5_head",
"oid": "5#18:0800head#",
"attr_lens": {
"_biginfo": 1646,
"_epoch": 4,
"_info": 1014,
"can_rollback_to": 12,
"rollback_info_trimmed_to": 12
}
}
]
}

-1> 2018-12-20 00:27:42.757231 7f5795696700  1 --
10.10.145.105:6801/29516 --> 10.10.145.100:6818/5468876 --
pg_info((query:59344 sent:59344 18.1es9( v 15786'3322304
(15786'3312225,15786'3322304] local-lis/les=59343/59344 n=163868 ec=408/408
lis/c 59343/3966 les/c/f 59344/3980/0 59341/59343/15773) 9->0)=(empty)
epoch 59344) v5 -- 0x55c2f7e7be00 con 0
 0> 2018-12-20 00:27:42.758519 7f578f68a700 -1
/home/builder/source/ceph-12.2.10/src/os/bluestore/BlueStore.cc: In
function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*,
ObjectStore::Transaction*)' thread 7f578f68a700 time 2018-12-20
00:27:42.754596
/home/builder/source/ceph-12.2.10/src/os/bluestore/BlueStore.cc: 

Re: [ceph-users] Ceph newbie(?) issues

2018-03-05 Thread Daniel K
I had a similar problem with some relatively underpowered servers (2x
E5-2603, 6 cores at 1.7GHz, no HT, 12-14 2TB OSDs per server, 32GB RAM).

There was a process on a couple of the servers that would hang and chew up
all available CPU. When that happened, I started getting scrub errors on
those servers.



On Mon, Mar 5, 2018 at 8:45 AM, Jan Marquardt  wrote:

> On 05.03.18 at 13:13, Ronny Aasen wrote:
> > I had some similar issues when I started my proof of concept; especially
> > the snapshot deletion I remember well.
> >
> > The rule of thumb for filestore, which I assume you are running, is 1GB RAM
> > per TB of OSD. So with 8 x 4TB OSDs you are looking at 32GB of RAM for
> > OSDs + some GBs for the mon service + some GBs for the OS itself.
> >
> > I suspect if you inspect your dmesg log and memory graphs you will find
> > that the out-of-memory killer ends your OSDs when the snap deletion (or
> > any other high-load task) runs.
> >
> > I ended up reducing the number of osd's per node, since the old
> > mainboard i used was maxed for memory.
>
> Well, thanks for the broad hint. Somehow I assumed we fulfill the
> recommendations, but of course you are right. We'll check if our boards
> support 48 GB RAM. Unfortunately, there are currently no corresponding
> messages. But I can't rule out that there haven't been any.
>
> > Corruptions occurred for me as well, and they were normally associated with
> > disks dying or giving read errors. Ceph often managed to fix them, but
> > sometimes I had to just remove the hurting OSD disk.
> >
> > Have some graphs to look at. Personally I used munin/munin-node since
> > it was just an apt-get away from functioning graphs.
> >
> > Also I used smartmontools to send me emails about hurting disks,
> > and smartctl to check all disks for errors.
>
> I'll check S.M.A.R.T stuff. I am wondering if scrubbing errors are
> always caused by disk problems or if they also could be triggered
> by flapping OSDs or other circumstances.
>
> > good luck with ceph !
>
> Thank you!


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-02 Thread Daniel K
There have been quite a few VMware/Ceph threads on the mailing list in the
past.

One setup I've been toying with is a linux guest running on the vmware host
on local storage, with the guest mounting a ceph RBD with a filesystem on
it, then exporting that via NFS to the VMWare host as a datastore.
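
Inside the guest that looks roughly like this (a sketch only -- the pool/image
names, mount point and export network are made up):

rbd map datastore-pool/vmware-img
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /export/vmware
echo '/export/vmware 10.0.0.0/24(rw,no_root_squash,sync)' >> /etc/exports
exportfs -ra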

Exporting CephFS via NFS to Vmware is another option.

I'm not sure how well shared storage will work with either of these
configurations, but they work fairly well for single-host deployments.

There are also quite a few products that do support iSCSI on Ceph. SUSE
Enterprise Storage is a commercial one; PetaSAN is an open-source option.


On Fri, Mar 2, 2018 at 2:24 AM, Joshua Chen 
wrote:

> Dear all,
>   I wonder how we could support VM systems with ceph storage (block
> device)? My colleagues are waiting for my answer for vmware (vSphere 5) and
> I myself use oVirt (RHEV). The default protocol is iSCSI.
>   I know that openstack/cinder work well with ceph and proxmox (just
> heard) too. But currently we are using vmware and ovirt.
>
>
> Your wise suggestion is appreciated
>
> Cheers
> Joshua
>
>
> On Thu, Mar 1, 2018 at 3:16 AM, Mark Schouten  wrote:
>
>> Does Xen still not support RBD? Ceph has been around for years now!
>>
>> With kind regards,
>>
>> --
>> Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
>> Mark Schouten | Tuxis Internet Engineering
>> KvK: 61527076 | http://www.tuxis.nl/
>> T: 0318 200208 | i...@tuxis.nl
>>
>>
>>
>> * From: * Massimiliano Cuttini 
>> * To: * "ceph-users@lists.ceph.com" 
>> * Sent: * 28-2-2018 13:53
>> * Subject: * [ceph-users] Ceph iSCSI is a prank?
>>
>> I was building Ceph in order to use it with iSCSI.
>> But I just see from the docs that it needs:
>>
>> *CentOS 7.5*
>> (which is not available yet, it's still at 7.4)
>> https://wiki.centos.org/Download
>>
>> *Kernel 4.17*
>> (which is not available yet, it is still at 4.15.7)
>> https://www.kernel.org/
>>
>> So I guess there is no official support and this is just a bad prank.
>>
>> Ceph has been ready to be used with S3 for many years,
>> but it needs the kernel of the next century to work with such an old
>> technology like iSCSI.
>> So sad.
>>
>>
>>
>>
>>


Re: [ceph-users] Added two OSDs, 10% of pgs went inactive

2017-12-21 Thread Daniel K
Caspar,

I found Nick Fisk's post yesterday
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023223.html
and set osd_max_pg_per_osd_hard_ratio = 4 in my ceph.conf on the OSDs and
restarted the 10TB OSDs. The PGs went back active and recovery is complete
now.
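
For anyone hitting the same thing, the change was just this in ceph.conf on
the OSD nodes, followed by restarting the affected OSDs (treat the value as
an example, not a general recommendation):

[osd]
osd max pg per osd hard ratio = 4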

My setup is similar to his in that there's a large difference in OSD size,
most are 1.8TB, but about 10% of them are 10TB.

The difference is that I had a functional Luminous cluster until I increased the
number of 10TB OSDs from 6 to 8. I'm still not sure why that caused *more* PGs
per OSD with the same pools.

Thanks!

Daniel


On Wed, Dec 20, 2017 at 10:23 AM, Caspar Smit <caspars...@supernas.eu>
wrote:

> Hi Daniel,
>
> I've had the same problem with creating a new 12.2.2 cluster where i
> couldn't get some pgs out of the "activating+remapped" status after i
> switched some OSD's from one chassis to another (there was no data on it
> yet).
>
> I tried restarting OSD's to no avail.
>
> Couldn't find anything about the stuck in "activating+remapped" state so
> in the end i threw away the pool and started over.
>
> Could this be a bug in 12.2.2 ?
>
> Kind regards,
> Caspar
>
> 2017-12-20 15:48 GMT+01:00 Daniel K <satha...@gmail.com>:
>
>> Just an update.
>>
>> Recovery completed but the PGS are still inactive.
>>
>> Still having a hard time understanding why adding OSDs caused this. I'm
>> on 12.2.2
>>
>> user@admin:~$ ceph -s
>>   cluster:
>> id: a3672c60-3051-440c-bd83-8aff7835ce53
>> health: HEALTH_WARN
>> Reduced data availability: 307 pgs inactive
>> Degraded data redundancy: 307 pgs unclean
>>
>>   services:
>> mon: 5 daemons, quorum stor585r2u8a,stor585r2u12a,sto
>> r585r2u16a,stor585r2u20a,stor585r2u24a
>> mgr: stor585r2u8a(active)
>> osd: 88 osds: 87 up, 87 in; 133 remapped pgs
>>
>>   data:
>> pools:   12 pools, 3016 pgs
>> objects: 387k objects, 1546 GB
>> usage:   3313 GB used, 186 TB / 189 TB avail
>> pgs: 10.179% pgs not active
>>  2709 active+clean
>>  174  activating
>>  133  activating+remapped
>>
>>   io:
>> client:   8436 kB/s rd, 935 kB/s wr, 140 op/s rd, 64 op/s wr
>>
>>
>> On Tue, Dec 19, 2017 at 8:57 PM, Daniel K <satha...@gmail.com> wrote:
>>
>>> I'm trying to understand why adding OSDs would cause pgs to go inactive.
>>>
>>> This cluster has 88 OSDs, and had 6 OSD with device class "hdd_10TB_7.2k"
>>>
>>> I added two more OSDs, set the device class to "hdd_10TB_7.2k" and 10%
>>> of pgs went inactive.
>>>
>>> I have an EC pool on these OSDs with the profile:
>>> user@admin:~$ ceph osd erasure-code-profile get ISA_10TB_7.2k_4.2
>>> crush-device-class=hdd_10TB_7.2k
>>> crush-failure-domain=host
>>> crush-root=default
>>> k=4
>>> m=2
>>> plugin=isa
>>> technique=reed_sol_van.
>>>
>>> some outputs of ceph health detail and ceph osd df
>>> user@admin:~$ ceph osd df |grep 10TB
>>> 76 hdd_10TB_7.2k 9.09509  1.0 9313G   349G 8963G 3.76 2.20 488
>>> 20 hdd_10TB_7.2k 9.09509  1.0 9313G   345G 8967G 3.71 2.17 489
>>> 28 hdd_10TB_7.2k 9.09509  1.0 9313G   344G 8968G 3.70 2.17 484
>>> 36 hdd_10TB_7.2k 9.09509  1.0 9313G   345G 8967G 3.71 2.17 484
>>> 87 hdd_10TB_7.2k 9.09560  1.0 9313G  8936M 9305G 0.09 0.05 311
>>> 86 hdd_10TB_7.2k 9.09560  1.0 9313G  8793M 9305G 0.09 0.05 304
>>>  6 hdd_10TB_7.2k 9.09509  1.0 9313G   344G 8969G 3.70 2.16 471
>>> 68 hdd_10TB_7.2k 9.09509  1.0 9313G   344G 8969G 3.70 2.17 480
>>> user@admin:~$ ceph health detail|grep inactive
>>> HEALTH_WARN 68287/1928007 objects misplaced (3.542%); Reduced data
>>> availability: 307 pgs inactive; Degraded data redundancy: 341 pgs unclean
>>> PG_AVAILABILITY Reduced data availability: 307 pgs inactive
>>> pg 24.60 is stuck inactive for 1947.792377, current state
>>> activating+remapped, last acting [36,20,76,6,68,28]
>>> pg 24.63 is stuck inactive for 1946.571425, current state
>>> activating+remapped, last acting [28,76,6,20,68,36]
>>> pg 24.71 is stuck inactive for 1947.625988, current state
>>> activating+remapped, last acting [6,68,20,36,28,76]
>>> pg 24.73 is stuck inactive for 1947.705250, current state
>>> activating+remapped, last acting [36,6,20,76,68,28]
>>> pg 24.74 is stuck inactive for 1947.828063, current state
>&g

Re: [ceph-users] Added two OSDs, 10% of pgs went inactive

2017-12-20 Thread Daniel K
Just an update.

Recovery completed but the PGS are still inactive.

Still having a hard time understanding why adding OSDs caused this. I'm on
12.2.2

user@admin:~$ ceph -s
  cluster:
id: a3672c60-3051-440c-bd83-8aff7835ce53
health: HEALTH_WARN
Reduced data availability: 307 pgs inactive
Degraded data redundancy: 307 pgs unclean

  services:
mon: 5 daemons, quorum
stor585r2u8a,stor585r2u12a,stor585r2u16a,stor585r2u20a,stor585r2u24a
mgr: stor585r2u8a(active)
osd: 88 osds: 87 up, 87 in; 133 remapped pgs

  data:
pools:   12 pools, 3016 pgs
objects: 387k objects, 1546 GB
usage:   3313 GB used, 186 TB / 189 TB avail
pgs: 10.179% pgs not active
 2709 active+clean
 174  activating
 133  activating+remapped

  io:
client:   8436 kB/s rd, 935 kB/s wr, 140 op/s rd, 64 op/s wr


On Tue, Dec 19, 2017 at 8:57 PM, Daniel K <satha...@gmail.com> wrote:

> I'm trying to understand why adding OSDs would cause pgs to go inactive.
>
> This cluster has 88 OSDs, and had 6 OSD with device class "hdd_10TB_7.2k"
>
> I added two more OSDs, set the device class to "hdd_10TB_7.2k" and 10% of
> pgs went inactive.
>
> I have an EC pool on these OSDs with the profile:
> user@admin:~$ ceph osd erasure-code-profile get ISA_10TB_7.2k_4.2
> crush-device-class=hdd_10TB_7.2k
> crush-failure-domain=host
> crush-root=default
> k=4
> m=2
> plugin=isa
> technique=reed_sol_van.
>
> some outputs of ceph health detail and ceph osd df
> user@admin:~$ ceph osd df |grep 10TB
> 76 hdd_10TB_7.2k 9.09509  1.0 9313G   349G 8963G 3.76 2.20 488
> 20 hdd_10TB_7.2k 9.09509  1.0 9313G   345G 8967G 3.71 2.17 489
> 28 hdd_10TB_7.2k 9.09509  1.0 9313G   344G 8968G 3.70 2.17 484
> 36 hdd_10TB_7.2k 9.09509  1.0 9313G   345G 8967G 3.71 2.17 484
> 87 hdd_10TB_7.2k 9.09560  1.0 9313G  8936M 9305G 0.09 0.05 311
> 86 hdd_10TB_7.2k 9.09560  1.0 9313G  8793M 9305G 0.09 0.05 304
>  6 hdd_10TB_7.2k 9.09509  1.0 9313G   344G 8969G 3.70 2.16 471
> 68 hdd_10TB_7.2k 9.09509  1.0 9313G   344G 8969G 3.70 2.17 480
> user@admin:~$ ceph health detail|grep inactive
> HEALTH_WARN 68287/1928007 objects misplaced (3.542%); Reduced data
> availability: 307 pgs inactive; Degraded data redundancy: 341 pgs unclean
> PG_AVAILABILITY Reduced data availability: 307 pgs inactive
> pg 24.60 is stuck inactive for 1947.792377, current state
> activating+remapped, last acting [36,20,76,6,68,28]
> pg 24.63 is stuck inactive for 1946.571425, current state
> activating+remapped, last acting [28,76,6,20,68,36]
> pg 24.71 is stuck inactive for 1947.625988, current state
> activating+remapped, last acting [6,68,20,36,28,76]
> pg 24.73 is stuck inactive for 1947.705250, current state
> activating+remapped, last acting [36,6,20,76,68,28]
> pg 24.74 is stuck inactive for 1947.828063, current state
> activating+remapped, last acting [68,36,28,20,6,76]
> pg 24.75 is stuck inactive for 1947.475644, current state
> activating+remapped, last acting [6,28,76,36,20,68]
> pg 24.76 is stuck inactive for 1947.712046, current state
> activating+remapped, last acting [20,76,6,28,68,36]
> pg 24.78 is stuck inactive for 1946.576304, current state
> activating+remapped, last acting [76,20,68,36,6,28]
> pg 24.7a is stuck inactive for 1947.820932, current state
> activating+remapped, last acting [36,20,28,68,6,76]
> pg 24.7b is stuck inactive for 1947.858305, current state
> activating+remapped, last acting [68,6,20,28,76,36]
> pg 24.7c is stuck inactive for 1947.753917, current state
> activating+remapped, last acting [76,6,20,36,28,68]
> pg 24.7d is stuck inactive for 1947.803229, current state
> activating+remapped, last acting [68,6,28,20,36,76]
> pg 24.7f is stuck inactive for 1947.792506, current state
> activating+remapped, last acting [28,20,76,6,68,36]
> pg 24.8a is stuck inactive for 1947.823189, current state
> activating+remapped, last acting [28,76,20,6,36,68]
> pg 24.8b is stuck inactive for 1946.579755, current state
> activating+remapped, last acting [76,68,20,28,6,36]
> pg 24.8c is stuck inactive for 1947.555872, current state
> activating+remapped, last acting [76,36,68,6,28,20]
> pg 24.8d is stuck inactive for 1946.589814, current state
> activating+remapped, last acting [36,6,28,76,68,20]
> pg 24.8e is stuck inactive for 1947.802894, current state
> activating+remapped, last acting [28,6,68,36,76,20]
> pg 24.8f is stuck inactive for 1947.528603, current state
> activating+remapped, last acting [76,28,6,68,20,36]
> pg 25.60 is stuck inactive for 1947.620823, current state activating,
> last acting [20,6,87,36,28,68]
> pg 25.61 is stuck inacti

[ceph-users] Added two OSDs, 10% of pgs went inactive

2017-12-19 Thread Daniel K
I'm trying to understand why adding OSDs would cause pgs to go inactive.

This cluster has 88 OSDs, and had 6 OSD with device class "hdd_10TB_7.2k"

I added two more OSDs, set the device class to "hdd_10TB_7.2k" and 10% of
pgs went inactive.
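
For clarity, the device class on the two new OSDs was set roughly like this
(a sketch; osd.86 and osd.87 are the new ones shown in the df output below):

ceph osd crush rm-device-class osd.86 osd.87
ceph osd crush set-device-class hdd_10TB_7.2k osd.86 osd.87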

I have an EC pool on these OSDs with the profile:
user@admin:~$ ceph osd erasure-code-profile get ISA_10TB_7.2k_4.2
crush-device-class=hdd_10TB_7.2k
crush-failure-domain=host
crush-root=default
k=4
m=2
plugin=isa
technique=reed_sol_van.

some outputs of ceph health detail and ceph osd df
user@admin:~$ ceph osd df |grep 10TB
76 hdd_10TB_7.2k 9.09509  1.0 9313G   349G 8963G 3.76 2.20 488
20 hdd_10TB_7.2k 9.09509  1.0 9313G   345G 8967G 3.71 2.17 489
28 hdd_10TB_7.2k 9.09509  1.0 9313G   344G 8968G 3.70 2.17 484
36 hdd_10TB_7.2k 9.09509  1.0 9313G   345G 8967G 3.71 2.17 484
87 hdd_10TB_7.2k 9.09560  1.0 9313G  8936M 9305G 0.09 0.05 311
86 hdd_10TB_7.2k 9.09560  1.0 9313G  8793M 9305G 0.09 0.05 304
 6 hdd_10TB_7.2k 9.09509  1.0 9313G   344G 8969G 3.70 2.16 471
68 hdd_10TB_7.2k 9.09509  1.0 9313G   344G 8969G 3.70 2.17 480
user@admin:~$ ceph health detail|grep inactive
HEALTH_WARN 68287/1928007 objects misplaced (3.542%); Reduced data
availability: 307 pgs inactive; Degraded data redundancy: 341 pgs unclean
PG_AVAILABILITY Reduced data availability: 307 pgs inactive
pg 24.60 is stuck inactive for 1947.792377, current state
activating+remapped, last acting [36,20,76,6,68,28]
pg 24.63 is stuck inactive for 1946.571425, current state
activating+remapped, last acting [28,76,6,20,68,36]
pg 24.71 is stuck inactive for 1947.625988, current state
activating+remapped, last acting [6,68,20,36,28,76]
pg 24.73 is stuck inactive for 1947.705250, current state
activating+remapped, last acting [36,6,20,76,68,28]
pg 24.74 is stuck inactive for 1947.828063, current state
activating+remapped, last acting [68,36,28,20,6,76]
pg 24.75 is stuck inactive for 1947.475644, current state
activating+remapped, last acting [6,28,76,36,20,68]
pg 24.76 is stuck inactive for 1947.712046, current state
activating+remapped, last acting [20,76,6,28,68,36]
pg 24.78 is stuck inactive for 1946.576304, current state
activating+remapped, last acting [76,20,68,36,6,28]
pg 24.7a is stuck inactive for 1947.820932, current state
activating+remapped, last acting [36,20,28,68,6,76]
pg 24.7b is stuck inactive for 1947.858305, current state
activating+remapped, last acting [68,6,20,28,76,36]
pg 24.7c is stuck inactive for 1947.753917, current state
activating+remapped, last acting [76,6,20,36,28,68]
pg 24.7d is stuck inactive for 1947.803229, current state
activating+remapped, last acting [68,6,28,20,36,76]
pg 24.7f is stuck inactive for 1947.792506, current state
activating+remapped, last acting [28,20,76,6,68,36]
pg 24.8a is stuck inactive for 1947.823189, current state
activating+remapped, last acting [28,76,20,6,36,68]
pg 24.8b is stuck inactive for 1946.579755, current state
activating+remapped, last acting [76,68,20,28,6,36]
pg 24.8c is stuck inactive for 1947.555872, current state
activating+remapped, last acting [76,36,68,6,28,20]
pg 24.8d is stuck inactive for 1946.589814, current state
activating+remapped, last acting [36,6,28,76,68,20]
pg 24.8e is stuck inactive for 1947.802894, current state
activating+remapped, last acting [28,6,68,36,76,20]
pg 24.8f is stuck inactive for 1947.528603, current state
activating+remapped, last acting [76,28,6,68,20,36]
pg 25.60 is stuck inactive for 1947.620823, current state activating,
last acting [20,6,87,36,28,68]
pg 25.61 is stuck inactive for 1947.883517, current state activating,
last acting [28,36,86,76,6,87]
pg 25.62 is stuck inactive for 1542089.552271, current state
activating, last acting [86,6,76,20,87,68]
pg 25.70 is stuck inactive for 1542089.729631, current state
activating, last acting [86,87,76,20,68,28]
pg 25.71 is stuck inactive for 1947.642271, current state activating,
last acting [28,86,68,20,6,36]
pg 25.75 is stuck inactive for 1947.825872, current state activating,
last acting [68,86,36,20,76,6]
pg 25.76 is stuck inactive for 1947.737307, current state activating,
last acting [36,87,28,6,68,76]
pg 25.77 is stuck inactive for 1947.218420, current state activating,
last acting [87,36,86,28,76,6]
pg 25.79 is stuck inactive for 1947.253871, current state activating,
last acting [6,36,86,28,68,76]
pg 25.7a is stuck inactive for 1542089.794085, current state
activating, last acting [86,36,68,20,76,87]
pg 25.7c is stuck inactive for 1947.666774, current state activating,
last acting [20,86,36,6,76,87]
pg 25.8a is stuck inactive for 1542089.687299, current state
activating, last acting [87,36,68,20,86,28]
pg 25.8c is stuck inactive for 1947.545965, current state activating,
last acting [76,6,28,87,36,86]
pg 25.8d is stuck inactive for 1947.213627, current state activating,
last acting 

Re: [ceph-users] Ceph re-ip of OSD node

2017-08-30 Thread Daniel K
Just curious -- why wouldn't it work as long as the IPs were reachable? Is
there something going on at layer 2 with Ceph that wouldn't survive a trip
across a router?



On Wed, Aug 30, 2017 at 1:52 PM, David Turner  wrote:

> ALL OSDs need to be running the same private network at the same time.
> ALL clients, RGW, OSD, MON, MGR, MDS, etc, etc need to be running on the
> same public network at the same time.  You cannot do this as a one at a
> time migration to the new IP space.  Even if all of the servers can still
> communicate via routing, it just won't work.  Changing the public/private
> network addresses for a cluster requires full cluster down time.
>
> On Wed, Aug 30, 2017 at 11:09 AM Ben Morrice  wrote:
>
>> Hello
>>
>> We have a small cluster that we need to move to a different network in
>> the same datacentre.
>>
>> My workflow was the following (for a single OSD host), but I failed
>> (further details below)
>>
>> 1) ceph osd set noout
>> 2) stop ceph-osd processes
>> 3) change IP, gateway, domain (short hostname is the same), VLAN
>> 4) change references of OLD IP (cluster and public network) in
>> /etc/ceph/ceph.conf with NEW IP (see [1])
>> 5) start a single OSD process
>>
>> This seems to work, as the NEW IP can communicate with mon hosts and osd
>> hosts on the OLD network; the OSD is booted and is visible via 'ceph -w'.
>> However, after a few seconds the OSD drops with messages such as the
>> below in its log file:
>>
>> heartbeat_check: no reply from 10.1.1.100:6818 osd.14 ever on either
>> front or back, first ping sent 2017-08-30 16:42:14.692210 (cutoff
>> 2017-08-30 16:42:24.962245)
>>
>> There are logs like the above for every OSD server/process
>>
>> and then eventually a
>>
>> 2017-08-30 16:42:14.486275 7f6d2c966700  0 log_channel(cluster) log
>> [WRN] : map e85351 wrongly marked me down
>>
>>
>> Am I missing something obvious to reconfigure the network on a OSD host?
>>
>>
>>
>> [1]
>>
>> OLD
>> [osd.0]
>> host = sn01
>> devs = /dev/sdi
>> cluster addr = 10.1.1.101
>> public addr = 10.1.1.101
>> NEW
>> [osd.0]
>> host = sn01
>> devs = /dev/sdi
>> cluster addr = 10.1.2.101
>> public addr = 10.1.2.101
>>
>> --
>> Kind regards,
>>
>> Ben Morrice
>>
>> __
>> Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
>> EPFL / BBP
>> Biotech Campus
>> Chemin des Mines 9
>> 1202 Geneva
>> Switzerland
>>


Re: [ceph-users] RBD encryption options?

2017-08-24 Thread Daniel K
Awesome -- I searched and all I could find was restricting access at the
pool level

I will investigate the dm-crypt/RBD path also.


Thanks again!

On Thu, Aug 24, 2017 at 7:40 PM, Alex Gorbachev <a...@iss-integration.com>
wrote:

>
> On Mon, Aug 21, 2017 at 9:03 PM Daniel K <satha...@gmail.com> wrote:
>
>> Are there any client-side options to encrypt an RBD device?
>>
>> Using latest luminous RC, on Ubuntu 16.04 and a 4.10 kernel
>>
>> I assumed adding client site encryption  would be as simple as using
>> luks/dm-crypt/cryptsetup after adding the RBD device to /etc/ceph/rbdmap
>> and enabling the rbdmap service -- but I failed to consider the order of
>> things loading and it appears that the RBD gets mapped too late for
>> dm-crypt to recognize it as valid.It just keeps telling me it's not a valid
>> LUKS device.
>>
>> I know you can run the OSDs on an encrypted drive, but I was hoping for
>> something client side since it's not exactly simple(as far as I can tell)
>> to restrict client access to a single(or group) of RBDs within a shared
>> pool.
>>
>
> Daniel, we used info from here for single or multiple RBD mappings to
> client
>
> https://blog-fromsomedude.rhcloud.com/2016/04/26/Allowing-a-RBD-client-to-map-only-one-RBD
>
> Also, I ran into the race condition with zfs, and wound up putting zfs and
> rbdmap into rc.local. It should work for dm-crypt as well.
>
> Regards,
> Alex
>
>
>
>> Any suggestions?
>>
>>
> --
> --
> Alex Gorbachev
> Storcium
>


[ceph-users] RBD encryption options?

2017-08-21 Thread Daniel K
Are there any client-side options to encrypt an RBD device?

Using latest luminous RC, on Ubuntu 16.04 and a 4.10 kernel

I assumed adding client-side encryption would be as simple as using
luks/dm-crypt/cryptsetup after adding the RBD device to /etc/ceph/rbdmap
and enabling the rbdmap service -- but I failed to consider the order of
things loading, and it appears that the RBD gets mapped too late for
dm-crypt to recognize it as valid. It just keeps telling me it's not a valid
LUKS device.
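
What I'm trying to end up with is essentially this sequence (a sketch; device,
image and mount names are just examples):

rbd map rbd/secure-vol
cryptsetup luksFormat /dev/rbd0      # one-time setup
cryptsetup luksOpen /dev/rbd0 secure-vol
mount /dev/mapper/secure-vol /mnt/secure

only run automatically at boot, which is where the ordering falls apart.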

I know you can run the OSDs on an encrypted drive, but I was hoping for
something client-side, since it's not exactly simple (as far as I can tell)
to restrict client access to a single RBD (or group of RBDs) within a shared
pool.

Any suggestions?


[ceph-users] implications of losing the MDS map

2017-08-07 Thread Daniel K
I finally figured out how to get the ceph-monstore-tool (compiled from
source) and am ready to attempt to recover my cluster.

I have one question -- in the instructions,
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/
under Recovery from OSDs, Known limitations:

->

   - *MDS Maps*: the MDS maps are lost.


What are the implications of this? Do I just need to rebuild this, or is
there a data loss component to it? -- Is my data stored in CephFS still
safe?


[ceph-users] ceph-monstore-tool missing in 12.1.1 on Xenial?

2017-07-30 Thread Daniel K
All 3 of my mons crashed while I was adding OSDs and now error out with:

 (/build/ceph-12.1.1/src/mon/OSDMonitor.cc: 3018: FAILED
assert(osdmap.get_up_osd_features() & CEPH_FEATURE_MON_STATEFUL_SUB)


I've resorted to just rebuilding the mon DB and making 3 new mon daemons,
using the steps here:
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/
under "Recovery using OSDs" but I am not finding the ceph-monstore-tool
anywhere.

Is there a different package I need to install or did this tool get
replaced with something else in Luminous?


[ceph-users] Client behavior when OSD is unreachable

2017-07-27 Thread Daniel K
Does the client track which OSDs are reachable? How does it behave if some
are not reachable?

For example:

Cluster network with all OSD hosts on a switch.
Public network with OSD hosts split between two switches, failure domain is
switch.

copies=3, so with a failure of one public switch, 1 copy would be reachable
by the client. Will the client know that it can't reach the OSDs on the failed
switch?

Well...thinking through this:
The mons communicate on the public network -- correct? So an unreachable
public network for some of the OSDs would cause them to be marked down,
which then the clients would know about.

Correct?


Re: [ceph-users] Ceph object recovery

2017-07-27 Thread Daniel K
So I'm not sure if this was the best or right way to do this but --

using rados I confirmed the unfound object was in the cephfs_data pool
# rados -p cephfs_data ls|grep 001c0ed4

using the osdmap tool I found the pg/osd the unfound object was in --
# osdmaptool --test-map-object 162.001c0ed4 osdmap
(previously exported osdmap to file "osdmap")

>  object '162.001c0ed4' -> 1.21 -> [4]

then told ceph to just delete the unfound object
ceph pg 1.21 mark_unfound_lost delete

and then used rados to put the object back (from the file I had extracted
previously)
# rados -p cephfs_data put 162.001c0ed4 162.001c0ed4.obj


Still have more recovery to do but this seems to have fixed my unfound
object problem.
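
To double-check, something like this should confirm the object is back and
nothing is still listed unfound (a sketch):

# rados -p cephfs_data stat 162.001c0ed4
# ceph health detail | grep -i unfound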


On Tue, Jul 25, 2017 at 12:54 PM, Daniel K <satha...@gmail.com> wrote:

> I did some bad things to my cluster, broke 5 OSDs and wound up with 1
> unfound object.
>
> I mounted one of the OSD drives, used ceph-objectstore-tool to find and
> exported the object:
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10
> 162.001c0ed4 get-bytes filename.obj
>
>
> What's the best way to bring this object back into the active cluster?
>
> Do I need to bring an OSD offline, mount it and do the reverse of the
> above command?
>
> Something like:
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22
> 162.001c0ed4 set-bytes filename.obj
>
> Is there some way to do this without bringing down an osd?
>
>
>
>


Re: [ceph-users] Can't start bluestore OSDs after sucessfully moving them 12.1.1 ** ERROR: osd init failed: (2) No such file or directory

2017-07-25 Thread Daniel K
Just some more info -- this happens also when I just restart an OSD that
*was* working -- it won't start back.

In the mon log I have these entries (which correspond to the OSDs that I've been
trying to start). osd.13 was working just now, before I stopped the service and
tried to start it again:

2017-07-25 14:42:49.249076 7f2386806700  0 cephx server osd.10: couldn't
find entity name: osd.10
2017-07-25 14:43:24.323603 7f2386806700  0 cephx server osd.13: couldn't
find entity name: osd.13
2017-07-25 14:43:25.033487 7f2386806700  0 cephx server osd.7: couldn't
find entity name: osd.7
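
Next I plan to check whether the mons actually have auth and crush entries for
those IDs, something like (a sketch):

ceph auth get osd.13
ceph osd find 13
ceph osd tree | grep osd.13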



Still reading and learning.


On Tue, Jul 25, 2017 at 2:38 PM, Daniel K <satha...@gmail.com> wrote:

> Update to this -- I tried building a new host and a new OSD, new disk, and
> I am having the same issue.
>
>
>
> I set osd debug level to 10 -- the issue looks like it's coming from a mon
> daemon. Still trying to learn enough about the internals of ceph to
> understand what's happening here.
>
> Relevant debug logs(I think)
>
>
> 2017-07-25 14:21:58.889016 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 1  mon_map magic: 0 v1  541+0+0
> (2831459213 0 0) 0x556640ecd900 con 0x556641949800
> 2017-07-25 14:21:58.889109 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 2  auth_reply(proto 2 0 (0) Success) v1 
> 33+0+0 (248727397 0 0) 0x556640ecdb80 con 0x556641949800
> 2017-07-25 14:21:58.889204 7f25a88af700  1 -- 10.0.15.142:6800/16150 -->
> 10.0.15.51:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x556640ecd400
> con 0
> 2017-07-25 14:21:58.889966 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 3  auth_reply(proto 2 0 (0) Success) v1 
> 206+0+0 (3141870879 0 0) 0x556640ecd400 con 0x556641949800
> 2017-07-25 14:21:58.890066 7f25a88af700  1 -- 10.0.15.142:6800/16150 -->
> 10.0.15.51:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x556640ecdb80
> con 0
> 2017-07-25 14:21:58.890759 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 4  auth_reply(proto 2 0 (0) Success) v1 
> 564+0+0 (1715764650 0 0) 0x556640ecdb80 con 0x556641949800
> 2017-07-25 14:21:58.890871 7f25a88af700  1 -- 10.0.15.142:6800/16150 -->
> 10.0.15.51:6789/0 -- mon_subscribe({monmap=0+}) v2 -- 0x556640e77680 con 0
> 2017-07-25 14:21:58.890901 7f25a88af700  1 -- 10.0.15.142:6800/16150 -->
> 10.0.15.51:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- 0x556640ecd400
> con 0
> 2017-07-25 14:21:58.891494 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 5  mon_map magic: 0 v1  541+0+0
> (2831459213 0 0) 0x556640ecde00 con 0x556641949800
> 2017-07-25 14:21:58.891555 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 6  auth_reply(proto 2 0 (0) Success) v1 
> 194+0+0 (1036670921 0 0) 0x556640ece080 con 0x556641949800
> 2017-07-25 14:21:58.892003 7f25b5e71c80 10 osd.7 0
> mon_cmd_maybe_osd_create cmd: {"prefix": "osd crush set-device-class",
> "class": "hdd", "ids": ["7"]}
> 2017-07-25 14:21:58.892039 7f25b5e71c80  1 -- 10.0.15.142:6800/16150 -->
> 10.0.15.51:6789/0 -- mon_command({"prefix": "osd crush set-device-class",
> "class": "hdd", "ids": ["7"]} v 0) v1 -- 0x556640e78d00 con 0
> *2017-07-25 14:21:58.894596 7f25a88af700  1 -- 10.0.15.142:6800/16150
> <http://10.0.15.142:6800/16150> <== mon.1 10.0.15.51:6789/0
> <http://10.0.15.51:6789/0> 7  mon_command_ack([{"prefix": "osd crush
> set-device-class", "class": "hdd", "ids": ["7"]}]=-2 (2) No such file or
> directory v10406) v1  133+0+0 (3400959855 0 0) 0x556640ece300 con
> 0x556641949800*
> 2017-07-25 14:21:58.894797 7f25b5e71c80  1 -- 10.0.15.142:6800/16150 -->
> 10.0.15.51:6789/0 -- mon_command({"prefix": "osd create", "id": 7,
> "uuid": "92445e4f-850e-453b-b5ab-569d1414f72d"} v 0) v1 -- 0x556640e79180
> con 0
> 2017-07-25 14:21:58.896301 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 8  mon_command_ack([{"prefix": "osd create",
> "id": 7, "uuid": "92445e4f-850e-453b-b5ab-569d1414f72d"}]=0  v10406) v1
>  115+0+2 (2540205126 0 1371665406) 0x556640ece580 con 0x556641949800
> 2017-07-25 14:21:58.896473 7f25b5e71c80 10 osd.7 0
> mon_cmd_maybe_osd_create cmd: {"prefix": "osd crush set-device-class",
> "class": "hdd", "ids": ["7"]}
> 2017-07-25 14:21:58.896516 7f25b5e71c80  1 -- 10.0.15.142:6800/16150 -->
> 10.0.15.51:6789

Re: [ceph-users] Can't start bluestore OSDs after sucessfully moving them 12.1.1 ** ERROR: osd init failed: (2) No such file or directory

2017-07-25 Thread Daniel K
Update to this -- I tried building a new host and a new OSD, new disk, and
I am having the same issue.



I set osd debug level to 10 -- the issue looks like it's coming from a mon
daemon. Still trying to learn enough about the internals of ceph to
understand what's happening here.

Relevant debug logs(I think)


2017-07-25 14:21:58.889016 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
mon.1 10.0.15.51:6789/0 1  mon_map magic: 0 v1  541+0+0 (2831459213
0 0) 0x556640ecd900 con 0x556641949800
2017-07-25 14:21:58.889109 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
mon.1 10.0.15.51:6789/0 2  auth_reply(proto 2 0 (0) Success) v1 
33+0+0 (248727397 0 0) 0x556640ecdb80 con 0x556641949800
2017-07-25 14:21:58.889204 7f25a88af700  1 -- 10.0.15.142:6800/16150 -->
10.0.15.51:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x556640ecd400
con 0
2017-07-25 14:21:58.889966 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
mon.1 10.0.15.51:6789/0 3  auth_reply(proto 2 0 (0) Success) v1 
206+0+0 (3141870879 0 0) 0x556640ecd400 con 0x556641949800
2017-07-25 14:21:58.890066 7f25a88af700  1 -- 10.0.15.142:6800/16150 -->
10.0.15.51:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x556640ecdb80
con 0
2017-07-25 14:21:58.890759 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
mon.1 10.0.15.51:6789/0 4  auth_reply(proto 2 0 (0) Success) v1 
564+0+0 (1715764650 0 0) 0x556640ecdb80 con 0x556641949800
2017-07-25 14:21:58.890871 7f25a88af700  1 -- 10.0.15.142:6800/16150 -->
10.0.15.51:6789/0 -- mon_subscribe({monmap=0+}) v2 -- 0x556640e77680 con 0
2017-07-25 14:21:58.890901 7f25a88af700  1 -- 10.0.15.142:6800/16150 -->
10.0.15.51:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- 0x556640ecd400 con
0
2017-07-25 14:21:58.891494 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
mon.1 10.0.15.51:6789/0 5  mon_map magic: 0 v1  541+0+0 (2831459213
0 0) 0x556640ecde00 con 0x556641949800
2017-07-25 14:21:58.891555 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
mon.1 10.0.15.51:6789/0 6  auth_reply(proto 2 0 (0) Success) v1 
194+0+0 (1036670921 0 0) 0x556640ece080 con 0x556641949800
2017-07-25 14:21:58.892003 7f25b5e71c80 10 osd.7 0 mon_cmd_maybe_osd_create
cmd: {"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]}
2017-07-25 14:21:58.892039 7f25b5e71c80  1 -- 10.0.15.142:6800/16150 -->
10.0.15.51:6789/0 -- mon_command({"prefix": "osd crush set-device-class",
"class": "hdd", "ids": ["7"]} v 0) v1 -- 0x556640e78d00 con 0
*2017-07-25 14:21:58.894596 7f25a88af700  1 -- 10.0.15.142:6800/16150
<http://10.0.15.142:6800/16150> <== mon.1 10.0.15.51:6789/0
<http://10.0.15.51:6789/0> 7  mon_command_ack([{"prefix": "osd crush
set-device-class", "class": "hdd", "ids": ["7"]}]=-2 (2) No such file or
directory v10406) v1  133+0+0 (3400959855 0 0) 0x556640ece300 con
0x556641949800*
2017-07-25 14:21:58.894797 7f25b5e71c80  1 -- 10.0.15.142:6800/16150 -->
10.0.15.51:6789/0 -- mon_command({"prefix": "osd create", "id": 7, "uuid":
"92445e4f-850e-453b-b5ab-569d1414f72d"} v 0) v1 -- 0x556640e79180 con 0
2017-07-25 14:21:58.896301 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
mon.1 10.0.15.51:6789/0 8  mon_command_ack([{"prefix": "osd create",
"id": 7, "uuid": "92445e4f-850e-453b-b5ab-569d1414f72d"}]=0  v10406) v1
 115+0+2 (2540205126 0 1371665406) 0x556640ece580 con 0x556641949800
2017-07-25 14:21:58.896473 7f25b5e71c80 10 osd.7 0 mon_cmd_maybe_osd_create
cmd: {"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]}
2017-07-25 14:21:58.896516 7f25b5e71c80  1 -- 10.0.15.142:6800/16150 -->
10.0.15.51:6789/0 -- mon_command({"prefix": "osd crush set-device-class",
"class": "hdd", "ids": ["7"]} v 0) v1 -- 0x556640e793c0 con 0
*2017-07-25 14:21:58.898180 7f25a88af700  1 -- 10.0.15.142:6800/16150
<http://10.0.15.142:6800/16150> <== mon.1 10.0.15.51:6789/0
<http://10.0.15.51:6789/0> 9  mon_command_ack([{"prefix": "osd crush
set-device-class", "class": "hdd", "ids": ["7"]}]=-2 (2) No such file or
directory v10406) v1  133+0+0 (3400959855 0 0) 0x556640ecd900 con
0x556641949800*
*2017-07-25 14:21:58.898276 7f25b5e71c80 -1 osd.7 0
mon_cmd_maybe_osd_create fail: '(2) No such file or directory': (2) No such
file or directory*
2017-07-25 14:21:58.898380 7f25b5e71c80  1 -- 10.0.15.142:6800/16150 >>
10.0.15.51:6789/0 conn(0x556641949800 :-1 s=STATE_OPEN pgs=367879 cs=1
l=1).mark_down




On Mon, Jul 24, 2017 at 1:33 PM, Daniel K <satha...@gmail.com> w

[ceph-users] Ceph object recovery

2017-07-25 Thread Daniel K
I did some bad things to my cluster, broke 5 OSDs and wound up with 1
unfound object.

I mounted one of the OSD drives and used ceph-objectstore-tool to find and
export the object:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10
162.001c0ed4 get-bytes filename.obj


What's the best way to bring this object back into the active cluster?

Do I need to bring an OSD offline, mount it and do the reverse of the above
command?

Something like:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22
162.001c0ed4 set-bytes filename.obj

Is there some way to do this without bringing down an osd?
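
If not, my working plan is roughly the following -- untested, and it assumes
osd.22 is in the acting set for that PG:

ceph osd set noout
systemctl stop ceph-osd@22
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 162.001c0ed4 set-bytes filename.obj
systemctl start ceph-osd@22
ceph osd unset noout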
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Can't start bluestore OSDs after successfully moving them 12.1.1 ** ERROR: osd init failed: (2) No such file or directory

2017-07-24 Thread Daniel K
List --

I have a 4-node cluster running on baremetal and have a need to use the
kernel client on 2 nodes. As I read you should not run the kernel client on
a node that runs an OSD daemon, I decided to move the OSD daemons into a VM
on the same device.

Original host is stor-vm2 (bare metal), new host is stor-vm2a (virtual).

All went well -- I did these steps (for each OSD, 5 total per host):

- set up the VM
- installed the OS
- installed ceph (using ceph-deploy)
- set noout
- stopped the ceph OSD on the bare-metal host
- unmounted /dev/sdb1 from /var/lib/ceph/osd/ceph-0
- added /dev/sdb to the VM
- ceph detected the OSD and started it automatically
- moved the VM host to the same bucket as the physical host in the crushmap
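
In command terms each move was roughly the following; the disk-attach step
obviously depends on the hypervisor:

ceph osd set noout
systemctl stop ceph-osd@0
umount /var/lib/ceph/osd/ceph-0
# detach /dev/sdb from the bare-metal host and attach it to the VM
# (hypervisor-specific); inside the VM the OSD is detected and started automatically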

I did this for each OSD, and despite some recovery IO because of the
updated crushmap, all OSDs were up.

I rebooted the physical host, which rebooted the VM, and now the OSDs are
refusing to start.

I've tried moving them back to the bare metal host with the same results.

Any ideas?

Here are what seem to be the relevant osd log lines:

2017-07-24 13:21:53.561265 7faf1752fc80  0 osd.10 8854 crush map has
features 2200130813952, adjusting msgr requires for clients
2017-07-24 13:21:53.561284 7faf1752fc80  0 osd.10 8854 crush map has
features 2200130813952 was 8705, adjusting msgr requires for mons
2017-07-24 13:21:53.561298 7faf1752fc80  0 osd.10 8854 crush map has
features 720578140510109696, adjusting msgr requires for osds
2017-07-24 13:21:55.626834 7faf1752fc80  0 osd.10 8854 load_pgs
2017-07-24 13:22:20.970222 7faf1752fc80  0 osd.10 8854 load_pgs opened 536
pgs
2017-07-24 13:22:20.972659 7faf1752fc80  0 osd.10 8854 using
weightedpriority op queue with priority op cut off at 64.
2017-07-24 13:22:20.976861 7faf1752fc80 -1 osd.10 8854 log_to_monitors
{default=true}
2017-07-24 13:22:20.998233 7faf1752fc80 -1 osd.10 8854
mon_cmd_maybe_osd_create fail: '(2) No such file or directory': (2) No such
file or directory
2017-07-24 13:22:20.999165 7faf1752fc80  1
bluestore(/var/lib/ceph/osd/ceph-10) umount
2017-07-24 13:22:21.016146 7faf1752fc80  1 freelist shutdown
2017-07-24 13:22:21.016243 7faf1752fc80  4 rocksdb:
[/build/ceph-12.1.1/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all
background work
2017-07-24 13:22:21.020440 7faf1752fc80  4 rocksdb:
[/build/ceph-12.1.1/src/rocksdb/db/db_impl.cc:343] Shutdown complete
2017-07-24 13:22:21.274481 7faf1752fc80  1 bluefs umount
2017-07-24 13:22:21.275822 7faf1752fc80  1 bdev(0x558bb1f82d80
/var/lib/ceph/osd/ceph-10/block) close
2017-07-24 13:22:21.485226 7faf1752fc80  1 bdev(0x558bb1f82b40
/var/lib/ceph/osd/ceph-10/block) close
2017-07-24 13:22:21.551009 7faf1752fc80 -1  ** ERROR: osd init failed: (2)
No such file or directory
2017-07-24 13:22:21.563567 7faf1752fc80 -1
/build/ceph-12.1.1/src/common/HeartbeatMap.cc: In function
'ceph::HeartbeatMap::~HeartbeatMap()' thread 7faf1752fc80 time 2017-07-24
13:22:21.558275
/build/ceph-12.1.1/src/common/HeartbeatMap.cc: 39: FAILED
assert(m_workers.empty())

 ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous
(rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x558ba6ba6b72]
 2: (()+0xb81cf1) [0x558ba6cc0cf1]
 3: (CephContext::~CephContext()+0x4d9) [0x558ba6ca77b9]
 4: (CephContext::put()+0xe6) [0x558ba6ca7ab6]
 5: (main()+0x563) [0x558ba650df73]
 6: (__libc_start_main()+0xf0) [0x7faf14999830]
 7: (_start()+0x29) [0x558ba6597cf9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.

--- begin dump of recent events ---
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph recovery incomplete PGs on Luminous RC

2017-07-24 Thread Daniel K
I was able to export the PGs using the ceph-object-store tool and import
them to the new OSDs.

I moved some other OSDs from the bare metal on a node into a virtual
machine on the same node and was surprised at how easy it was: install ceph
in the VM (using ceph-deploy), stop the OSD and dismount the OSD drive from
the physical machine, then mount it in the VM -- the OSD was auto-detected,
and the ceph-osd process started automatically and was up within a few seconds.

I'm having a different problem now that I will make a separate message
about.

Thanks!


On Mon, Jul 24, 2017 at 12:52 PM, Gregory Farnum <gfar...@redhat.com> wrote:

>
> On Fri, Jul 21, 2017 at 10:23 PM Daniel K <satha...@gmail.com> wrote:
>
>> Luminous 12.1.0(RC)
>>
>> I replaced two OSD drives(old ones were still good, just too small),
>> using:
>>
>> ceph osd out osd.12
>> ceph osd crush remove osd.12
>> ceph auth del osd.12
>> systemctl stop ceph-osd@osd.12
>> ceph osd rm osd.12
>>
>> I later found that I also should have unmounted it from
>> /var/lib/ceph/osd-12
>>
>> (remove old disk, insert new disk)
>>
>> I added the new disk/osd with ceph-deploy osd prepare stor-vm3:sdg
>> --bluestore
>>
>> This automatically activated the osd (not sure why, I thought it needed a
>> ceph-deploy osd activate as well)
>>
>>
>> Then, working on an unrelated issue, I upgraded one (out of 4 total)
>> nodes to 12.1.1 using apt and rebooted.
>>
>> The mon daemon would not form a quorum with the others on 12.1.0, so,
>> instead of troubleshooting that, I just went ahead and upgraded the other 3
>> nodes and rebooted.
>>
>> Lots of recovery IO went on afterwards, but now things have stopped at:
>>
>> pools:   10 pools, 6804 pgs
>> objects: 1784k objects, 7132 GB
>> usage:   11915 GB used, 19754 GB / 31669 GB avail
>> pgs: 0.353% pgs not active
>>  70894/2988573 objects degraded (2.372%)
>>  422090/2988573 objects misplaced (14.123%)
>>  6626 active+clean
>>  129  active+remapped+backfill_wait
>>  23   incomplete
>>  14   active+undersized+degraded+remapped+backfill_wait
>>  4    active+undersized+degraded+remapped+backfilling
>>  4    active+remapped+backfilling
>>  2    active+clean+scrubbing+deep
>>  1    peering
>>  1    active+recovery_wait+degraded+remapped
>>
>>
>> when I run ceph pg query on the incompletes, they all list at least one
>> of the two removed OSDs(12,17) in "down_osds_we_would_probe"
>>
>> most pools are size:2 min_size 1(trusting bluestore to tell me which one
>> is valid). One pool is size:1 min size:1 and I'm okay with losing it,
>> except I had it mounted in a directory on cephfs, I rm'd the directory but
>> I can't delete the pool because it's "in use by CephFS"
>>
>>
>> I still have the old drives, can I stick them into another host and
>> re-add them somehow?
>>
>
> Yes, that'll probably be your easiest solution. You may have some trouble
> because you already deleted them, but I'm not sure.
>
> Alternatively, you ought to be able to remove the pool from CephFS using
> some of the monitor commands and then delete it.
>
>
>> This data isn't super important, but I'd like to learn a bit on how to
>> recover when bad things happen as we are planning a production deployment
>> in a couple of weeks.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dealing with incomplete PGs while using bluestore

2017-07-22 Thread Daniel K
I am in the process of doing exactly what you are -- this worked for me:

1. mount the first partition of the bluestore drive that holds the missing
PGs (if it's not already mounted)
> mkdir /mnt/tmp
> mount /dev/sdb1 /mnt/tmp


2. export the pg to a suitable temporary storage location:
> ceph-objectstore-tool --data-path /mnt/tmp --pgid 1.24 --op export --file
/mnt/sdd1/recover.1.24

3. find the acting osd
> ceph health detail |grep incomplete

PG_DEGRADED Degraded data redundancy: 23 pgs unclean, 23 pgs incomplete
pg 1.24 is incomplete, acting [18,13]
pg 4.1f is incomplete, acting [11]
...
4. set noout
> ceph osd set noout

5. Find the OSD and log into it -- I used 18 here.
> ceph osd find 18
{
"osd": 18,
"ip": "10.0.15.54:6801/9263",
"crush_location": {
"building": "building-dc",
"chassis": "chassis-dc400f5-10",
"city": "city",
"floor": "floor-dc4",
"host": "stor-vm4",
"rack": "rack-dc400f5",
"region": "cfl",
"room": "room-dc400",
"root": "default",
"row": "row-dc400f"
}
}

> ssh user@10.0.15.54

6. copy the file to somewhere accessible by the new (acting) osd
> scp user@10.0.14.51:/mnt/sdd1/recover.1.24 /tmp/recover.1.24

7. stop the osd
> service ceph-osd@18 stop

8. import the file using ceph-objectstore-tool
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-18 --op import
--file /tmp/recover.1.24

9. start the osd
> service-osd@18 start

This worked for me -- not sure if this is the best way or if I took any
extra steps, and I have yet to validate that the data is good.

I based this partially off your original email, and the guide here
http://ceph.com/geen-categorie/incomplete-pgs-oh-my/
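
When I do get around to validating, my rough plan (untested) is along these
lines:

> ceph pg 1.24 query
> ceph pg deep-scrub 1.24
> ceph health detail | grep 1.24

and then a 'ceph osd unset noout' once everything has settled and the deep
scrub comes back clean.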






On Sat, Jul 22, 2017 at 4:46 PM, mofta7y  wrote:

> Hi All,
>
> I have a situation here.
>
> I have an EC pool that has a cache tier pool (the cache tier is
> replicated with size 2).
>
> Had an issue on the pool, and the crush map got changed after rebooting
> some OSDs; in any case I lost 4 cache tier OSDs.
>
> those lost OSDs are not really lost -- they look fine to me -- but bluestore
> gives me an exception when starting them that I can't deal with. (will open a
> question about that exception as well)
>
> So now I have 14 incomplete PGs on the caching tier.
>
>
> I am trying to recover them using ceph-objectstore-tool
>
> the extraction and import work nicely with no issues, but the OSD fails to
> start afterwards with the same issue as the original OSD.
>
> after importing the PG on the acting OSD, I get the exact same exception I
> was getting while trying to start the failed OSD
>
> removing that import resolves the issue.
>
>
> So the question is how I can use ceph-objectstore-tool to import into
> bluestore, as I think I am missing something here
>
>
> here is the procedure and the steps I used:
>
> 1- stop old osd (it cannot start anyway)
>
> 2- use this command to extract the pg I need
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-116 --pgid 15.371
> --op export --file /tmp/recover.15.371
>
> that command works
>
> 3- check what is the acting OSD for the pg
>
> 4- stop the acting OSD
>
> 5- delete the current folder with the same pg name
>
> 6- use this command
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-78  --op import
> /tmp/recover.15.371
> the error I got in both cases is this bluestore error:
>
> Jul 22 16:35:20 alm9 ceph-osd[3799171]:   -257> 2017-07-22 16:20:19.544195
> 7f7157036a40 -1 osd.116 119691 log_to_monitors {default=true}
> Jul 22 16:35:20 alm9 ceph-osd[3799171]:  0> 2017-07-22 16:35:20.142143
> 7f713c597700 -1 /tmp/buildd/ceph-11.2.0/src/os/bluestore/BitMapAllocator.cc:
> In function 'virtual int BitMapAllocator::reserve(uint64_t)' thread
> 7f713c597700 time 2017-07-22 16:35:20.139309
> Jul 22 16:35:20 alm9 ceph-osd[3799171]: 
> /tmp/buildd/ceph-11.2.0/src/os/bluestore/BitMapAllocator.cc:
> 82: FAILED assert(!(need % m_block_size))
> Jul 22 16:35:20 alm9 ceph-osd[3799171]:  ceph version 11.2.0
> (f223e27eeb35991352ebc1f67423d4ebc252adb7)
> Jul 22 16:35:20 alm9 ceph-osd[3799171]:  1: (ceph::__ceph_assert_fail(char
> const*, char const*, int, char const*)+0x80) [0x562b84558380]
> Jul 22 16:35:20 alm9 ceph-osd[3799171]:  2: (BitMapAllocator::reserve(unsigned
> long)+0x2ab) [0x562b8437c5cb]
> Jul 22 16:35:20 alm9 ceph-osd[3799171]:  3: (BlueFS::reclaim_blocks(unsigned
> int, unsigned long, std::vector mempool::pool_allocator<(mempool::pool_index_t)7,
> AllocExtent> >*)+0x22a) [0x562b8435109a]
> Jul 22 16:35:20 alm9 ceph-osd[3799171]:  4: (BlueStore::_balance_bluefs_fr
> eespace(std::vector >*)+0x28e) [0x562b84270dae]
> Jul 22 16:35:20 alm9 ceph-osd[3799171]:  5: 
> (BlueStore::_kv_sync_thread()+0x164a)
> [0x562b84273eea]
> Jul 22 16:35:20 alm9 ceph-osd[3799171]:  6: 
> (BlueStore::KVSyncThread::entry()+0xd)
> [0x562b842ad9dd]
> Jul 22 16:35:20 alm9 ceph-osd[3799171]:  7: (()+0x76ba) 

[ceph-users] ceph recovery incomplete PGs on Luminous RC

2017-07-21 Thread Daniel K
Luminous 12.1.0(RC)

I replaced two OSD drives(old ones were still good, just too small), using:

ceph osd out osd.12
ceph osd crush remove osd.12
ceph auth del osd.12
systemctl stop ceph-osd@osd.12
ceph osd rm osd.12

I later found that I also should have unmounted it from /var/lib/ceph/osd-12
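
In other words, the removal probably should have looked more like this -- just
a sketch of what I now think the right order is:

ceph osd out osd.12
systemctl stop ceph-osd@12
umount /var/lib/ceph/osd/ceph-12
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm osd.12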

(remove old disk, insert new disk)

I added the new disk/osd with ceph-deploy osd prepare stor-vm3:sdg
--bluestore

This automatically activated the osd (not sure why, I thought it needed a
ceph-deploy osd activate as well)


Then, working on an unrelated issue, I upgraded one (out of 4 total) nodes
to 12.1.1 using apt and rebooted.

The mon daemon would not form a quorum with the others on 12.1.0, so,
instead of troubleshooting that, I just went ahead and upgraded the other 3
nodes and rebooted.

Lots of recovery IO went on afterwards, but now things have stopped at:

pools:   10 pools, 6804 pgs
objects: 1784k objects, 7132 GB
usage:   11915 GB used, 19754 GB / 31669 GB avail
pgs: 0.353% pgs not active
 70894/2988573 objects degraded (2.372%)
 422090/2988573 objects misplaced (14.123%)
 6626 active+clean
 129  active+remapped+backfill_wait
 23   incomplete
 14   active+undersized+degraded+remapped+backfill_wait
 4    active+undersized+degraded+remapped+backfilling
 4    active+remapped+backfilling
 2    active+clean+scrubbing+deep
 1    peering
 1    active+recovery_wait+degraded+remapped


when I run ceph pg query on the incompletes, they all list at least one of
the two removed OSDs(12,17) in "down_osds_we_would_probe"

most pools are size:2 min_size 1(trusting bluestore to tell me which one is
valid). One pool is size:1 min size:1 and I'm okay with losing it, except I
had it mounted in a directory on cephfs, I rm'd the directory but I can't
delete the pool because it's "in use by CephFS"


I still have the old drives, can I stick them into another host and re-add
them somehow?

This data isn't super important, but I'd like to learn a bit on how to
recover when bad things happen as we are planning a production deployment
in a couple of weeks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] how to map rbd using rbd-nbd on boot?

2017-07-21 Thread Daniel K
Once again my google-fu has failed me and I can't find the 'correct' way to
map an rbd using rbd-nbd on boot. Everything takes me to rbdmap, which
isn't using rbd-nbd.

If someone could just point me in the right direction I'd appreciate it.
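
Absent a stock mechanism, I assume a small systemd unit would do it --
untested sketch, pool/image names are placeholders, and I haven't worked out
ExecStop since the nbd device number isn't fixed:

[Unit]
Description=Map an RBD image with rbd-nbd
After=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/rbd-nbd map mypool/myimage

[Install]
WantedBy=multi-user.target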


Thanks!

Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd-fuse performance

2017-06-28 Thread Daniel K
thank you!

On Wed, Jun 28, 2017 at 11:48 AM, Mykola Golub <mgo...@mirantis.com> wrote:

> On Tue, Jun 27, 2017 at 07:17:22PM -0400, Daniel K wrote:
>
> > rbd-nbd isn't good as it stops at 16 block devices (/dev/nbd0-15)
>
> modprobe nbd nbds_max=1024
>
> Or, if nbd module is loaded by rbd-nbd, use --nbds_max command line
> option.
>
> --
> Mykola Golub
>
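
To make that stick across reboots I assume the usual modprobe.d drop-in would
work, e.g.:

echo "options nbd nbds_max=1024" > /etc/modprobe.d/nbd.conf

(untested on my end).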
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd-fuse performance

2017-06-27 Thread Daniel K
Hi,

As mentioned in my previous emails, I'm extremely new to ceph, so please
forgive my lack of knowledge.

I'm trying to find a good way to mount ceph rbd images for export by
LIO/targetcli

rbd-nbd isn't good as it stops at 16 block devices (/dev/nbd0-15)

kernel rbd mapping doesn't have support for new features.

I thought rbd-fuse looked good, except write performance is abysmal.

rados bench gives me ~250MB/s of write speed. an image mounted with
rbd-fuse gives me ~2MB/s of write speed. CephFS write speeds are good as
well.

Is something wrong with my testing method or configuration?


root@stor-vm1:/# ceph osd pool create rbd_storage 128 128
root@stor-vm1:/# rbd create --pool=rbd_storage --size=25G rbd_25g1
root@stor-vm1:/# mkdir /mnt/rbd
root@stor-vm1:/# cd /mnt
root@stor-vm1:/# rbd-fuse rbd -p rbd_storage
root@stor-vm1:/# cd rbd
root@stor-vm1:/# dd if=/dev/zero of=rbd_25g1 bs=4M count=2 status=progress
8388608 bytes (8.4 MB, 8.0 MiB) copied, 4.37754 s, 1.9 MB/s
2+0 records in
2+0 records out
8388608 bytes (8.4 MB, 8.0 MiB) copied, 4.3776 s, 1.9 MB/s
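
(In hindsight an 8MB write is probably too small to tell much; a somewhat
fairer test would likely be a longer run that forces a flush at the end,
something like

dd if=/dev/zero of=rbd_25g1 bs=4M count=256 conv=fdatasync status=progress

though the gap versus rados bench would still need explaining.)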


rados bench:

root@stor-vm1:/mnt/rbd# rados bench -p rbd_storage 10 write
2017-06-27 18:56:59.505647 7fb9c24a7e00 -1 WARNING: the following dangerous
and experimental features are enabled: bluestore
2017-06-27 18:56:59.505768 7fb9c24a7e00 -1 WARNING: the following dangerous
and experimental features are enabled: bluestore
2017-06-27 18:56:59.507385 7fb9c24a7e00 -1 WARNING: the following dangerous
and experimental features are enabled: bluestore
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size
4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_stor-vm1_8786
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16        63        47   187.989       188     0.620617    0.285428
    2      16       134       118   235.976       284     0.195319    0.250789
    3      16       209       193   257.306       300     0.198448    0.239798
    4      16       282       266   265.972       292     0.232927    0.233386
    5      16       362       346   276.771       320     0.222398    0.226373
    6      16       429       413   275.303       268     0.193111    0.226703
    7      16       490       474   270.828       244    0.0879974    0.228776
    8      16       562       546    272.97       288     0.125843    0.230455
    9      16       625       609   270.637       252     0.145847    0.232388
   10      16       701       685    273.97       304     0.411055    0.230831
Total time run: 10.161789
Total writes made:  702
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 276.329
Stddev Bandwidth:   38.1925
Max bandwidth (MB/sec): 320
Min bandwidth (MB/sec): 188
Average IOPS:   69
Stddev IOPS:9
Max IOPS:   80
Min IOPS:   47
Average Latency(s): 0.231391
Stddev Latency(s):  0.107305
Max latency(s): 0.774406
Min latency(s): 0.0828756
Cleaning up (deleting benchmark objects)
Removed 702 objects
Clean up completed and total clean up time :1.190687




Thanks,

Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous/Bluestore compression documentation

2017-06-27 Thread Daniel K
Is there anywhere that details the various compression settings for
bluestore backed pools?

I can see compression in the list of options when I run ceph osd pool set,
but can't find anything that details what valid settings are.

I've tried discovering the options via the command line utilities and via
google and have failed at both.

Thanks,
Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osds exist in the crush map but not in the osdmap after kraken > luminous rc1 upgrade

2017-06-27 Thread Daniel K
Well that was simple.

In the process of preparing the decompiled crush map, ceph status, and ceph
osd tree for posting, I noticed that those two OSDs -- 5 & 11 -- didn't
exist, which explains it. I removed them from the crushmap and all is well now.
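
For the archives, the edit cycle was the usual decompile/recompile dance,
roughly:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
(delete the "device 5 device5" / "device 11 device11" lines)
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new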

Nothing changed in the config from kraken to luminous, so I guess kraken
just didn't have a health check for that problem.


Thanks for the help!


Dan



On Tue, Jun 27, 2017 at 2:18 PM, David Turner <drakonst...@gmail.com> wrote:

> Can you post your decompiled crush map, ceph status, ceph osd tree, etc?
> Something will allow what the extra stuff is and the easiest way to remove
> it.
>
> On Tue, Jun 27, 2017, 12:12 PM Daniel K <satha...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm extremely new to ceph and have a small 4-node/20-osd cluster.
>>
>> I just upgraded from kraken to luminous without much ado, except now when
>> I run ceph status, I get a health_warn because "2 osds exist in the crush
>> map but not in the osdmap"
>>
>> Googling the error message only took me to the source file on github
>>
>> I tried exporting and decompiling  the crushmap -- there were two osd
>> devices named differently. The normal name would be something like
>>
>> device 0 osd.0
>> device 1 osd.1
>>
>> but two were named:
>>
>> device 5 device5
>> device 11 device11
>>
>> I had edited the crushmap in the past, so it's possible this was
>> introduced by me.
>>
>> I tried changing those to match the rest, recompiling and setting the
>> crushmap, but ceph status still complains.
>>
>> Any assistance would be greatly appreciated.
>>
>> Thanks,
>> Dan
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osds exist in the crush map but not in the osdmap after kraken > luminous rc1 upgrade

2017-06-27 Thread Daniel K
Hi,

I'm extremely new to ceph and have a small 4-node/20-osd cluster.

I just upgraded from kraken to luminous without much ado, except now when I
run ceph status, I get a health_warn because "2 osds exist in the crush map
but not in the osdmap"

Googling the error message only took me to the source file on github

I tried exporting and decompiling  the crushmap -- there were two osd
devices named differently. The normal name would be something like

device 0 osd.0
device 1 osd.1

but two were named:

device 5 device5
device 11 device11

I had edited the crushmap in the past, so it's possible this was introduced
by me.

I tried changing those to match the rest, recompiling and setting the
crushmap, but ceph status still complains.

Any assistance would be greatly appreciated.

Thanks,
Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] design guidance

2017-06-06 Thread Daniel K
I started down that path and got so deep that I couldn't even find where I
went in. I couldn't make heads or tails out of what would or wouldn't work.

We didn't need multiple hosts accessing a single datastore, so on the
client side I just have a single VM guest running on each ESXi host, with
the cephfs filesystem mounted on it (via a 10Gb connection to the ceph
environment), then exported via NFS on a host-only network and mounted
on the host.

Not quite as redundant as it could be, but good enough for our usage. I'm
seeing ~500MB/s speeds going to a 4-node cluster with 5x1TB 7200rpm drives.
I tried it first, in a similar config, except using LIO to export an RBD
device via iSCSI, still on  the local host network. Write performance was
good, but read performance was only around 120MB/s. I didn't do much
troubleshooting, just tried NFS after that and was happy with it.
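
Roughly, the plumbing on each gateway VM looks like the following -- just a
sketch, with the host-only subnet and secret-file path made up:

mount -t ceph 10.0.15.51:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
echo '/mnt/cephfs 192.168.100.0/24(rw,sync,no_root_squash,no_subtree_check)' >> /etc/exports
exportfs -ra

and then 192.168.100.x:/mnt/cephfs gets added as an NFS datastore on the ESXi
host.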

On Tue, Jun 6, 2017 at 2:33 AM, Adrian Saul 
wrote:

> > > Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5
> > > and
> > > 6.0 hosts(migrating from a VMWare environment), later to transition to
> > > qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio
> > > and saw much worse performance with the first cluster, so it seems
> > > this may be the better way, but I'm open to other suggestions.
> > >
> > I've never seen any ultimate solution to providing HA iSCSI on top of
> Ceph,
> > though other people here have made significant efforts.
>
> In our tests our best results were with SCST - also because it provided
> proper ALUA support at the time.  I ended up developing my own pacemaker
> cluster resources to manage the SCST orchestration and ALUA failover.  In
> our model we have  a pacemaker cluster in front being an RBD client
> presenting LUNs/NFS out to VMware (NFS), Solaris and Hyper-V (iSCSI).  We
> are using CephFS over NFS but performance has been poor, even using it just
> for VMware templates.  We are on an earlier version of Jewel so its
> possibly some later versions may improve CephFS for that but I have not had
> time to test it.
>
> We have been running a small production/POC for over 18 months on that
> setup, and gone live into a much larger setup in the last 6 months based on
> that model.  It's not without its issues, but most of that is a lack of
> test resources to be able to shake out some of the client compatibility and
> failover shortfalls we have.
>
> Confidentiality: This email and any attachments are confidential and may
> be subject to copyright, legal or some other professional privilege. They
> are intended solely for the attention and use of the named addressee(s).
> They may only be copied, distributed or disclosed with the consent of the
> copyright owner. If you have received this email by mistake or by breach of
> the confidentiality clause, please notify the sender immediately by return
> email and delete or destroy all copies of the email. Any confidentiality,
> privilege or copyright is not waived or lost because this email has been
> sent to you by mistake.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] design guidance

2017-06-06 Thread Daniel K
Christian,

Thank you for the tips -- I certainly googled my eyes out for a good while
before asking -- maybe my google-fu wasn't too good last night.

> I love using IB, alas with just one port per host you're likely best off
> ignoring it, unless you have a converged network/switches that can make
> use of it (or run it in Ethernet mode).

I've always heard people speak fondly of IB, but I've honestly never dealt
with it. I'm mostly a network guy at heart, so I'm perfectly comfortable
aggregating 10Gb/s connections till the cows come home. What are some of
the virtues of IB over Ethernet? (not Ethernet over IB)

> Bluestore doesn't have journals per se and unless you're going to wait for
> Luminous I wouldn't recommend using Bluestore in production.
> Hell, I won't be using it any time soon, but anything pre L sounds
> like outright channeling Murphy to smite you

I do like to play with fire often, but not normally with other people's
data. I suppose I will stay away from Bluestore for now, unless Luminous is
released within the next few weeks. I am using it on  Kraken in my small
test-cluster so far without a visit from Murphy.

> That said, what SSD is it?
> Bluestore WAL needs are rather small.
> OTOH, a single SSD isn't something I'd recommend either, SPOF and all.

> I'm guessing you have no budget to improve on that gift horse?

It's a Micron 1100 256GB, rated for 120TBW, which works out to about
100GB/day for 3 years, so not even 0.5 DWPD. I doubt it has the endurance to
journal 36 1TB drives.
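
(120TB over 3 years is 120,000GB / 1,095 days, or ~110GB/day; against a 256GB
drive that is ~0.43 drive writes per day.)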

I do have some room in the budget, and NVMe journals have been on the back
of my mind. These servers have 6 PCIe x8 slots in them, so tons of room.
But then I'm going to get asked about a cache tier, which everyone seems to
think is the holy grail (and probably would be, if they could 'just work')

But from what I read, they're an utter nightmare to manage, particularly
without a well defined workload, and often would hurt more than they help.

I haven't spent a ton of time with the network gear that was dumped on me,
but the switches I have now are a Nexus 7000, x4 Force10 S4810 (so I do
have some stackable 10Gb that I can MC-LAG), x2 Mellanox IS5023 (18 port IB
switch), what appears to be a giant IB switch (Qlogic 12800-120) and
another apparently big boy (Qlogic 12800-180). I'm going to pick them up
from the warehouse tomorrow.

If I stay away from IB completely, I may just use the IB card as 4x10Gb +
the 2x 10Gb on board like I had originally mentioned. But if that IB gear
is good, I'd hate to see it go to waste. Might be worth getting a second IB
card for each server.



Again, thanks a million for the advice. I'd rather learn this the easy way
than to have to rebuild this 6 times over the next 6 months.






On Tue, Jun 6, 2017 at 2:05 AM, Christian Balzer <ch...@gol.com> wrote:

>
> Hello,
>
> lots of similar questions in the past, google is your friend.
>
> On Mon, 5 Jun 2017 23:59:07 -0400 Daniel K wrote:
>
> > I've built 'my-first-ceph-cluster' with two of the 4-node, 12 drive
> > Supermicro servers and dual 10Gb interfaces(one cluster, one public)
> >
> > I now have 9x 36-drive supermicro StorageServers made available to me,
> each
> > with dual 10GB and a single Mellanox IB/40G nic. No 1G interfaces except
> > IPMI. 2x 6-core 6-thread 1.7ghz xeon processors (12 cores total) for 36
> > drives. Currently 32GB of ram. 36x 1TB 7.2k drives.
> >
> I love using IB, alas with just one port per host you're likely best off
> ignoring it, unless you have a converged network/switches that can make
> use of it (or run it in Ethernet mode).
>
> > Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5 and
> > 6.0 hosts(migrating from a VMWare environment), later to transition to
> > qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio and
> saw
> > much worse performance with the first cluster, so it seems this may be
> the
> > better way, but I'm open to other suggestions.
> >
> I've never seen any ultimate solution to providing HA iSCSI on top of
> Ceph, though other people here have made significant efforts.
>
> > Considerations:
> > Best practice documents indicate .5 cpu per OSD, but I have 36 drives and
> > 12 CPUs. Would it be better to create 18x 2-drive raid0 on the hardware
> > raid card to present a fewer number of larger devices to ceph? Or run
> > multiple drives per OSD?
> >
> You're definitely underpowered in the CPU department and I personally
> would make RAID1 or 10s for never having to re-balance an OSD.
> But if space is an issue, RAID0s would do.
> OTOH, w/o any SSDs in the game your HDD only cluster is going to be less
> CPU hungry than others.
>
> > There is a single 256gb SSD which i feel would be a bottleneck if I used
> it

[ceph-users] design guidance

2017-06-05 Thread Daniel K
I've built 'my-first-ceph-cluster' with two of the 4-node, 12 drive
Supermicro servers and dual 10Gb interfaces(one cluster, one public)

I now have 9x 36-drive supermicro StorageServers made available to me, each
with dual 10GB and a single Mellanox IB/40G nic. No 1G interfaces except
IPMI. 2x 6-core 6-thread 1.7ghz xeon processors (12 cores total) for 36
drives. Currently 32GB of ram. 36x 1TB 7.2k drives.

Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5 and
6.0 hosts(migrating from a VMWare environment), later to transition to
qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio and saw
much worse performance with the first cluster, so it seems this may be the
better way, but I'm open to other suggestions.

Considerations:
Best practice documents indicate .5 cpu per OSD, but I have 36 drives and
12 CPUs. Would it be better to create 18x 2-drive raid0 on the hardware
raid card to present a fewer number of larger devices to ceph? Or run
multiple drives per OSD?
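
(The arithmetic: 36 OSDs x 0.5 CPU = 18 cores needed versus 12 available,
whereas 18 two-drive RAID0 devices x 0.5 = 9 cores, which would fit.)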

There is a single 256GB SSD which I feel would be a bottleneck if I used it
as a journal for all 36 drives, so I believe bluestore with a journal on
each drive would be the best option.

Is 1.7GHz too slow for what I'm doing?

I like the idea of keeping the public and cluster networks separate. Any
suggestions on which interfaces to use for what? I could theoretically push
36Gb/s, figuring 125MB/s for each drive, but in reality will I ever see
that? Perhaps bond the two 10Gb and use them as the public, and the 40Gb as
the cluster network? Or split the 40Gb into 4x10Gb and use 3x10Gb bonded
for each?


If there is a more appropriate venue for my request, please point me in
that direction.

Thanks,
Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Kraken bluestore compression

2017-06-05 Thread Daniel K
Hi,

I see several mentions that compression is available in Kraken for
bluestore OSDs, however, I can find almost nothing in the documentation
that indicates how to use it.

I've found:
- http://docs.ceph.com/docs/master/radosgw/compression/
- http://ceph.com/releases/v11-2-0-kraken-released/

I'm fairly new to ceph, so I don't have a good grasp of how rados commands
apply to osd pools, so if the first link is relevant, I apologize.

I am seeing:
"|compression_mode|compression_algorithm|compression_required_ratio|compression_max_blob_size|compression_min_blob_size"
as options when I run ceph osd pool set -- but I can't find anything
documented to explain what parameters are available for those options.
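
From bits and pieces I've picked up elsewhere (unverified against any official
documentation), I suspect the usage is something along the lines of:

ceph osd pool set <pool> compression_algorithm snappy    (or zlib)
ceph osd pool set <pool> compression_mode aggressive     (or none / passive / force)
ceph osd pool set <pool> compression_required_ratio .875

but I haven't been able to confirm any of that.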


Could someone point me in the right direction?

Thanks,
Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds slow request, getattr currently failed to rdlock. Kraken with Bluestore

2017-05-24 Thread Daniel K
Yes -- the crashed server also mounted cephfs as a client, and also likely
had active writes to the file when it crashed.

I have the max file size set to  17,592,186,044,416 -- but this file was
about 5.8TB.

The likely reason for the crash? The file was mounted as a fileio backstore
to LIO, which was exported as an FC LUN that I had connected to an ESXi
server, mapped via RDM to a guest, in which I had a dd if=/dev/zero
of=/dev/sdb bs=1M count=6 running (for several hours).


Which I think was breaking at least 3 "don't do this" rules with ceph. Once
it moves into production the pieces will be separated.










On Wed, May 24, 2017 at 4:55 PM, Gregory Farnum <gfar...@redhat.com> wrote:

> On Wed, May 24, 2017 at 3:15 AM, John Spray <jsp...@redhat.com> wrote:
> > On Tue, May 23, 2017 at 11:41 PM, Daniel K <satha...@gmail.com> wrote:
> >> Have a 20 OSD cluster -"my first ceph cluster" that has another 400 OSDs
> >> enroute.
> >>
> >> I was "beating up" on the cluster, and had been writing to a 6TB file in
> >> CephFS for several hours, during which I changed the crushmap to better
> >> match my environment, generating a bunch of recovery IO. After about
> 5.8TB
> >> written, one of the OSD(which is also a MON..soon to be rectivied) hosts
> >> crashed that hat 5 OSDs on it, and after rebooting, I have this in ceph
> -s:
> >> (The degraded/misplaced warnings are likely because the cluster hasn't
> >> completed rebalancing after I changed the crushmap I assume)
> >>
> >
> > Losing a quarter of your OSDs down while simultaneously rebalancing
> > after editing your CRUSH map is a brutal thing to a Ceph cluster, and
> > I would expect it to impact your client IO severely.
> >
> > I see that you've got 112MB/s of recovery going on, which may or may
> > not be saturating some links depending on whether you're using 1gig or
> > 10gig networking.
> >
> >> 2017-05-23 18:33:13.775924 7ff9d3230700 -1 WARNING: the following
> dangerous
> >> and experimental features are enabled: bluestore
> >> 2017-05-23 18:33:13.781732 7ff9d3230700 -1 WARNING: the following
> dangerous
> >> and experimental features are enabled: bluestore
> >> cluster e92e20ca-0fe6-4012-86cc-aa51e041
> >>  health HEALTH_WARN
> >> 440 pgs backfill_wait
> >> 7 pgs backfilling
> >> 85 pgs degraded
> >> 5 pgs recovery_wait
> >> 85 pgs stuck degraded
> >> 452 pgs stuck unclean
> >> 77 pgs stuck undersized
> >> 77 pgs undersized
> >> recovery 196526/3554278 objects degraded (5.529%)
> >> recovery 1690392/3554278 objects misplaced (47.559%)
> >> mds0: 1 slow requests are blocked > 30 sec
> >>  monmap e4: 3 mons at
> >> {stor-vm1=10.0.15.51:6789/0,stor-vm2=10.0.15.52:6789/0,
> stor-vm3=10.0.15.53:6789/0}
> >> election epoch 136, quorum 0,1,2 stor-vm1,stor-vm2,stor-vm3
> >>   fsmap e21: 1/1/1 up {0=stor-vm4=up:active}
> >> mgr active: stor-vm1 standbys: stor-vm2
> >>  osdmap e4655: 20 osds: 20 up, 20 in; 450 remapped pgs
> >> flags sortbitwise,require_jewel_osds,require_kraken_osds
> >>   pgmap v192589: 1428 pgs, 5 pools, 5379 GB data, 1345 kobjects
> >> 11041 GB used, 16901 GB / 27943 GB avail
> >> 196526/3554278 objects degraded (5.529%)
> >> 1690392/3554278 objects misplaced (47.559%)
> >>  975 active+clean
> >>  364 active+remapped+backfill_wait
> >>   76 active+undersized+degraded+remapped+backfill_wait
> >>3 active+recovery_wait+degraded+remapped
> >>3 active+remapped+backfilling
> >>3 active+degraded+remapped+backfilling
> >>2 active+recovery_wait+degraded
> >>1 active+clean+scrubbing+deep
> >>1 active+undersized+degraded+remapped+backfilling
> >> recovery io 112 MB/s, 28 objects/s
> >>
> >>
> >> Seems related to the "corrupted rbd filesystems since jewel" thread.
> >>
> >>
> >> log entries on the MDS server:
> >>
> >> 2017-05-23 18:27:12.966218 7f95ed6c0700  0 log_channel(cluster) log
> [WRN] :
> >> slow request 243.113407 seconds old, received at 2017-05-23
> 18:23:09.852729:
> >> cl

Re: [ceph-users] mds slow request, getattr currently failed to rdlock. Kraken with Bluestore

2017-05-24 Thread Daniel K
Networking is 10Gig. I notice recovery IO is wildly variable, I assume
that's normal.

Very little load as this is yet to go into production, I was "seeing what
it would handle" at the time it broke.

I checked this morning and the slow request had gone and I could access the
blocked file again.

All OSes are Ubuntu 16.04.01 with the stock 4.4.0-72-generic kernel, and
there were two CephFS clients accessing it, also 16.04.1.

Ceph on all is 11.2.0, installed from the debian-kraken repos at
download.ceph.com. All OSDs are bluestore.


As of now all is okay, so I don't want to waste anyone's time on a wild goose
chase.




On Wed, May 24, 2017 at 6:15 AM, John Spray <jsp...@redhat.com> wrote:

> On Tue, May 23, 2017 at 11:41 PM, Daniel K <satha...@gmail.com> wrote:
> > Have a 20 OSD cluster -"my first ceph cluster" that has another 400 OSDs
> > enroute.
> >
> > I was "beating up" on the cluster, and had been writing to a 6TB file in
> > CephFS for several hours, during which I changed the crushmap to better
> > match my environment, generating a bunch of recovery IO. After about
> 5.8TB
> > written, one of the OSD (which is also a MON.. soon to be rectified) hosts
> > crashed that had 5 OSDs on it, and after rebooting, I have this in ceph
> -s:
> > (The degraded/misplaced warnings are likely because the cluster hasn't
> > completed rebalancing after I changed the crushmap I assume)
> >
>
> Losing a quarter of your OSDs down while simultaneously rebalancing
> after editing your CRUSH map is a brutal thing to a Ceph cluster, and
> I would expect it to impact your client IO severely.
>
> I see that you've got 112MB/s of recovery going on, which may or may
> not be saturating some links depending on whether you're using 1gig or
> 10gig networking.
>
> > 2017-05-23 18:33:13.775924 7ff9d3230700 -1 WARNING: the following
> dangerous
> > and experimental features are enabled: bluestore
> > 2017-05-23 18:33:13.781732 7ff9d3230700 -1 WARNING: the following
> dangerous
> > and experimental features are enabled: bluestore
> > cluster e92e20ca-0fe6-4012-86cc-aa51e041
> >  health HEALTH_WARN
> > 440 pgs backfill_wait
> > 7 pgs backfilling
> > 85 pgs degraded
> > 5 pgs recovery_wait
> > 85 pgs stuck degraded
> > 452 pgs stuck unclean
> > 77 pgs stuck undersized
> > 77 pgs undersized
> > recovery 196526/3554278 objects degraded (5.529%)
> > recovery 1690392/3554278 objects misplaced (47.559%)
> > mds0: 1 slow requests are blocked > 30 sec
> >  monmap e4: 3 mons at
> > {stor-vm1=10.0.15.51:6789/0,stor-vm2=10.0.15.52:6789/0,stor-
> vm3=10.0.15.53:6789/0}
> > election epoch 136, quorum 0,1,2 stor-vm1,stor-vm2,stor-vm3
> >   fsmap e21: 1/1/1 up {0=stor-vm4=up:active}
> > mgr active: stor-vm1 standbys: stor-vm2
> >  osdmap e4655: 20 osds: 20 up, 20 in; 450 remapped pgs
> > flags sortbitwise,require_jewel_osds,require_kraken_osds
> >   pgmap v192589: 1428 pgs, 5 pools, 5379 GB data, 1345 kobjects
> > 11041 GB used, 16901 GB / 27943 GB avail
> > 196526/3554278 objects degraded (5.529%)
> > 1690392/3554278 objects misplaced (47.559%)
> >  975 active+clean
> >  364 active+remapped+backfill_wait
> >   76 active+undersized+degraded+remapped+backfill_wait
> >3 active+recovery_wait+degraded+remapped
> >3 active+remapped+backfilling
> >3 active+degraded+remapped+backfilling
> >2 active+recovery_wait+degraded
> >1 active+clean+scrubbing+deep
> >1 active+undersized+degraded+remapped+backfilling
> > recovery io 112 MB/s, 28 objects/s
> >
> >
> > Seems related to the "corrupted rbd filesystems since jewel" thread.
> >
> >
> > log entries on the MDS server:
> >
> > 2017-05-23 18:27:12.966218 7f95ed6c0700  0 log_channel(cluster) log
> [WRN] :
> > slow request 243.113407 seconds old, received at 2017-05-23
> 18:23:09.852729:
> > client_request(client.204100:5 getattr pAsLsXsFs #10003ec 2017-05-23
> > 17:48:23.770852 RETRY=2 caller_uid=0, caller_gid=0{}) currently failed to
> > rdlock, waiting
> >
> >
> > output of ceph daemon mds.stor-vm4 objecter_requests(changes each time I
> run
> > it)
>
> If that changes each time you run i