Re: [ceph-users] OSDs keep crashing after cluster reboot

2019-08-08 Thread Ansgar Jazdzewski
We got our OSDs back.

Since we removed the EC pool (cephfs.data) we had to figure out how to
remove the PGs from the offline OSDs, and here is how we did it.

remove cephfs, remove the cache layer, remove the pools:
#ceph mds fail 0
#ceph fs rm cephfs --yes-i-really-mean-it
#ceph osd tier remove-overlay cephfs.data
there is now (or already was) no overlay for 'cephfs.data'
#ceph osd tier remove cephfs.data cephfs.cache
pool 'cephfs.cache' is now (or already was) not a tier of 'cephfs.data'
#ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
#ceph osd pool delete cephfs.cache cephfs.cache --yes-i-really-really-mean-it
pool 'cephfs.cache' removed
#ceph osd pool delete cephfs.data cephfs.data --yes-i-really-really-mean-it
pool 'cephfs.data' removed
#ceph osd pool delete cephfs.metadata cephfs.metadata
--yes-i-really-really-mean-it
pool 'cephfs.metadata' removed
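
To sanity-check that the filesystem and its pools are really gone before
touching the OSDs (nothing special, just the obvious commands):

ceph fs ls                 # should report that no filesystems are enabled
ceph osd pool ls detail    # the cephfs.* pools should no longer be listed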

remove placement groups of pool 23 (cephfs.data) from all offline OSDs:
# run on each OSD host, for every OSD that fails to start (daemon stopped)
OSD=44   # id of the affected OSD, repeat for each offline OSD
DATAPATH=/var/lib/ceph/osd/ceph-${OSD}
# list all PGs of pool 23 on this OSD and remove them one by one
a=`ceph-objectstore-tool --data-path ${DATAPATH} --op list-pgs | grep "^23\."`
for i in $a; do
  echo "INFO: removing ${i} from OSD ${OSD}"
  ceph-objectstore-tool --data-path ${DATAPATH} --pgid ${i} --op remove --force
done

Since we have now removed our cephfs, we still don't know whether we could
have solved this without data loss by upgrading to Nautilus.

Have a nice Weekend,
Ansgar

On Wed, 7 Aug 2019 at 17:03, Ansgar Jazdzewski wrote:
>
> another update,
>
> we now took the more destructive route and removed the cephfs pools
> (luckily we had only test data in the filesystem).
> Our hope was that during the startup process the OSDs would delete the
> no-longer-needed PGs, but this is NOT the case.
>
> So we still have the same issue; the only difference is that the PGs
> no longer belong to a pool.
>
>  -360> 2019-08-07 14:52:32.655 7fb14db8de00  5 osd.44 pg_epoch: 196586
> pg[23.f8s0(unlocked)] enter Initial
>  -360> 2019-08-07 14:52:32.659 7fb14db8de00 -1
> /build/ceph-13.2.6/src/osd/ECUtil.h: In function
> 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread
> 7fb14db8de00 time 2019-08-07 14:52:32.660169
> /build/ceph-13.2.6/src/osd/ECUtil.h: 34: FAILED assert(stripe_width %
> stripe_size == 0)
>
> We can now take one route and try to delete the PGs by hand on the
> (BlueStore) OSDs; how can this be done? Or we try to upgrade to Nautilus
> and hope for the best.
>
> Any help or hints are welcome,
> have a nice one
> Ansgar
>
> On Wed, 7 Aug 2019 at 11:32, Ansgar Jazdzewski wrote:
> >
> > Hi,
> >
> > as a follow-up:
> > * a full log of one OSD failing to start https://pastebin.com/T8UQ2rZ6
> > * our EC pool creation in the first place https://pastebin.com/20cC06Jn
> > * ceph osd dump and ceph osd erasure-code-profile get cephfs
> > https://pastebin.com/TRLPaWcH
> >
> > As we dig deeper into it, it looks like a bug in the cephfs or
> > erasure-coding part of Ceph.
> >
> > Ansgar
> >
> >
> > On Tue, 6 Aug 2019 at 14:50, Ansgar Jazdzewski wrote:
> > >
> > > hi folks,
> > >
> > > We had to move one of our clusters, so we had to reboot all servers;
> > > now we see an error on all OSDs with the EC pool.
> > >
> > > Are we missing some options? Will an upgrade to 13.2.6 help?
> > >
> > >
> > > Thanks,
> > > Ansgar
> > >
> > > 2019-08-06 12:10:16.265 7fb337b83200 -1
> > > /build/ceph-13.2.4/src/osd/ECUtil.h: In function
> > > 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread
> > > 7fb337b83200 time 2019-08-06 12:10:16.263025
> > > /build/ceph-13.2.4/src/osd/ECUtil.h: 34: FAILED assert(stripe_width %
> > > stripe_size == 0)
> > >
> > > ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
> > >  1: (ceph::ceph_assert_fail(...)+0x102) [0x7fb32eeb83c2]
> > >  2: (()+0x2e5587) [0x7fb32eeb8587]
> > >  3: (ECBackend::ECBackend(...)+0x4de) [0xa4cbbe]
> > >  4: (PGBackend::build_pg_backend(...)+0x2f9) [0x9474e9]
> > >  5: (PrimaryLogPG::PrimaryLogPG(...)+0x138) [0x8f96e8]
> > >  6: (OSD::_make_pg(...)+0x11d3) [0x753553]
> > >  7: (OSD::load_pgs()+0x4a9) [0x758339]
> > >  8: (OSD::ini

Re: [ceph-users] CephFS snapshot for backup & disaster recovery

2019-08-08 Thread Alexandre DERUMIER
Hi,

>>I'm running a single-host Ceph cluster for CephFS and I'd like to keep 
>>backups in Amazon S3 for disaster recovery. Is there a simple way to extract 
>>a CephFS snapshot as a single file and/or to create a file that represents 
>>the incremental difference between two snapshots?

I think it's on the roadmap for the next Ceph version.
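
In the meantime you can read a snapshot straight out of the .snap directory of
the mounted filesystem and ship it yourself. A rough sketch (mount point,
snapshot names and bucket are placeholders, and any S3 client will do; the aws
cli is just an example):

# one snapshot as a single compressed object in S3
tar -C /mnt/cephfs/.snap/snap-2019-08-08 -czf - . \
  | aws s3 cp - s3://my-backup-bucket/cephfs-snap-2019-08-08.tar.gz

# poor man's incremental: list files that are new/changed between two snapshots
rsync -an --out-format='%n' \
  /mnt/cephfs/.snap/snap-2019-08-08/ /mnt/cephfs/.snap/snap-2019-08-07/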


- Original Message -
From: "Eitan Mosenkis" 
To: "Vitaliy Filippov" 
Cc: "ceph-users" 
Sent: Monday, 5 August 2019 18:43:00
Subject: Re: [ceph-users] CephFS snapshot for backup & disaster recovery

I'm using it for a NAS to make backups from the other machines on my home 
network. Since everything is in one location, I want to keep a copy offsite for 
disaster recovery. Running Ceph across the internet is not recommended and is 
also very expensive compared to just storing snapshots. 

On Sun, Aug 4, 2019 at 3:08 PM Виталий Филиппов <vita...@yourcmc.ru> wrote:



Afaik no. What's the idea of running a single-host cephfs cluster? 

On 4 August 2019 13:27:00 GMT+03:00, Eitan Mosenkis <ei...@mosenkis.net> wrote:

I'm running a single-host Ceph cluster for CephFS and I'd like to keep backups 
in Amazon S3 for disaster recovery. Is there a simple way to extract a CephFS 
snapshot as a single file and/or to create a file that represents the 
incremental difference between two snapshots? 




-- 
With best regards, 
Vitaliy Filippov 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS corruption

2019-08-08 Thread ☣Adam
I had a machine with insufficient memory and it seems to have corrupted
data on my MDS.  The filesystem seems to be working fine, with the
exception of accessing specific files.

The ceph-mds logs include things like:
mds.0.1596621 unhandled write error (2) No such file or directory, force
readonly...
dir 0x100fb03 object missing on disk; some files may be lost
(/adam/programming/bash)

I'm using mimic and trying to follow the instructions here:
https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/

The punchline is this:
cephfs-journal-tool --rank all journal export backup.bin
Error ((22) Invalid argument)
2019-08-08 20:02:39.847 7f06827537c0 -1 main: Couldn't determine MDS rank.

I have a backup (outside of Ceph) of all the data which is inaccessible, and
I can back up anything which is still accessible if need be.  There's some more
information below, but my main question is: what are my next steps?

On a side note, I'd like to get involved with helping with documentation
(man pages, the ceph website, usage text, etc). Where can I get started?



Here's the context:

cephfs-journal-tool event recover_dentries summary
Error ((22) Invalid argument)
2019-08-08 19:50:04.798 7f21f4ffe7c0 -1 main: missing mandatory "--rank"
argument

Seems like a bug in the documentation since `--rank` is a "mandatory
option" according to the help text.  It looks like the rank of this node
for MDS is 0, based on `ceph health detail`, but using `--rank 0` or
`--rank all` doesn't work either:

ceph health detail
HEALTH_ERR 1 MDSs report damaged metadata; 1 MDSs are read only
MDS_DAMAGE 1 MDSs report damaged metadata
mdsge.hax0rbana.org(mds.0): Metadata damage detected
MDS_READ_ONLY 1 MDSs are read only
mdsge.hax0rbana.org(mds.0): MDS in read-only mode

cephfs-journal-tool --rank 0 event recover_dentries summary
Error ((22) Invalid argument)
2019-08-08 19:54:45.583 7f5b37c4c7c0 -1 main: Couldn't determine MDS rank.
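
One thing I haven't tried yet: the tool seems to want the rank qualified with
the filesystem name rather than a bare number. Assuming my filesystem is
simply named "cephfs", that would be something like (untested guess on my
side):

cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
cephfs-journal-tool --rank=cephfs:all journal export backup.bin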


The only place I've found this error message is in an unanswered
stackoverflow question and in the source code here:
https://github.com/ceph/ceph/blob/master/src/tools/cephfs/JournalTool.cc#L114

It looks like that is trying to read a filesystem map (fsmap), which
might be corrupted.  Running `rados export` prints part of the help text
and then segfaults, which is rather concerning.  This is 100% repeatable
(outside of gdb, details below).  I tried `rados df` and that worked
fine, so it's not all rados commands which are having this problem.
However, I tried `rados bench 60 seq` and that also printed out the
usage text and then segfaulted.





Info on the `rados export` crash:
rados export
usage: rados [options] [commands]
POOL COMMANDS

IMPORT AND EXPORT
   export [filename]
   Serialize pool contents to a file or standard out.

OMAP OPTIONS:
--omap-key-file file         read the omap key from a file
*** Caught signal (Segmentation fault) **
 in thread 7fcb6bfff700 thread_name:fn_anonymous

When running it in gdb:
(gdb) bt
#0  0x7fffef07331f in std::_Rb_tree<...>::find(std::__cxx11::basic_string<...> const&) const ()
   from /usr/lib/ceph/libceph-common.so.0
Backtrace stopped: Cannot access memory at address 0x7fffd9ff89f8

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tcmu-runner: "Acquired exclusive lock" every 21s

2019-08-08 Thread Matthias Leopold




On 06.08.19 18:28, Mike Christie wrote:

On 08/06/2019 07:51 AM, Matthias Leopold wrote:



On 05.08.19 18:31, Mike Christie wrote:

On 08/05/2019 05:58 AM, Matthias Leopold wrote:

Hi,

I'm still testing my 2 node (dedicated) iSCSI gateway with ceph 12.2.12
before I dare to put it into production. I installed latest tcmu-runner
release (1.5.1) and (like before) I'm seeing that both nodes switch
exclusive locks for the disk images every 21 seconds. tcmu-runner logs
look like this:

2019-08-05 12:53:04.184 13742 [WARN] tcmu_notify_lock_lost:222
rbd/iscsi.test03: Async lock drop. Old state 1
2019-08-05 12:53:04.714 13742 [WARN] tcmu_rbd_lock:762 rbd/iscsi.test03:
Acquired exclusive lock.
2019-08-05 12:53:25.186 13742 [WARN] tcmu_notify_lock_lost:222
rbd/iscsi.test03: Async lock drop. Old state 1
2019-08-05 12:53:25.773 13742 [WARN] tcmu_rbd_lock:762 rbd/iscsi.test03:
Acquired exclusive lock.

Old state can sometimes be 0 or 2.
Is this expected behaviour?


What initiator OS are you using?



I'm using CentOS 7 initiators and had somehow neglected to configure
multipathd on them correctly (device { vendor "LIO.ORG" ... }). After
fixing that, the above problem disappeared and the output of "multipath
-ll" finally looks correct. Thanks for pointing me to this.
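
For anyone else stumbling over this, the missing piece is a multipath
"devices" entry along these lines; this is only a sketch, so double-check the
ceph-iscsi initiator documentation for your release and the vendor string
your gateways actually report in "multipath -ll":

cat >> /etc/multipath.conf <<'EOF'
devices {
        device {
                vendor               "LIO-ORG"
                hardware_handler     "1 alua"
                path_grouping_policy "failover"
                path_selector        "queue-length 0"
                path_checker         tur
                prio                 alua
                prio_args            exclusive_pref_bit
                failback             60
                fast_io_fail_tmo     25
                no_path_retry        queue
        }
}
EOF
systemctl reload multipathd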

Nevertheless there's now another problem visible in the logs. As soon as
an initiator logs in tcmu-runner on the gateway node that doesn't own
the image being accessed logs

[ERROR] tcmu_rbd_has_lock:516 rbd/iscsi.test02: Could not check lock
ownership. Error: Cannot send after transport endpoint shutdown.

This disappears after the osd blacklist entries for the node expire
(visible with "ceph osd blacklist ls"). I haven't yet understood how
this is supposed to work, right now I restarted from scratch (logged
out, waited till all blacklist entries disappeared, logged in) and I'm
again seeing several blacklist entries for both gateway nodes (and the
above error message in tcmu-runner.log). This doesn't seem to interfere
with the iSCSI service, but I want this explained/resolved before I can
start using the gateways.


This is expected. Before multipath kicks in during path
addition/re-addition and during failover/failback you can have IO on
multiple paths, so the lock is going to bounce temporarily and the
gateways are going to be blacklisted.

It should not happen non-stop like you saw in the original email.
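
The blacklist entries themselves are harmless and expire on their own; if one
ever needs to go away sooner, you can remove it by hand using the address
shown by "ceph osd blacklist ls", e.g.:

ceph osd blacklist rm 192.168.122.31:0/3418103543   # example address only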



Thank you for the explanation.

Matthias
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] out of memory bluestore osds

2019-08-08 Thread Jaime Ibar

Hi Mark,

thanks a lot for your explanation and clarification.

Adjusting osd_memory_target to fit in our systems did the trick.
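
For the archives, that boils down to lowering the per-OSD target, e.g. to the
2GB floor Mark mentions (pick whatever fits your RAM budget, and persist it in
ceph.conf under [osd] as "osd memory target" so it survives restarts):

ceph tell osd.* injectargs '--osd_memory_target=2147483648'
# if a running OSD does not seem to pick up the new value, restart it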

Jaime

On 07/08/2019 14:09, Mark Nelson wrote:

Hi Jaime,


we only use the cache size parameters now if you've disabled 
autotuning.  With autotuning we adjust the cache size on the fly to 
try and keep the mapped process memory under the osd_memory_target.  
You can set a lower memory target than default, though you will have 
far less cache for bluestore onodes and rocksdb.  You may notice that 
it's slower, especially if you have a big active data set you are 
processing.  I don't usually recommend setting the osd_memory_target 
below 2GB.  At some point it will have shrunk the caches as far as it 
can and the process memory may start exceeding the target.  (with our 
default rocksdb and pglog settings this usually happens somewhere 
between 1.3-1.7GB once the OSD has been sufficiently saturated with 
IO). Given memory prices right now, I'd still recommend upgrading RAM 
if you have the ability though.  You might be able to get away with 
setting each OSD to 2-2.5GB in your scenario but you'll be pushing it.
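
If you want to see where the memory is actually going while you experiment,
the admin socket is handy (osd.0 is just an example id):

ceph daemon osd.0 dump_mempools                  # per-pool usage: bluestore caches, pglog, etc.
ceph daemon osd.0 config get osd_memory_target   # the target the OSD is actually running with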



I would not recommend lowering the osd_memory_cache_min.  You really 
want rocksdb indexes/filters fitting in cache, and as many bluestore 
onodes as you can get.  In any event, you'll still be bound by the 
(currently hardcoded) 64MB cache chunk allocation size in the 
autotuner which osd_memory_cache_min can't reduce (and that's per 
cache while osd_memory_cache_min is global for the kv, buffer, and 
rocksdb block caches).  IE each cache is going to get 64MB+growth room 
regardless of how low you set osd_memory_cache_min.  That's 
intentional as we don't want a single SST file in rocksdb to be able 
to completely blow everything else out of the block cache during 
compaction, only to quickly become invalid, removed from the cache, 
and make it look to the priority cache system like rocksdb doesn't 
actually need any more memory for cache.



Mark


On 8/7/19 7:44 AM, Jaime Ibar wrote:

Hi all,

we run a Ceph Luminous 12.2.12 cluster with 7 OSD servers, 12x4TB disks
each.

Recently we redeployed the OSDs of one of them using the bluestore backend;
however, after this, we're facing out-of-memory errors (invoked oom-killer)
and the OS kills one of the ceph-osd processes.
The OSD is restarted automatically and is back online after one minute.
We're running Ubuntu 16.04, kernel 4.15.0-55-generic.
The server has 32GB of RAM and 4GB of swap partition.
All the disks are hdd, no ssd disks.
Bluestore settings are the default ones

"osd_memory_target": "4294967296"
"osd_memory_cache_min": "134217728"
"bluestore_cache_size": "0"
"bluestore_cache_size_hdd": "1073741824"
"bluestore_cache_autotune": "true"

As stated in the documentation, bluestore assigns by default 4GB of
RAM per OSD (1GB of RAM per 1TB).
So in this case 48GB of RAM would be needed. Am I right?

Are these the minimum requirements for bluestore?
In case adding more RAM is not an option, can any of
osd_memory_target, osd_memory_cache_min, bluestore_cache_size_hdd
be decreased to fit within our server specs?
Would this have any impact on performance?

Thanks
Jaime




--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com