[ceph-users] Re: cephfs vs rbd

2021-10-08 Thread Robert W. Eckert
That is odd - I am running some game servers (ARK Survival) and the RBD mount
starts up in less than a minute, but the CephFS mount takes 20 minutes or more.
It probably depends on the workload.

-Original Message-
From: Marc  
Sent: Friday, October 8, 2021 5:50 PM
To: Jorge Garcia ; ceph-users@ceph.io
Subject: [ceph-users] Re: cephfs vs rbd

> I was wondering about performance differences between cephfs and rbd, 
> so I devised this quick test. The results were pretty surprising to me.
> 
> The test: on a very idle machine, make 2 mounts. One is a cephfs 
> mount, the other an rbd mount. In each directory, copy a humongous 
> .tgz file
> (1.5 TB) and try to untar the file into the directory. The untar on 
> the cephfs directory took slightly over 2 hours, but on the rbd 
> directory it took almost a whole day. I repeated the test 3 times and 
> the results were similar each time. Is there something I'm missing? Is 
> RBD that much slower than cephfs (or is cephfs that much faster than 
> RBD)? Are there any tuning options I can try to improve RBD performance?
> 

When I was testing between using cephfs or rbd in a VM, I noticed that cephfs
was around 25% faster; that was on Luminous.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RFP for arm64 test nodes

2021-10-08 Thread Dan Mick
Ceph has been completely ported to build and run on ARM hardware 
(architecture arm64/aarch64), but we're unable to test it due to lack of 
hardware.  We propose to purchase a significant number of ARM servers 
(50+?) to install in our upstream Sepia test lab to use for upstream 
testing of Ceph, alongside the x86 hardware we already own.


This message is to start a discussion of what the nature of that 
hardware should be, and an investigation as to what's available and how 
much it might cost.  The general idea is to build something arm64-based 
that is similar to the smithi/gibba nodes:


https://wiki.sepia.ceph.com/doku.php?id=hardware:gibba

Some suggested features:

* base hardware/peripheral support for current releases of RHEL, CentOS, 
Ubuntu
* 1 fast and largish (400 GB+) NVMe drive for OSDs (it will be
partitioned into 4-5 subdrives for tests; see the sketch after this list)
* 1 large (1TB+) SSD/HDD for boot/system and logs (faster is better but 
not as crucial as for cluster storage)

* Remote/headless management (IPMI?)
* At least 1 10G network interface per host
* Order of 64GB main memory per host
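
A rough sketch of what "4-5 subdrives" could look like, using LVM on a hypothetical
/dev/nvme0n1 (device name and sizes are only an example, not a decision):

   pvcreate /dev/nvme0n1
   vgcreate osd-vg /dev/nvme0n1
   for i in 0 1 2 3; do lvcreate -L 90G -n osd-$i osd-vg; done   # four ~90G logical volumes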

Density is valuable to the lab; we have space but not an unlimited amount.

Any suggestions on vendors or specific server configurations?

Thanks!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephfs + inotify

2021-10-08 Thread David Rivera
I see. This is true, I did monitor for changes on all clients involved.

On Fri, Oct 8, 2021, 12:27 Daniel Poelzleithner  wrote:

> On 08/10/2021 21:19, David Rivera wrote:
>
> > I've used inotify against a kernel mount a few months back. Worked fine
> for
> > me if I recall correctly.
>
> It can very much depend on the source of changes. It is easy to imagine
> changes originating from localhost get inotify events, while changes
> from other hosts might not.
>
> kind regards
>  poelzi
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs vs rbd

2021-10-08 Thread Marc
> I was wondering about performance differences between cephfs and rbd, so
> I devised this quick test. The results were pretty surprising to me.
> 
> The test: on a very idle machine, make 2 mounts. One is a cephfs mount,
> the other an rbd mount. In each directory, copy a humongous .tgz file
> (1.5 TB) and try to untar the file into the directory. The untar on the
> cephfs directory took slightly over 2 hours, but on the rbd directory it
> took almost a whole day. I repeated the test 3 times and the results
> were similar each time. Is there something I'm missing? Is RBD that much
> slower than cephfs (or is cephfs that much faster than RBD)? Are there
> any tuning options I can try to improve RBD performance?
> 

When I was testing between using cephfs or rbd in a VM, I noticed that cephfs
was around 25% faster; that was on Luminous.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs vs rbd

2021-10-08 Thread Jorge Garcia

> Please describe the client system.
> Are you using the same one for CephFS and RBD?

Yes

> Kernel version?

CentOS 8, kernel 4.18.0-240.15.1.el8_3.x86_64 (but I also tried on a CentOS 7
machine, with similar results)


> BM or VM?

Bare Metal

> KRBD or libvirt/librbd?

I'm assuming KRBD. I just followed a simple set of instructions from the manual:

  rbd list
  rbd map rbdtest
  mount /dev/rbd0 /mnt/rbd

> Which filesystem did you have on the RBD volume, with what mkfs parameters?


XFS filesystem, default parameters (i.e. "mkfs.xfs /dev/rbd0")

Everything was pretty much vanilla. This was a quick test...
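
For reference, the whole sequence on the client was roughly the following (the
image already existed, so the create step here is an assumption, not something I
actually ran):

  rbd create rbdtest --size 20T   # assumed; the image was already present
  rbd map rbdtest
  mkfs.xfs /dev/rbd0
  mount /dev/rbd0 /mnt/rbd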

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: bluefs _allocate unable to allocate

2021-10-08 Thread José H . Freidhof
Hi Igor,

"And was osd.2 redeployed AFTER settings had been reset to defaults ?"
A: YES

"Anything particular about current cluster use cases?"
A: we are using it temporary as a iscsi target for a vmware esxi cluster
with 6 hosts. We created two 10tb iscsi images/luns for vmware, because the
other datastore are at 90%.
We plan in the future, after ceph is working right, and stable to install
openstack and kvm and we want to convert all vms into rbd images.
Like i told you is a three osd nodes cluster with 32 cores and 256gb ram
and two 10g bond network cards on a 10g network

"E.g. is it a sort of regular usage (with load flukes and peak) or may be
some permanently running stress load testing. The latter might tend to hold
the resources and e.g. prevent from internal house keeping...
A: It's a SAN for VMware and there are 43 VMs running at the moment... During
the daytime there is more stress on the disks because people are working, and
in the afternoon the IOPS go down because the users are at home.
Nothing speculative...

There is something else that I noticed... if I reboot one OSD node with
20 OSDs, it takes 20 minutes for them to come up... if I tail the logs of the OSDs I
can see a lot of "recovery log mode 2" on all OSDs.
After the 20 minutes the OSDs come up one after another, the WAL DBs are small and
there are no errors in the logs about "bluefs _allocate unable to allocate"...

It seems that the problem shows up after a longer time (12h).


Am Fr., 8. Okt. 2021 um 15:24 Uhr schrieb Igor Fedotov <
igor.fedo...@croit.io>:

> And was osd.2 redeployed AFTER settings had been reset to defaults ?
>
> Anything particular about current cluster use cases?
>
> E.g. is it a sort of regular usage (with load fluctuations and peaks) or maybe
> some permanently running stress/load testing? The latter might tend to hold
> the resources and e.g. prevent internal housekeeping...
>
> Igor
>
>
> On 10/8/2021 12:16 AM, José H. Freidhof wrote:
>
> Hi Igor,
>
> yes the same problem is on osd.2
>
> We have 3 OSD nodes... each node has 20 BlueStore OSDs... in total we
> have 60 OSDs.
> I checked one node right now... and 15 of the 20 OSDs have this problem and
> error in the log.
>
> The settings that you complained about some emails ago I have reverted
> to default.
>
> ceph.conf file:
>
> [global]
> fsid = 462c44b4-eed6-11eb-8b2c-a1ad45f88a97
> mon_host = [v2:10.50.50.21:3300/0,v1:10.50.50.21:6789/0] [v2:
> 10.50.50.22:3300/0,v1:10.50.50.22:6789/0] [v2:
> 10.50.50.20:3300/0,v1:10.50.50.20:6789/0]
> log file = /var/log/ceph/$cluster-$type-$id.log
> max open files = 131072
> mon compact on trim = False
> osd deep scrub interval = 137438953472
> osd max scrubs = 16
> osd objectstore = bluestore
> osd op threads = 2
> osd scrub load threshold = 0.01
> osd scrub max interval = 137438953472
> osd scrub min interval = 137438953472
> perf = True
> rbd readahead disable after bytes = 0
> rbd readahead max bytes = 4194304
> throttler perf counter = False
>
> [client]
> rbd cache = False
>
>
> [mon]
> mon health preluminous compat = True
> mon osd down out interval = 300
>
> [osd]
> bluestore cache autotune = 0
> bluestore cache kv ratio = 0.2
> bluestore cache meta ratio = 0.8
> bluestore extent map shard max size = 200
> bluestore extent map shard min size = 50
> bluestore extent map shard target size = 100
> bluestore rocksdb options =
> compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
> osd map share max epochs = 100
> osd max backfills = 5
> osd op num shards = 8
> osd op num threads per shard = 2
> osd min pg log entries = 10
> osd max pg log entries = 10
> osd pg log dups tracked = 10
> osd pg log trim min = 10
>
>
>
> root@cd133-ceph-osdh-01:~# ceph config dump
> WHO     MASK   LEVEL     OPTION                   VALUE                                                                                          RO
> global         basic     container_image          docker.io/ceph/ceph@sha256:829ebf54704f2d827de00913b171e5da741aad9b53c1f35ad59251524790eceb   *
> global         advanced  leveldb_max_open_files   131072
> global         advanced  mon_compact_on_trim      false
> global         dev       ms_crc_data              false
> global         advanced  ...

[ceph-users] Re: cephadm adopt with another user than root

2021-10-08 Thread Daniel Pivonka
I'd have to test this to make sure it works, but I believe you can run 'ceph
cephadm set-user <user>':
https://docs.ceph.com/en/octopus/cephadm/operations/#configuring-a-different-ssh-user

after step 4 and before step 5 in the adoption guide
https://docs.ceph.com/en/pacific/cephadm/adoption/

and then in step 6 you need to copy the ssh key to your user instead of root
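
A minimal sketch of those steps (the username "cephadm" and host "host1" are only
examples, adjust for your environment):

   ceph cephadm set-user cephadm
   ceph cephadm get-pub-key > ~/ceph.pub
   ssh-copy-id -f -i ~/ceph.pub cephadm@host1   # repeat for each host in the cluster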

Let me know if that works for you. I will also test things myself if I have
a chance.

-Daniel Pivonka


On Fri, Oct 8, 2021 at 8:44 AM Luis Domingues wrote:

> Hello,
>
> On our test cluster, we are running the latest containerized Pacific, and we
> are testing the upgrade path to cephadm. But we do not want cephadm to use
> the root user to connect to other machines.
>
> We found how to set the ssh-user during bootstrapping, but not when
> adopting an existing cluster.
>
> Is there any way to set the ssh-user when adopting a cluster? I did not find
> a way to change the ssh-user in the documentation.
>
> Thanks,
> Luis Domingues
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Determining non-subvolume cephfs snapshot size

2021-10-08 Thread David Prude
Gregory,

   Thank you for taking the time to lay out this information.

-David

On 10/8/21 1:34 PM, Gregory Farnum wrote:
> On Fri, Oct 8, 2021 at 6:44 AM David Prude  wrote:
>> Hello,
>>
>> My apologies if this has been answered previously, but my attempts to
>> find an answer have failed me. I am trying to determine the canonical
>> manner for determining how much storage space a cephfs snapshot is
>> consuming. It seems that you can determine the size of the referenced
>> data by pulling the ceph.dir.rbytes attribute for the snap
>> directory, however there does not seem to be an attribute which
>> indicates the storage the snapshot itself is consuming:
>>
>> getfattr -d -m - daily_2021-10-07_191702
>> # file: daily_2021-10-07_191702
>> ceph.dir.entries="17"
>> ceph.dir.files="0"
>> ceph.dir.rbytes="6129426031788"
>> ceph.dir.rctime="1633653849.686409000"
>> ceph.dir.rentries="132588"
>> ceph.dir.rfiles="97679"
>> ceph.dir.rsubdirs="34909"
>> ceph.dir.subdirs="17"
> Yeah. Because all the allocations are handled by OSDs, and the OSDs
> and the MDS don't communicate about individual objects, the
> per-snapshot size differential is not actually tracked. Doing so is
> infeasible — it's known only by the OSD and potentially changes on
> every write to the live data, which is far too much communication to
> make happen while keeping any of these systems functional.
>
>> I have found in the documentation references to the command "ceph fs
>> subvolume snapshot info" which should be able to give snapshot size in
>> bytes for a snapshot of a subvolume, however we are not using
>> subvolumes.
> I am reasonably sure this doesn't do what you seem to want, either — I
> think it's just plugging in the rbytes value (much of the subvolume
> API exists so it can plug in to the OpenStack Manila interfaces).
>
>> If we assume a cephfs volume "volume" with a top-level
>> directory "directory" and an associated snapshot "snapshot":
>>
>> volume/directory/.snap/snapshot
>>
>> What is the best way to determine the size consumed by snapshot?
> If you really, REALLY need this, the only approach I can come up with
> is to traverse the snapshot and the live tree and identify changed
> files, and use some heuristic to guess about how much of the data is
> actually changed between them.
>
> But the basic problem is that data usage frequently doesn't belong to
> a snapshot, it belongs to a SET of snapshots, so even if we did the
> data gathering, we can't partition it out between them. If for
> instance your data flow looks like this:
> [write data A]
>  -- snapshot 1
> [write data B]
>  -- snapshot 2
>  -- snapshot 3
>  -- snapshot 4
> [write data C]
>  -- snapshot 5
>
> Then you might say that snapshot 2 is size 4 and snapshots 3 and 4 are
> size 0. But if you delete snapshot 2, you can't actually remove [data B],
> because it's required for snapshots 3 and 4.
> -Greg
>
>> Thank you,
>>
>> -David
>>
>>
>> --
>> David Prude
>> Systems Administrator
>> PGP Fingerprint: 1DAA 4418 7F7F B8AA F50C  6FDF C294 B58F A286 F847
>> Democracy Now!
>> www.democracynow.org
>>
>>

-- 
David Prude
Systems Administrator
PGP Fingerprint: 1DAA 4418 7F7F B8AA F50C  6FDF C294 B58F A286 F847
Democracy Now!
www.democracynow.org

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs vs rbd

2021-10-08 Thread Anthony D'Atri
Please describe the client system.
Are you using the same one for CephFS and RBD?
Kernel version?
BM or VM? KRBD or libvirt/librbd?
Which filesystem did you have on the RBD volume, with what mkfs parameters?


> 
> I was wondering about performance differences between cephfs and rbd, so I 
> devised this quick test. The results were pretty surprising to me.
> 
> The test: on a very idle machine, make 2 mounts. One is a cephfs mount, the 
> other an rbd mount. In each directory, copy a humongous .tgz file (1.5 TB) 
> and try to untar the file into the directory. The untar on the cephfs 
> directory took slightly over 2 hours, but on the rbd directory it took almost 
> a whole day. I repeated the test 3 times and the results were similar each 
> time. Is there something I'm missing? Is RBD that much slower than cephfs (or 
> is cephfs that much faster than RBD)? Are there any tuning options I can try 
> to improve RBD performance?
> 
>   # df -h | grep mnt
>   10.1.1.150:/  275T  1.5T  273T   1% /mnt/cephfs
>   /dev/rbd0  20T  1.5T   19T   8% /mnt/rbd
> 
>   bash-4.4$ pwd
>   /mnt/cephfs/test
>   bash-4.4$ date; time tar xf exceRptDB_v4_EXOGenomes.tgz; date
>   Fri Jul  2 13:10:01 PDT 2021
> 
>   real    137m22.601s
>   user    1m6.222s
>   sys     35m57.697s
>   Fri Jul  2 15:27:23 PDT 2021
> 
>   bash-4.4$ pwd
>   /mnt/rbd/test
>   bash-4.4$ date; time tar xf exceRptDB_v4_EXOGenomes.tgz; date
>   Fri Jul  2 15:38:28 PDT 2021
> 
>   real    1422m42.236s
>   user    1m34.198s
>   sys     38m48.761s
>   Sat Jul  3 15:21:10 PDT 2021
> 
> 
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephfs + inotify

2021-10-08 Thread Daniel Poelzleithner
On 08/10/2021 21:19, David Rivera wrote:

> I've used inotify against a kernel mount a few months back. Worked fine for
> me if I recall correctly.

It can very much depend on the source of changes. It is easy to imagine
changes originating from localhost get inotify events, while changes
from other hosts might not.

kind regards
 poelzi
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephfs vs rbd

2021-10-08 Thread Jorge Garcia
I was wondering about performance differences between cephfs and rbd, so 
I devised this quick test. The results were pretty surprising to me.


The test: on a very idle machine, make 2 mounts. One is a cephfs mount, 
the other an rbd mount. In each directory, copy a humongous .tgz file 
(1.5 TB) and try to untar the file into the directory. The untar on the 
cephfs directory took slightly over 2 hours, but on the rbd directory it 
took almost a whole day. I repeated the test 3 times and the results 
were similar each time. Is there something I'm missing? Is RBD that much 
slower than cephfs (or is cephfs that much faster than RBD)? Are there 
any tuning options I can try to improve RBD performance?


  # df -h | grep mnt
  10.1.1.150:/  275T  1.5T  273T   1% /mnt/cephfs
  /dev/rbd0  20T  1.5T   19T   8% /mnt/rbd

  bash-4.4$ pwd
  /mnt/cephfs/test
  bash-4.4$ date; time tar xf exceRptDB_v4_EXOGenomes.tgz; date
  Fri Jul  2 13:10:01 PDT 2021

  real    137m22.601s
  user    1m6.222s
  sys 35m57.697s
  Fri Jul  2 15:27:23 PDT 2021

  bash-4.4$ pwd
  /mnt/rbd/test
  bash-4.4$ date; time tar xf exceRptDB_v4_EXOGenomes.tgz; date
  Fri Jul  2 15:38:28 PDT 2021

  real    1422m42.236s
  user    1m34.198s
  sys 38m48.761s
  Sat Jul  3 15:21:10 PDT 2021
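
For anyone wanting a more controlled comparison than the untar above, fio could be
run against each mount point directly; a rough sketch (the job parameters are only
an example, not what produced the numbers above):

  fio --name=cephfs-write --directory=/mnt/cephfs/test --rw=write --bs=4M --size=10G --numjobs=4 --group_reporting
  fio --name=rbd-write --directory=/mnt/rbd/test --rw=write --bs=4M --size=10G --numjobs=4 --group_reporting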



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephfs + inotify

2021-10-08 Thread David Rivera
I've used inotify against a kernel mount a few months back. Worked fine for
me if I recall correctly.

On Fri, Oct 8, 2021, 08:20 Sean  wrote:

>  I don’t think this is possible, since CephFS is a network mounted
> filesystem. The inotify feature requires the kernel to be aware of file
> system changes. If the kernel is unaware of changes in a tracked directory,
> which is the case for all network mounted filesystems, then it can’t inform
> any watching process.
>
> If you use an RBD image, that should work, however. In that case the kernel
> sees the RBD image as a raw block device, and is in full control of the
> mounted filesystem.
>
> ~ Sean
>
>
> On Oct 6, 2021 at 11:55:25 AM, nORKy  wrote:
>
> > Hi,
> >
> > inotify does not work with cephfs. How can I make inotify work or build
> > an alternative on my C program ?
> >
> > Thanks you
> >
> > 'Joffrey
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Determining non-subvolume cephfs snapshot size

2021-10-08 Thread Gregory Farnum
On Fri, Oct 8, 2021 at 6:44 AM David Prude  wrote:
>
> Hello,
>
> My apologies if this has been answered previously, but my attempts to
> find an answer have failed me. I am trying to determine the canonical
> manner for determining how much storage space a cephfs snapshot is
> consuming. It seems that you can determine the size of the referenced
> data by pulling the ceph.dir.rbytes attribute for the snap
> directory, however there does not seem to be an attribute which
> indicates the storage the snapshot itself is consuming:
>
> getfattr -d -m - daily_2021-10-07_191702
> # file: daily_2021-10-07_191702
> ceph.dir.entries="17"
> ceph.dir.files="0"
> ceph.dir.rbytes="6129426031788"
> ceph.dir.rctime="1633653849.686409000"
> ceph.dir.rentries="132588"
> ceph.dir.rfiles="97679"
> ceph.dir.rsubdirs="34909"
> ceph.dir.subdirs="17"

Yeah. Because all the allocations are handled by OSDs, and the OSDs
and the MDS don't communicate about individual objects, the
per-snapshot size differential is not actually tracked. Doing so is
infeasible — it's known only by the OSD and potentially changes on
every write to the live data, which is far too much communication to
make happen while keeping any of these systems functional.

>
> I have found in the documentation references to the command "ceph fs
> subvolume snapshot info" which should be able to give snapshot size in
> bytes for a snapshot of a subvolume, however we are not using
> subvolumes.

I am reasonably sure this doesn't do what you seem to want, either — I
think it's just plugging in the rbytes value (much of the subvolume
API exists so it can plug in to the OpenStack Manila interfaces).

> If we assume a cephfs volume "volume" with a top-level
> directory "directory" and an associated snapshot "snapshot":
>
> volume/directory/.snap/snapshot
>
> What is the best way to determine the size consumed by snapshot?

If you really, REALLY need this, the only approach I can come up with
is to traverse the snapshot and the live tree and identify changed
files, and use some heuristic to guess about how much of the data is
actually changed between them.
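
One rough way to do that traversal is an rsync dry run between the snapshot and
the live tree; a sketch using the example paths quoted above (checksum comparison
is just one possible heuristic, and the itemized files only give an upper bound):

  # itemize files that differ between the snapshot and the live tree; no data is copied
  rsync -rcni volume/directory/.snap/snapshot/ volume/directory/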

But the basic problem is that data usage frequently doesn't belong to
a snapshot, it belongs to a SET of snapshots, so even if we did the
data gathering, we can't partition it out between them. If for
instance your data flow looks like this:

[write data A]
 -- snapshot 1
[write data B]
 -- snapshot 2
 -- snapshot 3
 -- snapshot 4
[write data C]
 -- snapshot 5

Then you might say that snapshot 2 is size 4 and snapshots 3 and 4 are
size 0. But if you delete snapshot 2, you can't actually remove [data B],
because it's required for snapshots 3 and 4.
-Greg

>
> Thank you,
>
> -David
>
>
> --
> David Prude
> Systems Administrator
> PGP Fingerprint: 1DAA 4418 7F7F B8AA F50C  6FDF C294 B58F A286 F847
> Democracy Now!
> www.democracynow.org
>
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-08 Thread Szabo, Istvan (Agoda)
Hi Igor,

Here is a bluestore tool fsck output:
https://justpaste.it/7igrb

Is this what you are looking for?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Igor Fedotov 
Sent: Tuesday, October 5, 2021 10:02 PM
To: Szabo, Istvan (Agoda) ; 胡 玮文 
Cc: ceph-users@ceph.io; Eugen Block 
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)



Not sure dmcrypt is a culprit here.

Could you please set debug-bluefs to 20 and collect an OSD startup log.
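
A minimal sketch of that, assuming the centralized config and a cephadm
deployment (osd.48 is taken from the crash report below; adjust the id and the
restart command to your setup):

  ceph config set osd.48 debug_bluefs 20/20
  ceph orch daemon restart osd.48      # or: systemctl restart ceph-osd@48 on non-cephadm hosts
  # then collect the startup portion of the OSD log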


On 10/5/2021 4:43 PM, Szabo, Istvan (Agoda) wrote:
Hmm, I tried another one whose disk hasn't spilled over, and it still core dumped ☹
Is there anything special that we need to do before we migrate the DB next to the
block device? Our OSDs are using dmcrypt; is it an issue?

{
"backtrace": [
"(()+0x12b20) [0x7f310aa49b20]",
"(gsignal()+0x10f) [0x7f31096aa37f]",
"(abort()+0x127) [0x7f3109694db5]",
"(()+0x9009b) [0x7f310a06209b]",
"(()+0x9653c) [0x7f310a06853c]",
"(()+0x95559) [0x7f310a067559]",
"(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
"(()+0x10b03) [0x7f3109a48b03]",
"(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
"(__cxa_throw()+0x3b) [0x7f310a0687eb]",
"(()+0x19fa4) [0x7f310b7b6fa4]",
"(tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x146) 
[0x7f310b7d8c96]",
"(()+0x10d0f8e) [0x55ffa520df8e]",
"(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",
"(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",
"(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a) 
[0x55ffa52efcca]",
"(rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x88) [0x55ffa52f0568]",
"(rocksdb::VersionSet::~VersionSet()+0x5e) [0x55ffa520e01e]",
"(rocksdb::VersionSet::~VersionSet()+0x11) [0x55ffa520e261]",
"(rocksdb::DBImpl::CloseHelper()+0x616) [0x55ffa5155ed6]",
"(rocksdb::DBImpl::~DBImpl()+0x83b) [0x55ffa515c35b]",
"(rocksdb::DBImplReadOnly::~DBImplReadOnly()+0x11) [0x55ffa51a3bc1]",
"(rocksdb::DB::OpenForReadOnly(rocksdb::DBOptions const&, 
std::__cxx11::basic_string, std::allocator > 
const&, std::vector > const&, 
std::vector >*, rocksdb::DB**, bool)+0x1089) 
[0x55ffa51a57e9]",
"(RocksDBStore::do_open(std::ostream&, bool, bool, 
std::vector 
> const*)+0x14ca) [0x55ffa51285ca]",
"(BlueStore::_open_db(bool, bool, bool)+0x1314) [0x55ffa4bc27e4]",
"(BlueStore::_open_db_and_around(bool)+0x4c) [0x55ffa4bd4c5c]",
"(BlueStore::_mount(bool, bool)+0x847) [0x55ffa4c2e047]",
"(OSD::init()+0x380) [0x55ffa4753a70]",
"(main()+0x47f1) [0x55ffa46a6901]",
"(__libc_start_main()+0xf3) [0x7f3109696493]",
"(_start()+0x2e) [0x55ffa46d4e3e]"
],
"ceph_version": "15.2.14",
"crash_id": 
"2021-10-05T13:31:28.513463Z_b6818598-4960-4ed6-942a-d4a7ff37a758",
"entity_name": "osd.48",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig": 
"6a43b6c219adac393b239fbea4a53ff87c4185bcd213724f0d721b452b81ddbf",
"timestamp": "2021-10-05T13:31:28.513463Z",
"utsname_hostname": "server-2s07",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Wed Sep 15 15:39:39 UTC 2021"
}
Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: 胡 玮文 
Sent: Monday, October 4, 2021 12:13 AM
To: Szabo, Istvan (Agoda) 
; Igor Fedotov 

Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an
external device (nvme)


The stack trace (tcmalloc::allocate_full_cpp_throw_oom) seems to indicate that you
don't have enough memory.

From: Szabo, Istvan (Agoda)
Sent: 4 October 2021 0:46
To: Igor Fedotov
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: is it possible to remove the db+wal from an external
device (nvme)

Seems like it cannot start anymore once migrated ☹

https://justpaste.it/5hkot

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: 

[ceph-users] cephadm adopt with another user than root

2021-10-08 Thread Luis Domingues
Hello,

On our test cluster, we are running the latest containerized Pacific, and we are
testing the upgrade path to cephadm. But we do not want cephadm to use the root
user to connect to other machines.

We found how to set the ssh-user during bootstrapping, but not when adopting an 
existing cluster.

Is there any way to set the ssh-user when adopting a cluster? I did not find a way
to change the ssh-user in the documentation.

Thanks,
Luis Domingues
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephfs + inotify

2021-10-08 Thread Sean
 I don’t think this is possible, since CephFS is a network mounted
filesystem. The inotify feature requires the kernel to be aware of file
system changes. If the kernel is unaware of changes in a tracked directory,
which is the case for all network mounted filesystems, then it can’t inform
any watching process.

If you use an RBD image, that should work, however. In that case the kernel
sees the RBD image as a raw block device, and is in full control of the
mounted filesystem.
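
A quick way to see the difference, assuming inotify-tools is installed (the mount
points below are only examples):

  inotifywait -m /mnt/rbd/test &        # watch a directory on an RBD-backed local filesystem
  touch /mnt/rbd/test/hello             # this local write produces an event
  # the same watch on a CephFS mount will generally miss changes made by other clients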

~ Sean


On Oct 6, 2021 at 11:55:25 AM, nORKy  wrote:

> Hi,
>
> inotify does not work with cephfs. How can I make inotify work or build
> an alternative on my C program ?
>
> Thanks you
>
> 'Joffrey
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Determining non-subvolume cephfs snapshot size

2021-10-08 Thread David Prude
Hello,

    My apologies if this has been answered previously, but my attempts to
find an answer have failed me. I am trying to determine the canonical
manner for determining how much storage space a cephfs snapshot is
consuming. It seems that you can determine the size of the referenced
data by pulling the ceph.dir.rbytes attribute for the snap
directory, however there does not seem to be an attribute which
indicates the storage the snapshot itself is consuming:

getfattr -d -m - daily_2021-10-07_191702
# file: daily_2021-10-07_191702
ceph.dir.entries="17"
ceph.dir.files="0"
ceph.dir.rbytes="6129426031788"
ceph.dir.rctime="1633653849.686409000"
ceph.dir.rentries="132588"
ceph.dir.rfiles="97679"
ceph.dir.rsubdirs="34909"
ceph.dir.subdirs="17"
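
(For comparison, the referenced data of the snapshot and of the live directory can
be read individually; the paths are the example ones used further below:)

  getfattr -n ceph.dir.rbytes volume/directory/.snap/snapshot
  getfattr -n ceph.dir.rbytes volume/directory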

I have found in the documentation references to the command "ceph fs
subvolume snapshot info" which should be able to give snapshot size in
bytes for a snapshot of a subvolume, however we are not using
subvolumes. If we assume a cephfs volume "volume" with a top-level
directory "directory" and an associated snapshot "snapshot":

volume/directory/.snap/snapshot

What is the best way to determine the size consumed by snapshot?

Thank you,

-David


-- 
David Prude
Systems Administrator
PGP Fingerprint: 1DAA 4418 7F7F B8AA F50C  6FDF C294 B58F A286 F847
Democracy Now!
www.democracynow.org


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: bluefs _allocate unable to allocate

2021-10-08 Thread Igor Fedotov

And was osd.2 redeployed AFTER settings had been reset to defaults ?

Anything particular about current cluster use cases?

E.g. is it a sort of regular usage (with load fluctuations and peaks) or maybe
some permanently running stress/load testing? The latter might tend to
hold the resources and e.g. prevent internal housekeeping...


Igor


On 10/8/2021 12:16 AM, José H. Freidhof wrote:


Hi Igor,

yes the same problem is on osd.2

We have 3 OSD nodes... each node has 20 BlueStore OSDs... in total we
have 60 OSDs.
I checked one node right now... and 15 of the 20 OSDs have this problem
and error in the log.


The settings that you complained about some emails ago I have
reverted to default.
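
For the options that sit in the monitor config database rather than only in
ceph.conf, reverting looks roughly like this (osd_max_scrubs and
osd_deep_scrub_interval are two of the options from the dump below):

  ceph config rm global osd_max_scrubs
  ceph config rm global osd_deep_scrub_interval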


ceph.conf file:

[global]
        fsid = 462c44b4-eed6-11eb-8b2c-a1ad45f88a97
        mon_host = [v2:10.50.50.21:3300/0,v1:10.50.50.21:6789/0] [v2:10.50.50.22:3300/0,v1:10.50.50.22:6789/0] [v2:10.50.50.20:3300/0,v1:10.50.50.20:6789/0]

        log file = /var/log/ceph/$cluster-$type-$id.log
        max open files = 131072
        mon compact on trim = False
        osd deep scrub interval = 137438953472
        osd max scrubs = 16
        osd objectstore = bluestore
        osd op threads = 2
        osd scrub load threshold = 0.01
        osd scrub max interval = 137438953472
        osd scrub min interval = 137438953472
        perf = True
        rbd readahead disable after bytes = 0
        rbd readahead max bytes = 4194304
        throttler perf counter = False

[client]
        rbd cache = False


[mon]
        mon health preluminous compat = True
        mon osd down out interval = 300

[osd]
        bluestore cache autotune = 0
        bluestore cache kv ratio = 0.2
        bluestore cache meta ratio = 0.8
        bluestore extent map shard max size = 200
        bluestore extent map shard min size = 50
        bluestore extent map shard target size = 100
        bluestore rocksdb options = 
compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB

        osd map share max epochs = 100
        osd max backfills = 5
        osd op num shards = 8
        osd op num threads per shard = 2
        osd min pg log entries = 10
        osd max pg log entries = 10
        osd pg log dups tracked = 10
        osd pg log trim min = 10



root@cd133-ceph-osdh-01:~# ceph config dump
WHO     MASK   LEVEL     OPTION                                 VALUE                                                                                          RO
global         basic     container_image                        docker.io/ceph/ceph@sha256:829ebf54704f2d827de00913b171e5da741aad9b53c1f35ad59251524790eceb   *
global         advanced  leveldb_max_open_files                 131072
global         advanced  mon_compact_on_trim                    false
global         dev       ms_crc_data                            false
global         advanced  osd_deep_scrub_interval                1209600.00
global         advanced  osd_max_scrubs                         16
global         advanced  osd_scrub_load_threshold               0.01
global         advanced  osd_scrub_max_interval                 1209600.00
global         advanced  osd_scrub_min_interval                 86400.00
global         advanced  perf                                   true
global         advanced  rbd_readahead_disable_after_bytes      0
global         advanced  rbd_readahead_max_bytes                4194304
global         advanced  throttler_perf_counter                 false
mon            advanced  auth_allow_insecure_global_id_reclaim  false
mon            advanced  cluster_network                        10.50.50.0/24                                                                                  *
mon            advanced  mon_osd_down_out_interval              300
mon            advanced  public_network                         10.50.50.0/24                                                                                  *
mgr            advanced  mgr/cephadm/container_init             True                                                                                           *
mgr            advanced  mgr/cephadm/device_enhanced_scan       true                                                                                           *
mgr            advanced  mgr/cephadm/migration_current          2                                                                                              *
mgr            advanced  mgr/cephadm/warn_on_stray_daemons      false                                                                                          *
mgr            advanced  mgr/cephadm/warn_on_stray_hosts        false                                                                                          *

[ceph-users] Re: Edit crush rule

2021-10-08 Thread ceph-users
Excellent - thank you!


From: Konstantin Shalygin 'k0ste at k0ste.ru' 

Sent: 07 October 2021 10:10
To: ceph-us...@hovr.anonaddy.com 
Subject: Re: [ceph-users] Edit crush rule

Hi,

On 7 Oct 2021, at 11:03, 
ceph-us...@hovr.anonaddy.com wrote:

Following on this, are there any restrictions or issues with setting a new rule
on a pool, except for the resulting backfilling?

Nope


I can't see anything specific about it in the documentation, just the command 
ceph osd pool set {pool-name} crush_rule with the description


crush_rule
    Description: The rule to use for mapping object placement in the cluster.
    Type: String

Does it work equally (well) for replicated and ec, changing m/k values, etc?

For EC, changing the crush rule is not trivial - you can't change the EC profile after
pool deployment. For a replicated pool it can be done at any time.
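
For the replicated case, a minimal sketch (rule and pool names are only examples;
"default" and "host" are the usual CRUSH root and failure domain):

  ceph osd crush rule create-replicated my-new-rule default host
  ceph osd pool set mypool crush_rule my-new-rule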



k




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Broken mon state after (attempted) 16.2.5 -> 16.2.6 upgrade

2021-10-08 Thread Jonathan D. Proulx
Hi Patrick,

Yes we had been successfully running on Pacific  v16.2.5

Thanks for the pointer to the bug, we eventually ended up taking
everything down and rebuilding the monstore using
monstore-tool. Perhaps a longer and less pleasant path than necessary
but it was effective.

-Jon

On Thu, Oct 07, 2021 at 09:11:21PM -0400, Patrick Donnelly wrote:
:Hello Jonathan,
:
:On Tue, Oct 5, 2021 at 9:13 AM Jonathan D. Proulx  wrote:
:>
:> In the middle of a normal cephadm upgrade from 16.2.5 to 16.2.6, after the
:> mgrs had successfully upgraded, 2/5 mons didn't come back up (and the upgrade
:> stopped at that point). Attempting to manually restart the crashed mons
:> resulted in **all** of the other mons crashing too, usually with:
:>
:> terminate called after throwing an instance of
:> 'ceph::buffer::v15_2_0::malformed_input' what(): void
:> FSMap::decode(ceph::buffer::v15_2_0::list::const_iterator&) no longer
:> understand old encoding version v < 7: Malformed input
:
:You upgraded from v16.2.5 and not Octopus? I would expect your cluster
:to crash when upgrading to any version of Pacific:
:
:https://tracker.ceph.com/issues/51673
:
:Only the crash error has changed from an assertion to an exception.
:
:-- 
:Patrick Donnelly, Ph.D.
:He / Him / His
:Principal Software Engineer
:Red Hat Sunnyvale, CA
:GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
:

-- 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io