[ceph-users] How RBD tcp connection works

2019-08-19 Thread fengyd
Hi,

I have a question about TCP connections.
In the test environment, OpenStack uses Ceph RBD as backend storage.
I created a VM and attached a volume/image to the VM.
I monitored how many FDs were used by the QEMU process.
I used the dd command to fill the whole volume/image.
I found that the FD count increased, then stabilized at a fixed value after
some time.

I think that when reading/writing the volume/image, TCP connections need to be
established, which requires FDs, so the FD count may increase.
But after the reading/writing finishes, why doesn't the FD count decrease?

Thanks in advance.
BR.
Yafeng


Re: [ceph-users] How RBD tcp connection works

2019-08-19 Thread Eliza

Hi,

on 2019/8/19 16:10, fengyd wrote:
I think that when reading/writing the volume/image, TCP connections need to be
established, which requires FDs, so the FD count may increase.

But after the reading/writing finishes, why doesn't the FD count decrease?


The TCP connections may be long-lived (persistent) connections.


Re: [ceph-users] How RBD tcp connection works

2019-08-19 Thread huang jun
How long did you monitor after the reads/writes finished?
There is a config option named 'ms_connection_idle_timeout' whose
default value is 900 (seconds).
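
For reference, a sketch of how one might check and override it (the option isn't
written to any file by default, 900 is the built-in default; the daemon id below
is an example, and librbd clients such as QEMU pick up overrides from ceph.conf):

# ceph daemon osd.0 config get ms_connection_idle_timeout
# ceph daemon osd.0 config show | grep ms_connection_idle_timeout

and to change it, something like this in ceph.conf:

[global]
ms_connection_idle_timeout = 1800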

fengyd wrote on Mon, 19 Aug 2019 at 16:10:
>
> Hi,
>
> I have a question about tcp connection.
> In the test environment, openstack uses ceph RBD as backend storage.
> I created a VM and attache a volume/image to the VM.
> I monitored how many fd was used by Qemu process.
> I used the command dd to fill the whole volume/image.
> I found that the FD count was increased, and stable at a fixed value after 
> some time.
>
> I think when reading/writing to volume/image, tcp connection needs to be 
> established which needs FD, then the FD count may increase.
> But after reading/writing, why the FD count doesn't descrease?
>
> Thanks in advance.
> BR.
> Yafeng


Re: [ceph-users] How RBD tcp connection works

2019-08-19 Thread fengyd
- How long did you monitor after the reads/writes finished?
More than 900 seconds.

I executed the following command last Saturday and again today; the output was
the same.
sudo lsof -p 5509 | wc -l

And the result from /proc:
ls -ltr /proc/5509/fd | grep socket | grep "Aug 13" | wc -l
134
 sudo ls -ltr /proc/5509/fd | grep socket | grep "Aug 19" | wc -l
0

In which configuration file can I find ms_connection_idle_timeout?

On Mon, 19 Aug 2019 at 16:26, huang jun  wrote:

> how long do you monitor after r/w finish?
> there is a configure item named 'ms_connection_idle_timeout' which
> default value is 900
>
> fengyd wrote on Mon, 19 Aug 2019 at 16:10:
> >
> > Hi,
> >
> > I have a question about tcp connection.
> > In the test environment, openstack uses ceph RBD as backend storage.
> > I created a VM and attache a volume/image to the VM.
> > I monitored how many fd was used by Qemu process.
> > I used the command dd to fill the whole volume/image.
> > I found that the FD count was increased, and stable at a fixed value
> after some time.
> >
> > I think when reading/writing to volume/image, tcp connection needs to be
> established which needs FD, then the FD count may increase.
> > But after reading/writing, why the FD count doesn't descrease?
> >
> > Thanks in advance.
> > BR.
> > Yafeng


[ceph-users] Correct number of pg

2019-08-19 Thread Jake Grimmett
Dear All,

We have a new Nautilus cluster, used for cephfs, with pg_autoscaler in
warn mode.

Shortly after hitting 62% full, the autoscaler started warning that we
have too few PGs:

*
Pool ec82pool has 4096 placement groups, should have 16384
*

The pool is 62% full, we have 450 OSDs, and are using k=8 m=2 erasure
coding.

Does 16384 pg seem reasonable?

The on-line pg calculator suggests 4096...

https://ceph.io/pgcalc/

(Size = 10, OSD count = 450, %Data = 100, Target PGs per OSD = 100)
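
As a rough sanity check of those numbers (just a sketch of the usual rule of
thumb, with k+m = 10 PG shards per PG):

echo $((4096 * 10 / 450))     # ~91 PG shards per OSD with pg_num=4096
echo $((16384 * 10 / 450))    # ~364 per OSD with pg_num=16384

91 is close to the usual ~100-per-OSD target, which matches the calculator.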

many thanks,

Jake

-- 
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.



Re: [ceph-users] How RBD tcp connection works

2019-08-19 Thread fengyd
I collected the lsof output at different times and found that:
the total number of open FDs is stable at a fixed value, while some of the TCP
connections have changed.


On Mon, 19 Aug 2019 at 16:42, fengyd  wrote:

> -how long do you monitor after r/w finish?
> More than 900 seconds.
>
> I executed the following command last Saturday and today, the output was
> same.
> sudo lsof -p 5509 | wc -l
>
> And the result from /proc:
> ls -ltr /proc/5509/fd | grep socket | grep "Aug 13" | wc -l
> 134
>  sudo ls -ltr /proc/5509/fd | grep socket | grep "Aug 19" | wc -l
> 0
>
> In which configuration file can I find ms_connection_idle_timeout?
>
> On Mon, 19 Aug 2019 at 16:26, huang jun  wrote:
>
>> how long do you monitor after r/w finish?
>> there is a configure item named 'ms_connection_idle_timeout' which
>> default value is 900
>>
>> fengyd wrote on Mon, 19 Aug 2019 at 16:10:
>> >
>> > Hi,
>> >
>> > I have a question about tcp connection.
>> > In the test environment, openstack uses ceph RBD as backend storage.
>> > I created a VM and attache a volume/image to the VM.
>> > I monitored how many fd was used by Qemu process.
>> > I used the command dd to fill the whole volume/image.
>> > I found that the FD count was increased, and stable at a fixed value
>> after some time.
>> >
>> > I think when reading/writing to volume/image, tcp connection needs to
>> be established which needs FD, then the FD count may increase.
>> > But after reading/writing, why the FD count doesn't descrease?
>> >
>> > Thanks in advance.
>> > BR.
>> > Yafeng


[ceph-users] How does CephFS find a file?

2019-08-19 Thread aot...@outlook.com
I am a student, new to CephFS. I think there are two steps to finding a file:

1. Find out which objects belong to the file.

2. Use CRUSH to find the OSDs.

What I don't know is how CephFS gets the object list of a file. Does the MDS
save the object lists of all files? Or can CRUSH use some information (what
information?) to calculate the list of objects? In other words, where is the
object list of the file saved?


Re: [ceph-users] Correct number of pg

2019-08-19 Thread Paul Emmerich
On Mon, Aug 19, 2019 at 10:51 AM Jake Grimmett  wrote:
>
> Dear All,
>
> We have a new Nautilus cluster, used for cephfs, with pg_autoscaler in
> warn mode.
>
> Shortly after hitting 62% full, the autoscaler started warning that we
> have too few pg:
>
> *
> Pool ec82pool has 4096 placement groups, should have 16384
> *
>
> The pool is 62% full, we have 450 OSD, and are using 8 k=8 m=2 Erasure
> encoding.
>
> Does 16384 pg seem reasonable?

No, that would be a horrible value for a cluster of that size; 4096 is
perfect here.


Paul

>
> The on-line pg calculator suggests 4096...
>
> https://ceph.io/pgcalc/
>
> (Size = 10, OSD=450, %Data=100, Target OSD 100)
>
> many thanks,
>
> Jake
>
> --
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
>


Re: [ceph-users] Correct number of pg

2019-08-19 Thread Jake Grimmett
Wonderful, we will leave our pg at 4096 :)

many thanks for the advice Paul :)

have a good day,

Jake

On 8/19/19 11:03 AM, Paul Emmerich wrote:
> On Mon, Aug 19, 2019 at 10:51 AM Jake Grimmett  wrote:
>>
>> Dear All,
>>
>> We have a new Nautilus cluster, used for cephfs, with pg_autoscaler in
>> warn mode.
>>
>> Shortly after hitting 62% full, the autoscaler started warning that we
>> have too few pg:
>>
>> *
>> Pool ec82pool has 4096 placement groups, should have 16384
>> *
>>
>> The pool is 62% full, we have 450 OSD, and are using 8 k=8 m=2 Erasure
>> encoding.
>>
>> Does 16384 pg seem reasonable?
> 
> no, that would be a horrible value for a cluster of that size, 4096 is
> perfect here.
> 
> 
> Paul
> 
>>
>> The on-line pg calculator suggests 4096...
>>
>> https://ceph.io/pgcalc/
>>
>> (Size = 10, OSD=450, %Data=100, Target OSD 100)
>>
>> many thanks,
>>
>> Jake
>>
>> --
>> MRC Laboratory of Molecular Biology
>> Francis Crick Avenue,
>> Cambridge CB2 0QH, UK.
>>
-- 
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.



Re: [ceph-users] MDSs report damaged metadata - "return_code": -116

2019-08-19 Thread Lars Täuber
Hi all!

Where can I look up what the error number means?
Or did I do something wrong in my command line?

Thanks in advance,
Lars

Fri, 16 Aug 2019 13:31:38 +0200
Lars Täuber  ==> Paul Emmerich  :
> Hi Paul,
> 
> thank you for your help. But I get the following error:
> 
> # ceph tell mds.mds3 scrub start 
> "~mds0/stray7/15161f7/dovecot.index.backup" repair
> 2019-08-16 13:29:40.208 7f7e927fc700  0 client.881878 ms_handle_reset on 
> v2:192.168.16.23:6800/176704036
> 2019-08-16 13:29:40.240 7f7e937fe700  0 client.867786 ms_handle_reset on 
> v2:192.168.16.23:6800/176704036
> {
> "return_code": -116
> }
> 
> 
> 
> Lars
> 
> 
> Fri, 16 Aug 2019 13:17:08 +0200
> Paul Emmerich  ==> Lars Täuber  :
> > Hi,
> > 
> > damage_type backtrace is rather harmless and can indeed be repaired
> > with the repair command, but it's called scrub_path.
> > Also you need to pass the name and not the rank of the MDS as id, it should 
> > be
> > 
> > # (on the server where the MDS is actually running)
> > ceph daemon mds.mds3 scrub_path ...
> > 
> > But you should also be able to use ceph tell since nautilus which is a
> > little bit easier because it can be run from any node:
> > 
> > ceph tell mds.mds3 scrub start 'PATH' repair
> > 
> > 
> > Paul
> >   


Re: [ceph-users] MDSs report damaged metadata - "return_code": -116

2019-08-19 Thread Paul Emmerich
Hi,

that error (-116 is ESTALE, "stale file handle", on Linux) just says that the
path is wrong. I unfortunately don't know the correct way to instruct it to
scrub a stray path off the top of my head; you can always run a recursive
scrub on / to go over everything, though
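
If I remember the syntax right, that recursive repair from the root would look
something like this (the scrub options are comma-delimited):

# ceph tell mds.mds3 scrub start / recursive,repair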


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Aug 19, 2019 at 12:55 PM Lars Täuber  wrote:
>
> Hi all!
>
> Where can I look up what the error number means?
> Or did I something wrong in my command line?
>
> Thanks in advance,
> Lars
>
> Fri, 16 Aug 2019 13:31:38 +0200
> Lars Täuber  ==> Paul Emmerich  :
> > Hi Paul,
> >
> > thank you for your help. But I get the following error:
> >
> > # ceph tell mds.mds3 scrub start 
> > "~mds0/stray7/15161f7/dovecot.index.backup" repair
> > 2019-08-16 13:29:40.208 7f7e927fc700  0 client.881878 ms_handle_reset on 
> > v2:192.168.16.23:6800/176704036
> > 2019-08-16 13:29:40.240 7f7e937fe700  0 client.867786 ms_handle_reset on 
> > v2:192.168.16.23:6800/176704036
> > {
> > "return_code": -116
> > }
> >
> >
> >
> > Lars
> >
> >
> > Fri, 16 Aug 2019 13:17:08 +0200
> > Paul Emmerich  ==> Lars Täuber  :
> > > Hi,
> > >
> > > damage_type backtrace is rather harmless and can indeed be repaired
> > > with the repair command, but it's called scrub_path.
> > > Also you need to pass the name and not the rank of the MDS as id, it 
> > > should be
> > >
> > > # (on the server where the MDS is actually running)
> > > ceph daemon mds.mds3 scrub_path ...
> > >
> > > But you should also be able to use ceph tell since nautilus which is a
> > > little bit easier because it can be run from any node:
> > >
> > > ceph tell mds.mds3 scrub start 'PATH' repair
> > >
> > >
> > > Paul
> > >


Re: [ceph-users] Ceph Balancer code

2019-08-19 Thread Burkhard Linke

Hi,


On 8/18/19 12:06 AM, EDH - Manuel Rios Fernandez wrote:


Hi ,

What's the reason for not allowing the balancer to move PGs if objects are
inactive/misplaced, at least in Nautilus 14.2.2?


https://github.com/ceph/ceph/blob/master/src/pybind/mgr/balancer/module.py#L874



*snipsnap*


We can understand that the balancer can't work with unknown PG states and
inactive states. But… missing and misplaced…




The degraded state indicates that some data is missing within the PG, or
one replica is not up to date. See below for an example.


Hope some developer can clarify that. These lines cause a lot of
problems, at least in Nautilus 14.2.2.


Case example:

  * Pool size 1, upgraded to size 2. The cluster goes to HEALTH_WARN with
misplaced and degraded objects. Some objects don't recover from the
degraded state due to "backfill_toofull", because the OSDs became full
instead of being evenly distributed and balanced, since the balancer
code excludes them.

Updating to size 2 requires all PGs to have two replicas. After
changing the size setting, the PGs will be undersized+degraded, since
only one instance exists (-> undersized). The second replica will be
created during backfilling, and once the complete content of the PG has
been copied the state will change to active+clean.


A degraded state can also happen during the restart of an OSD. If writes are
not blocked during the restart (e.g. there are enough replicas
active), the other instances of a PG will have updated data. This data
needs to be replicated to the restarted OSD after it is available again.
A similar situation happens during balancing or moving PGs in general; if
a PG has not been transferred completely yet, new writes may be sent to either
the old set of OSDs (and need to be backfilled afterwards), or sent to
the new set (and are considered degraded, since from a point-in-time
view of the cluster they are not present on the acting set of OSDs). I'm
not 100% sure which way is implemented in Ceph; gut feeling points to
the latter.



Degraded thus refers to a state where a PG does not fulfill its
replication requirements, and it should therefore be handled as an error or
warning state. And you do not want the balancer to interfere with this
state.
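
For what it's worth, a quick way to check whether the balancer currently
thinks it can do anything (a sketch; both commands come with the mgr
balancer module):

# ceph balancer status
# ceph balancer eval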



Regards,

Burkhard




Re: [ceph-users] MDSs report damaged metadata - "return_code": -116

2019-08-19 Thread Lars Täuber
Hi Paul,

thanks for the hint.

I did a recursive scrub from "/". The log says there were some inodes with bad
backtraces repaired. But the error remains.
Could this have something to do with a deleted file? Or a file within a snapshot?

The path told by

# ceph tell mds.mds3 damage ls
2019-08-19 13:43:04.608 7f563f7f6700  0 client.894552 ms_handle_reset on 
v2:192.168.16.23:6800/176704036
2019-08-19 13:43:04.624 7f56407f8700  0 client.894558 ms_handle_reset on 
v2:192.168.16.23:6800/176704036
[
{
"damage_type": "backtrace",
"id": 3760765989,
"ino": 1099518115802,
"path": "~mds0/stray7/15161f7/dovecot.index.backup"
}
]

starts with something that looks a bit strange to me.

Are the snapshots also repaired with a recursive repair operation?

Thanks
Lars


Mon, 19 Aug 2019 13:30:53 +0200
Paul Emmerich  ==> Lars Täuber  :
> Hi,
> 
> that error just says that the path is wrong. I unfortunately don't
> know the correct way to instruct it to scrub a stray path off the top
> of my head; you can always run a recursive scrub on / to go over
> everything, though
> 
> 
> Paul
> 


-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de


Re: [ceph-users] How does CephFS find a file?

2019-08-19 Thread Robert LeBlanc
I'm fairly new to CephFS, but in my poking around with it, this is what I
understand.

The MDS manages dentries as omap (simple key/value database) entries in the
metadata pool. Each dentry keeps a list of filenames and some metadata
about each file, such as the inode number and some other info such as size, I
presume (I can't find documentation outlining the binary format of the
omap; I just dug enough to find the inode location). The MDS can
return the inode and size to the client, and the client looks up the OSDs
for the inode using the CRUSH map, dividing the size by the stripe size
to know how many objects to fetch for the whole file. The file is stored
in objects named by the inode (in hex) followed by the object offset. The inode
corresponds to the same value shown by `ls -li` in CephFS, converted to hex.
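
A quick illustration of that last point (a sketch; the mount path is a
placeholder):

# printf '%x\n' $(stat -c %i /mnt/cephfs/somefile)

With the default layout, the first 4 MiB of that file then lives in an object
named <that-hex-inode>.00000000 in the data pool, the next 4 MiB in
<that-hex-inode>.00000001, and so on.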

I hope that is correct and useful as a starting point for you.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Aug 19, 2019 at 2:37 AM aot...@outlook.com 
wrote:

> I am a student new to cephfs. I think there are 2 steps to finding a file:
>
> 1.Find out which objects belong to this file.
>
> 2.Use CRUSH to find out OSDs.
>
>
>
> What I don’t know is how does CephFS get the object list of the file. Does
> MDS save all object list of all files? Or CRUSH can use some
> information(what information?) to calculate the list of objects? In other
> words, where is the object list of the file saved?


[ceph-users] lz4 compression?

2019-08-19 Thread Jake Grimmett
Dear all,

I've not seen posts from people using LZ4 compression, and wondered what
other people's experiences are, if they have tried LZ4 on Nautilus.

Since enabling LZ4 we have copied 1.9 PB into a pool without problem.

However, if "ceph df detail" is accurate, we are not getting much
compression. Using LZ4 with ZFS we would normally expect to see ~1.2x on
this data.

[root@ceph-s1 ~]# ceph df detail

POOL      ID  STORED   OBJECTS  USED     %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY    USED COMPR  UNDER COMPR
ec82pool   2  1.5 PiB  559.16M  1.9 PiB  62.66    914 TiB            N/A          N/A  559.16M      22 TiB       41 TiB

Compression enabled as follows.
# ceph osd pool set ec82pool compression_algorithm lz4
set pool 2 compression_algorithm to lz4
# ceph osd pool set ec82pool compression_mode force
set pool 2 compression_mode to force
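
If I read the USED COMPR / UNDER COMPR columns right, only ~41 TiB of the
1.9 PiB ever went through the compressor (and ended up in ~22 TiB), which
would point at the OSD-side BlueStore settings rather than at the algorithm.
A sketch of how to check what the OSDs are actually running with (osd.0 is
an example; run it on the host that has that OSD's admin socket):

# ceph daemon osd.0 config show | grep compression
# ceph daemon osd.0 perf dump | grep -i compress

The first shows the effective bluestore_compression_* options (required
ratio, min/max blob sizes); the second, if your release has the counters,
shows how often compression succeeded or was rejected.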

Any thoughts?

Jake

-- 
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.



Re: [ceph-users] How does CephFS find a file?

2019-08-19 Thread Patrick Donnelly
On Mon, Aug 19, 2019 at 7:50 AM Robert LeBlanc  wrote:
> The MDS manages dentries as omap (simple key/value database) entries in the 
> metadada pool. Each dentry keeps a list of filenames and some metadata about 
> the file such as inode number and some other info such as size I presume 
> (can't find a documentation outlining the binary format of the omap, just did 
> enough digging to find the inode location).

Each directory (actually: directory fragment) is a single object in
the metadata pool. They are indexed by inode number. Root is always
inode 1 and can be used as a starting point for finding any other
directory (since the file system hierarchy is a tree). (Note: some
special directories exist outside the file system tree, like the stray
directories.)

The value in the omap Robert refers to is the binary encoded inode. It
will include the inode number, file layout (!) [1], and size. All
three of these pieces of information are necessary to find a file's
data or write new data.

> The MDS can return the inode and size

and file layout*

> to the client and the client looks up the OSDs for the inode using the CRUSH 
> map and dividing the size by the stripe size to know how many objects to 
> fetch for the whole object.

The file layout and the inode number determine where a particular
block can be found. This is all encoded in the name of the object
within the data pool.
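
To make that concrete, a hedged sketch of locating a file's first object and
asking the cluster where it lives (the data pool name and the path are
placeholders for your own):

# ino=$(printf '%x' $(stat -c %i /mnt/cephfs/somefile))
# rados -p cephfs_data stat ${ino}.00000000
# ceph osd map cephfs_data ${ino}.00000000

"ceph osd map <pool> <object>" prints the PG and the acting set of OSDs for
that object name, which is exactly the CRUSH step in question.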

[1] https://docs.ceph.com/docs/master/cephfs/file-layouts/

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


[ceph-users] latency on OSD

2019-08-19 Thread Davis Mendoza Paco
Hi all,
I have installed Ceph Luminous, with 5 nodes (45 OSDs):

* 5 ceph-osd nodes
  network: LACP bond, 10 Gb
  RAM: 96 GB
  HD: 9 SATA 3 TB disks (BlueStore)

I wanted to ask for help with fixing the OSD latency shown by "ceph osd perf".

What would you recommend?


My config is:

/etc/ceph/ceph.conf

[global]
fsid = 414507dd-8a16-4548-86b7-906b0c9905e1
mon_initial_members = controller01,controller02,controller03
mon_host = 192.168.13.11,192.168.13.12,192.168.13.13
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

public network = 192.168.13.0/24
cluster network = 192.168.10.0/24

osd_pool_default_pg_num = 1024
osd_pool_default_pgp_num = 1024
osd_pool_default_flag_hashpspool = true

[osd]
osd_scrub_begin_hour = 22
osd_scrub_end_hour = 6


---
ceph osd perf

osd  commit_latency(ms)  apply_latency(ms)
  0                  49                 49
  1                 120                120
  2                  36                 36
  3                  65                 65
  4                  19                 19
  5                  57                 57
  6                 112                112
  7                  53                 53
  8                 159                159
  9                 226                226
 10                  21                 21
 11                  79                 79
 12                  50                 50
 13                 133                133
 14                 105                105
 15                  65                 65
 16                  32                 32
 17                  64                 64
 18                  62                 62
 19                  78                 78
 20                  71                 71
 21                  97                 97
 22                 168                168
 23                 108                108
 24                 119                119
 25                 219                219
 26                 144                144
 27                  26                 26
 28                  76                 76
 29                 176                176
 30                  23                 23
 31                  91                 91
 32                  30                 30
 33                  64                 64
 34                  21                 21
 35                  73                 73
 36                 124                124
 37                  85                 85
 38                  39                 39
 39                  36                 36
 40                  27                 27
 41                  33                 33
 42                  49                 49
 43                  22                 22
 44                  44                 44


-- 
*Davis Mendoza P.*


Re: [ceph-users] latency on OSD

2019-08-19 Thread Vitaliy Filippov

We recommend you use SSDs.
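
If that means moving the BlueStore DB/WAL onto flash, a minimal sketch of how
such an OSD would be created (device paths are placeholders, adjust to your
hardware):

# ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1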




--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] RESOLVED: Sudden loss of all SSD OSDs in a cluster, immedaite abort on restart [Mimic 13.2.6]

2019-08-19 Thread Troy Ablan
While I'm still unsure how this happened, this is what was done to solve 
this.


I started the OSD in the foreground with debug 10 and watched for the most
recent osdmap epoch mentioned before the abort().  For example, if it
mentioned that it had just tried to load epoch 80896 and then crashed:


# ceph osd getmap -o osdmap.80896 80896
# ceph-objectstore-tool --op set-osdmap --data-path 
/var/lib/ceph/osd/ceph-77/ --file osdmap.80896


Then I restarted the osd in foreground/debug, and repeated for the next 
osdmap epoch until it got past the first few seconds.  This process 
worked for all but two OSDs.  For the ones that succeeded I'd ^C and 
then start the osd via systemd


For the remaining two, it would try loading the incremental map and then 
crash.  I had presence of mind to make dd images of every OSD before 
starting this process, so I reverted these two to the state before 
injecting the osdmaps.


I then injected the last 15 or so epochs of the osdmap in sequential 
order before starting the osd, with success.
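
For anyone hitting the same thing, that per-epoch injection boils down to
something like this sketch (the epoch range and data path are taken from the
example above; adjust both, and keep the OSD stopped while injecting):

for e in $(seq 80896 80910); do
    ceph osd getmap -o osdmap.$e $e
    ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-77/ --file osdmap.$e
done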


This leads me to believe that the step-wise injection didn't work 
because the osd had more subtle corruption that it got past, but it was 
confused when it requested the next incremental delta.


Thanks again to Brad/badone for the guidance!

Tracker issue updated.

Here's the closing IRC dialogue re this issue (UTC-0700)

2019-08-19 16:27:42 < MooingLemur> badone: I appreciate you reaching out 
yesterday, you've helped a ton, twice now :)  I'm still concerned 
because I don't know how this happened.  I'll feel better once 
everything's active+clean, but it's all at least active.


2019-08-19 16:30:28 < badone> MooingLemur: I had a quick discussion with 
Josh earlier and he shares my opinion this is likely somehow related to 
these drives or perhaps controllers, or at least specific to these machines


2019-08-19 16:31:04 < badone> however, there is a possibility you are 
seeing some extremely rare race that no one up to this point has seen before


2019-08-19 16:31:20 < badone> that is less likely though

2019-08-19 16:32:50 < badone> the osd read the osdmap over the wire 
successfully but wrote it out to disk in a format that it could not then 
read back in (unlikely) or...


2019-08-19 16:33:21 < badone> the map "changed" after it had been 
written to disk


2019-08-19 16:33:46 < badone> the second is considered most likely by us 
but I recognise you may not share that opinion



Re: [ceph-users] How RBD tcp connection works

2019-08-19 Thread fengyd
Hi,

I checked the FD information with the command "ls -l /proc/25977/fd"  //
here 25977 is the QEMU process.
I found that the creation timestamp of the FD was not changed, but the
socket to which the FD was linked was changed.
So I guess the FD is reused when establishing a new TCP connection.

[image: image.png]

On Tue, 20 Aug 2019 at 04:11, Eliza  wrote:

> Hi,
>
> on 2019/8/19 16:10, fengyd wrote:
> > I think when reading/writing to volume/image, tcp connection needs to be
> > established which needs FD, then the FD count may increase.
> > But after reading/writing, why the FD count doesn't descrease?
>
> The tcp may be long connections.


Re: [ceph-users] How RBD tcp connection works

2019-08-19 Thread Eliza


on 2019/8/20 9:54, fengyd wrote:
I checked the FD information with the command "ls -l /proc/25977/fd"  //
here 25977 is the QEMU process.
I found that the creation timestamp of the FD was not changed, but the
socket to which the FD was linked was changed.

So I guess the FD is reused when establishing a new TCP connection.


I also see a lot of TCP connections from hosts with block devices mounted
to Ceph's backend.


regards.


[ceph-users] Re: How does CephFS find a file?

2019-08-19 Thread 青鸟 千秋
Thank you very much! I understand it now.

From: Patrick Donnelly
Sent: 20 August 2019 4:35
To: Robert LeBlanc
Cc: aot...@outlook.com; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How does CephFS find a file?




Re: [ceph-users] How RBD tcp connection works

2019-08-19 Thread fengyd
Hi,

1. Create a VM and a volume, and attach the volume to the VM.
   Check the FD count with lsof: the FD count is increased by 10.
2. Fill the volume with the dd command on the VM.
   Check the FD count with lsof: the FD count increases dramatically and
becomes stable after it has increased by 48 (48 is the exact number of
OSDs).

If the creation timestamp of the FD is not changed, but the socket to which
the FD is linked is changed, it means a new TCP connection is established.
If there's no reading/writing ongoing, why are new TCP connections still
being established while the FD count stays stable?
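
A quick way to see what those sockets actually point at (a sketch; 25977 is
the QEMU pid from earlier, substitute your own):

# sudo lsof -nP -p 25977 -a -i TCP
# sudo ss -tnp | grep 25977

Comparing the remote addresses against "ceph osd dump | grep '^osd\.'" should
show roughly one established connection per OSD the client has talked to,
plus a few to the monitors.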

Br.
Yafeng

On Tue, 20 Aug 2019 at 10:07, Eliza  wrote:

>
> on 2019/8/20 9:54, fengyd wrote:
> > I checked the FD information with the command "ls -l /proc/25977/fd"  //
> > here 25977 is Qemu process.
> > I found that the creation timestamp of  the FD was not changed, but the
> > socket information to which the FD was linked was changed.
> > So, I guess the FD is reused when establishing new tcp connection.
>
> I alomost got a lot of tcp connections from host mounted with block
> devices to ceph's backend.
>
> regards.
>


Re: [ceph-users] How RBD tcp connection works

2019-08-19 Thread Eliza

Hi

on 2019/8/20 10:30, fengyd wrote:
If the creation timestamp of the FD is not changed, but the socket to which
the FD is linked is changed, it means a new TCP connection is established.
If there's no reading/writing ongoing, why are new TCP connections still
being established while the FD count stays stable?


Though I am just a Ceph user, not an expert, I think each block device
client is involved in the CRUSH algorithm for data placement, rebalancing
etc., so long-lived connections between the client and the OSDs are kept.


regards.


Re: [ceph-users] How RBD tcp connection works

2019-08-19 Thread fengyd
Hi,

Do long connections mean that a new TCP connection to the same targets is
re-established after a timeout?


On Tue, 20 Aug 2019 at 10:37, Eliza  wrote:

> Hi
>
> on 2019/8/20 10:30, fengyd wrote:
> > If the creation timestamp of  the FD is not changed, but the socket
> > information to which the FD was linked is changed, it means new tcp
> > connection is established.
> > If there's no reading/wring ongoing,  why new tcp connection is still
> > established and the FD count is stable?
>
> Though I am just a ceph user not the expert, but I think each block
> device as the client who is involved into CRUSH algorithm for data
> rebalancing etc, so long connections between client and OSDs are kept.
>
> regards.
>


Re: [ceph-users] How RBD tcp connection works

2019-08-19 Thread Eliza




on 2019/8/20 10:57, fengyd wrote:
Do long connections mean that a new TCP connection to the same targets
is re-established after a timeout?


Yes, once it has timed out, it reconnects.


Re: [ceph-users] How RBD tcp connection works

2019-08-19 Thread fengyd
Hi,

I think you're right.

thanks.

Br.
Yafeng

On Tue, 20 Aug 2019 at 10:59, Eliza  wrote:

>
>
> on 2019/8/20 10:57, fengyd wrote:
> > Long connections means new tcp connection which connect the same targets
> > is reestablished after timeout?
>
> yes, once timeouted, then reconnecting.
>


Re: [ceph-users] How RBD tcp connection works

2019-08-19 Thread Eliza

Hi

on 2019/8/20 11:00, fengyd wrote:

I think you're right.


I am not so sure about it. But I think the Ceph client always wants to know
the cluster's topology, so it needs to communicate with the cluster all the
time. The big difference between Ceph and other distributed storage systems
is that the clients participate in the cluster's calculations.


I think you know Chinese? I just googled this one:
http://blog.dnsbed.com/?p=1685

regards.


Re: [ceph-users] How RBD tcp connection works

2019-08-19 Thread fengyd
Hi,

Thanks

Br.
Yafeng

On Tue, 20 Aug 2019 at 11:14, Eliza  wrote:

> Hi
>
> on 2019/8/20 11:00, fengyd wrote:
> > I think you're right.
>
> I am not so sure about it. But I think ceph client always wants to know
> the cluster's topology, so it needs to communicate with cluster all the
> time. The big difference for ceph to other distributed storage is
> clients participate into cluster's calculations.
>
> I think you know Chinese? just googled out this one:
> http://blog.dnsbed.com/?p=1685
>
> regards.
>


Re: [ceph-users] MDSs report damaged metadata

2019-08-19 Thread Lars Täuber
Hi there!

Does anyone else have an idea what I could do to get rid of this error?

BTW: this is the third time that pg 20.0 has gone inconsistent.
This is a PG from the metadata pool (CephFS).
Might this be related somehow?

# ceph health detail
HEALTH_ERR 1 MDSs report damaged metadata; 1 scrub errors; Possible data 
damage: 1 pg inconsistent
MDS_DAMAGE 1 MDSs report damaged metadata
mdsmds3(mds.0): Metadata damage detected
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 20.0 is active+clean+inconsistent, acting [9,27,15]


Best regards,
Lars


Mon, 19 Aug 2019 13:51:59 +0200
Lars Täuber  ==> Paul Emmerich  :
> Hi Paul,
> 
> thanks for the hint.
> 
> I did a recursive scrub from "/". The log says there where some inodes with 
> bad backtraces repaired. But the error remains.
> May this have something to do with a deleted file? Or a file within a 
> snapshot?
> 
> The path told by
> 
> # ceph tell mds.mds3 damage ls
> 2019-08-19 13:43:04.608 7f563f7f6700  0 client.894552 ms_handle_reset on 
> v2:192.168.16.23:6800/176704036
> 2019-08-19 13:43:04.624 7f56407f8700  0 client.894558 ms_handle_reset on 
> v2:192.168.16.23:6800/176704036
> [
> {
> "damage_type": "backtrace",
> "id": 3760765989,
> "ino": 1099518115802,
> "path": "~mds0/stray7/15161f7/dovecot.index.backup"
> }
> ]
> 
> starts a bit strange to me.
> 
> Are the snapshots also repaired with a recursive repair operation?
> 
> Thanks
> Lars
> 
> 
> Mon, 19 Aug 2019 13:30:53 +0200
> Paul Emmerich  ==> Lars Täuber  :
> > Hi,
> > 
> > that error just says that the path is wrong. I unfortunately don't
> > know the correct way to instruct it to scrub a stray path off the top
> > of my head; you can always run a recursive scrub on / to go over
> > everything, though
> > 
> > 
> > Paul
> >   


Re: [ceph-users] SOLVED - MDSs report damaged metadata

2019-08-19 Thread Lars Täuber
Hi all!

I solved this situation by restarting the active MDS. The next MDS took
over and the error was gone.
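
For reference, a clean way to trigger that failover without touching the host
is something like this (a sketch, assuming mds3 was the active daemon):

# ceph mds fail mds3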

This is a somewhat strange situation, similar to the one when restarting
primary OSDs that have scrub errors on their PGs.

Maybe this should be researched a bit deeper.

Thanks all for this great storage solution!

Cheers,
Lars


Tue, 20 Aug 2019 07:30:11 +0200
Lars Täuber  ==> ceph-users@lists.ceph.com :
> Hi there!
> 
> Does anyone else have an idea what I could do to get rid of this error?
> 
> BTW: it is the third time that the pg 20.0 is gone inconsistent.
> This is a pg from the metadata pool (cephfs).
> May this be related anyhow?
> 
> # ceph health detail
> HEALTH_ERR 1 MDSs report damaged metadata; 1 scrub errors; Possible data 
> damage: 1 pg inconsistent
> MDS_DAMAGE 1 MDSs report damaged metadata
> mdsmds3(mds.0): Metadata damage detected
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 20.0 is active+clean+inconsistent, acting [9,27,15]
> 
> 
> Best regards,
> Lars
> 
> 
> Mon, 19 Aug 2019 13:51:59 +0200
> Lars Täuber  ==> Paul Emmerich  :
> > Hi Paul,
> > 
> > thanks for the hint.
> > 
> > I did a recursive scrub from "/". The log says there where some inodes with 
> > bad backtraces repaired. But the error remains.
> > May this have something to do with a deleted file? Or a file within a 
> > snapshot?
> > 
> > The path told by
> > 
> > # ceph tell mds.mds3 damage ls
> > 2019-08-19 13:43:04.608 7f563f7f6700  0 client.894552 ms_handle_reset on 
> > v2:192.168.16.23:6800/176704036
> > 2019-08-19 13:43:04.624 7f56407f8700  0 client.894558 ms_handle_reset on 
> > v2:192.168.16.23:6800/176704036
> > [
> > {
> > "damage_type": "backtrace",
> > "id": 3760765989,
> > "ino": 1099518115802,
> > "path": "~mds0/stray7/15161f7/dovecot.index.backup"
> > }
> > ]
> > 
> > starts a bit strange to me.
> > 
> > Are the snapshots also repaired with a recursive repair operation?
> > 
> > Thanks
> > Lars
> > 
> > 
> > Mon, 19 Aug 2019 13:30:53 +0200
> > Paul Emmerich  ==> Lars Täuber  :  
> > > Hi,
> > > 
> > > that error just says that the path is wrong. I unfortunately don't
> > > know the correct way to instruct it to scrub a stray path off the top
> > > of my head; you can always run a recursive scrub on / to go over
> > > everything, though
> > > 
> > > 
> > > Paul
> > > 


-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de