[ceph-users] Why newly added OSD need to get all historical OSDMAPs in pre-boot

2020-08-01 Thread Xiaoxi Chen
Hi List,

 We see a newly added OSD take a long time to become active; it is
fetching old osdmaps from the monitors. Curious why it needs that much
history?  The cluster is stable, no OSD down/out except the ongoing
capacity add (one OSD at a time).


{
"cluster_fsid": "e959f744-64be-4f4e-9606-103ccaa9d2d6",
"osd_fsid": "0a2d4b9d-f6e5-4e40-be7f-16daee131a20",
"whoami": 38,
"state": "preboot",
"oldest_map": 193222,
"newest_map": 298982,
"num_pgs": 0
}

root@host:~# ceph osd getmap | head -5
got osdmap epoch 324825
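
(A rough reading of the numbers above, assuming oldest_map/newest_map are the
epoch range this OSD currently holds: it has already caught up on
298982 - 193222 = 105,760 historical epochs, and with the cluster at epoch
324825 it still has roughly 324825 - 298982 = 25,843 epochs to fetch before it
can leave preboot.)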


-Xiaoxi


[ceph-users] Re: [ANN] A framework for deploying Octopus using cephadm in the cloud

2020-08-01 Thread Patrick Donnelly
On Sat, Aug 1, 2020 at 5:50 AM Marc Roos  wrote:
> I can understand the benefits of having a CO; I am still testing with
> Mesos. However, what is the benefit of having Ceph daemons running in a CO
> environment?

As I said in my original post, I use this for testing CephFS.

There's no reason why you can't also use Ceph as a storage system in
the cloud though. It just may not be as cost effective as the native
cloud storage offering (like S3).

> Except for your mds, mgr and radosgw, your osd daemons are
> bound to the hardware / disks they are running on. It is not as if, when
> osd.121 goes down, you can start it on some random node.

Why not?

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


[ceph-users] which exact decimal value is meant here for S64_MIN in CRUSH Mapper

2020-08-01 Thread Bobby
Hi,

In mapper.c of Ceph's CRUSH code, I am trying to understand the definition
of the Linux macro `S64_MIN` used in the following `else` branch,
i.e. `draw = S64_MIN`.

Which exact decimal value is meant here for `S64_MIN`?

```
        if (weights[i]) {
            u = hash(bucket->h.hash, x, ids[i], r);
            u &= 0xffff;
            ln = crush_ln(u) - 0x1000000000000ll;

            draw = div64_s64(ln, weights[i]);
        } else {
            draw = S64_MIN;
            // #define S64_MAX ((s64)(U64_MAX >> 1))
            // #define S64_MIN ((s64)(-S64_MAX - 1))
        }

        if (i == 0 || draw > high_draw) {
            high = i;
            high_draw = draw;
        }
    }
    return bucket->h.items[high];
}
```
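
For reference, with the definitions quoted above S64_MIN is the smallest
signed 64-bit value, i.e. -2^63 = -9223372036854775808 (and
S64_MAX = 2^63 - 1 = 9223372036854775807); straw2 assigns it to zero-weight
items so they can never produce the highest draw. Below is a minimal,
self-contained C sketch (not taken from mapper.c; the typedefs stand in for
the kernel's s64/u64 types) showing what the macros evaluate to:

```
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Stand-ins for the kernel typedefs the macros rely on. */
typedef int64_t  s64;
typedef uint64_t u64;

#define U64_MAX ((u64)~0ULL)
#define S64_MAX ((s64)(U64_MAX >> 1))
#define S64_MIN ((s64)(-S64_MAX - 1))

int main(void)
{
    /* Prints:
     *   S64_MAX =  9223372036854775807   (2^63 - 1)
     *   S64_MIN = -9223372036854775808   (-2^63)
     */
    printf("S64_MAX = %" PRId64 "\n", (int64_t)S64_MAX);
    printf("S64_MIN = %" PRId64 "\n", (int64_t)S64_MIN);
    return 0;
}
```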


Thanks

Bobby !


[ceph-users] Re: [ANN] A framework for deploying Octopus using cephadm in the cloud

2020-08-01 Thread Marc Roos
 

I can understand the benefits of having a CO; I am still testing with 
Mesos. However, what is the benefit of having Ceph daemons running in a CO 
environment? Except for your mds, mgr and radosgw, your osd daemons are 
bound to the hardware / disks they are running on. It is not as if, when 
osd.121 goes down, you can start it on some random node.







[ceph-users] Re: [ANN] A framework for deploying Octopus using cephadm in the cloud

2020-08-01 Thread Marc Roos
 

>> Full disclosure: I have no relationship with Linode except as a customer.

So why mention them? I often have abuse coming from Linode, so I would 
rather see Linode have 0 clients.


[ceph-users] Re: mimic: much more raw used than reported

2020-08-01 Thread Igor Fedotov

Hi Frank,

On 7/31/2020 10:31 AM, Frank Schilder wrote:

Hi Igor,

thanks. I guess the problem with finding the corresponding images is that it 
happens at the bluestore level and not at the object level. Even if I listed all 
rados objects and added up their sizes, I would not see the excess storage.

Thinking about working around this issue: would re-writing the objects deflate 
the excess usage? For example, evacuating an OSD and adding it back to the pool 
after it is empty, would this re-write the objects on this OSD without the 
overhead?

Maybe, but I can't say for sure...


Or simply copying an entire RBD image, would the copy be deflated?

Although the latter options sound a bit crazy, one could do this without (much) 
downtime of VMs and it might get us through this migration.


Also, you might want to try PG export/import using ceph-objectstore-tool. 
See https://ceph.io/geen-categorie/incomplete-pgs-oh-my/ for some hints 
on how to do that.


But again, I'm not certain it's helpful. Preferably try it on a 
non-production cluster first...
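
(For what it's worth, and assuming the workflow from that post still applies to
a mimic/bluestore OSD: the tool is run against a stopped OSD, roughly
`ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --pgid <pgid>
--op export --file <dump>` on the source and the corresponding `--op import`
on the target.)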




Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 30 July 2020 15:40
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mimic: much more raw used than reported

Hi Frank,

On 7/30/2020 11:19 AM, Frank Schilder wrote:

Hi Igor,

thanks for looking at this. Here a few thoughts:

The copy goes to NTFS. I would expect between 2 and 4 metadata operations per 
write, which would go to a few existing objects. I guess the difference 
bluestore_write_small - bluestore_write_small_new is mostly such writes, which are 
susceptible to the partial-overwrite amplification. A first question is how 
many objects are actually affected; 3 small writes do not mean 3 
objects have partial overwrites.

The large number of small_new is indeed strange, although these would not lead 
to excess allocations. It is possible that the write size of the copy tool is 
not ideal; I was wondering about this too. I will investigate.

small_new might relate to small tailing chunks that presumably appear
when doing unaligned appends. Each such append triggers a small_new write...



To know more, I would need to find out which images these small writes come 
from; we have more than one active. Is there a low-level way to find out which 
objects are affected by partial overwrites and which image they belong to? In 
your post you described some properties, like being shared/cloned etc. Can 
one search for such objects?

IMO raising debug bluestore to 10 (or even 20) and subsequently inspecting
the OSD log is likely the only means to learn which objects the OSD is
processing... Be careful - this produces a significant amount of data and
negatively impacts performance.
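
(On a running OSD this can presumably be done with something like
`ceph tell osd.<id> injectargs '--debug_bluestore 10/10'`, reverting to the
previous level once enough log has been collected.)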

On a more fundamental level, I'm wondering why RBD images issue sub-object-size 
writes at all. I naively assumed that every I/O operation to RBD always implies 
a full object write, even when changing just a single byte (thinking of an object as 
the equivalent of a sector on a disk, the smallest atomic unit). If this is not 
the case, what is the meaning of object size then? How does it influence I/O 
patterns? My benchmarks show that object size matters a lot, but it becomes a 
bit unclear now why.

Not sure I can provide a good enough answer to the above. But I doubt that
RBD unconditionally operates on full objects.



Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 29 July 2020 16:25:36
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mimic: much more raw used than reported

Frank,

so you have a pretty high amount of small writes indeed. More than half
of the written volume (in bytes) is done via small writes.

And 6x more small requests.


This looks pretty odd for a sequential write pattern and is likely to be
the root cause of that space overhead.

I can see approx. 1.4GB additionally lost on each of these 3 OSDs since
the perf dump reset ( = allocated_new - stored_new - (allocated_old -
stored_old)).
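
(Presumably these correspond to the bluestore_allocated and bluestore_stored
values reported by `ceph daemon osd.<id> perf dump`, sampled before and after
the counter reset.)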

Below are some speculations on what might be happening, but for sure I
could be wrong/missing something. So please do not consider this a
100% valid analysis.

The client does writes in 1MB chunks. These are split into 6 EC chunks (+2
added), which results in an approx. 170K write block to the object store ( =
1MB / 6). That corresponds to 1x128K big write and 1x42K small tailing
one, resulting in 3x64K allocations.
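
(Rough arithmetic, assuming a 64K allocation unit: 1MB / 6 ≈ 170.7K per shard,
written as a 128K chunk plus a ~42K tail; the 128K chunk occupies 2x64K
allocation units and the 42K tail is rounded up to another full 64K unit, i.e.
3x64K = 192K allocated for ~170K of data.)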

The next adjacent client write results in another 128K blob, one more
"small" tailing blob, and a heading blob which partially overlaps with the
previous tailing 42K chunk. Overlapped chunks are expected to be merged,
but presumably this doesn't happen due to that "partial EC overwrites"
issue. So instead an additional 64K blob is allocated for the overlapped range.

I.e. 2x170K writes cause 2x128K blobs, 1x64K tailing blo