[ceph-users] Re: poor cephFS performance on Nautilus 14.2.9 deployed by ceph_ansible

2020-06-09 Thread Marc Roos


Hi Derrick, 

I am not sure where this 200-300 MB/s on HDD comes from, but it is probably 
not really relevant. I test native disk performance before using drives 
with Ceph, using the fio script below. It is a bit lengthy because I want 
to have data for possible future use cases. 

Furthermore, since I upgraded to Nautilus I have been having issues with 
kernel-mounting CephFS on the OSD nodes and had to revert back to FUSE, 
even with 88 GB of free memory.

https://tracker.ceph.com/issues/45663
https://tracker.ceph.com/issues/44100


[global]
ioengine=libaio
#ioengine=posixaio
invalidate=1
ramp_time=30
iodepth=1
runtime=180
time_based
direct=1
filename=/dev/sdX
#filename=/mnt/cephfs/ssd/fio-bench.img

[write-4k-seq]
stonewall
bs=4k
rw=write
#write_bw_log=sdx-4k-write-seq.results
#write_iops_log=sdx-4k-write-seq.results

[randwrite-4k-seq]
stonewall
bs=4k
rw=randwrite
#write_bw_log=sdx-4k-randwrite-seq.results
#write_iops_log=sdx-4k-randwrite-seq.results

[read-4k-seq]
stonewall
bs=4k
rw=read
#write_bw_log=sdx-4k-read-seq.results
#write_iops_log=sdx-4k-read-seq.results

[randread-4k-seq]
stonewall
bs=4k
rw=randread
#write_bw_log=sdx-4k-randread-seq.results
#write_iops_log=sdx-4k-randread-seq.results

[rw-4k-seq]
stonewall
bs=4k
rw=rw
#write_bw_log=sdx-4k-rw-seq.results
#write_iops_log=sdx-4k-rw-seq.results

[randrw-4k-seq]
stonewall
bs=4k
rw=randrw
#write_bw_log=sdx-4k-randrw-seq.results
#write_iops_log=sdx-4k-randrw-seq.results

[write-128k-seq]
stonewall
bs=128k
rw=write
#write_bw_log=sdx-128k-write-seq.results
#write_iops_log=sdx-128k-write-seq.results

[randwrite-128k-seq]
stonewall
bs=128k
rw=randwrite
#write_bw_log=sdx-128k-randwrite-seq.results
#write_iops_log=sdx-128k-randwrite-seq.results

[read-128k-seq]
stonewall
bs=128k
rw=read
#write_bw_log=sdx-128k-read-seq.results
#write_iops_log=sdx-128k-read-seq.results

[randread-128k-seq]
stonewall
bs=128k
rw=randread
#write_bw_log=sdx-128k-randread-seq.results
#write_iops_log=sdx-128k-randread-seq.results

[rw-128k-seq]
stonewall
bs=128k
rw=rw
#write_bw_log=sdx-128k-rw-seq.results
#write_iops_log=sdx-128k-rw-seq.results

[randrw-128k-seq]
stonewall
bs=128k
rw=randrw
#write_bw_log=sdx-128k-randrw-seq.results
#write_iops_log=sdx-128k-randrw-seq.results

[write-1024k-seq]
stonewall
bs=1024k
rw=write
#write_bw_log=sdx-1024k-write-seq.results
#write_iops_log=sdx-1024k-write-seq.results

[randwrite-1024k-seq]
stonewall
bs=1024k
rw=randwrite
#write_bw_log=sdx-1024k-randwrite-seq.results
#write_iops_log=sdx-1024k-randwrite-seq.results

[read-1024k-seq]
stonewall
bs=1024k
rw=read
#write_bw_log=sdx-1024k-read-seq.results
#write_iops_log=sdx-1024k-read-seq.results

[randread-1024k-seq]
stonewall
bs=1024k
rw=randread
#write_bw_log=sdx-1024k-randread-seq.results
#write_iops_log=sdx-1024k-randread-seq.results

[rw-1024k-seq]
stonewall
bs=1024k
rw=rw
#write_bw_log=sdx-1024k-rw-seq.results
#write_iops_log=sdx-1024k-rw-seq.results

[randrw-1024k-seq]
stonewall
bs=1024k
rw=randrw
#write_bw_log=sdx-1024k-randrw-seq.results
#write_iops_log=sdx-1024k-randrw-seq.results

[write-4096k-seq]
stonewall
bs=4096k
rw=write
#write_bw_log=sdx-4096k-write-seq.results
#write_iops_log=sdx-4096k-write-seq.results

[randwrite-4096k-seq]
stonewall
bs=4096k
rw=randwrite
#write_bw_log=sdx-4096k-randwrite-seq.results
#write_iops_log=sdx-4096k-randwrite-seq.results

[read-4096k-seq]
stonewall
bs=4096k
rw=read
#write_bw_log=sdx-4096k-read-seq.results
#write_iops_log=sdx-4096k-read-seq.results

[randread-4096k-seq]
stonewall
bs=4096k
rw=randread
#write_bw_log=sdx-4096k-randread-seq.results
#write_iops_log=sdx-4096k-randread-seq.results

[rw-4096k-seq]
stonewall
bs=4096k
rw=rw
#write_bw_log=sdx-4096k-rw-seq.results
#write_iops_log=sdx-4096k-rw-seq.results

[randrw-4096k-seq]
stonewall
bs=4096k
rw=randrw
#write_bw_log=sdx-4096k-randrw-seq.results
#write_iops_log=sdx-4096k-randrw-seq.results
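
For reference, a minimal way to run the job file above, assuming it is saved 
as disk-bench.fio (a hypothetical name). Edit the filename= line to point at 
the device or file under test first; running the write jobs against a raw 
device is destructive.

fio disk-bench.fio
# or run a single job section for a quick check
fio --section=randwrite-4k-seq disk-bench.fio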
 





-Original Message-
From: Derrick Lin [mailto:klin...@gmail.com] 
Sent: dinsdag 9 juni 2020 4:12
To: Mark Nelson
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: poor cephFS performance on Nautilus 14.2.9 
deployed by ceph_ansible

Thanks Mark & Marc

We will do more testing, including the kernel client, as well as testing 
the block storage performance first.

We just did some direct raw performance tests on a single spinning disk 
(formatted as ext4) and it could deliver 200-300 MB/s of throughput in 
various write and mixed-workload tests, but the FUSE client could only 
give ~50 MB/s.
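
For comparison, a minimal sketch of mounting the same CephFS with the FUSE 
client and with the kernel client (monitor address, credentials and mount 
points are placeholders):

ceph-fuse --id admin /mnt/cephfs-fuse
mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs-kernel -o name=admin,secretfile=/etc/ceph/admin.secret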

Cheers,
D

On Thu, Jun 4, 2020 at 1:27 PM Mark Nelson  wrote:

> Try using the kernel client instead of the FUSE client.  The FUSE 
> client is known to be slow for a variety of reasons and I suspect you 
> may see faster performance with the kernel client.
>
>
> Thanks,
>
> Mark
>
>
> On 6/2/20 8:00 PM, Derrick Lin wrote:
> > Hi guys,
> >
> > We just deployed a CEPH 14.2.9 cluster with the following hardware:
> >
> > MDSS x 1
> > Xeon Gold 5122 3.6Ghz
> > 192GB
> > Mellanox ConnectX-4 Lx 25GbE
> >
> >
> > MON x 3
> > Xeon Bronze 31

[ceph-users] Re: rbd-mirror sync image continuously or only sync once

2020-06-09 Thread Zhenshi Zhou
What's more, I configured "rbd_journal_max_payload_bytes = 8388608" on
clusterA and "rbd_mirror_journal_max_fetch_bytes = 33554432" on clusterB as
well, restarting the monitors of clusterA and rbd-mirror on clusterB.
Nothing changed; the target RBD image still has 11 minutes less data than
that on clusterA. I tried both image mode and pool mode.
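
For what it's worth, both of these are client-side options (librbd on the
writing side, the rbd-mirror daemon on the receiving side), so they have to
reach those processes; restarting only the monitors would not normally apply
them. A hedged sketch using the centralized config (Nautilus and later),
followed by restarting the writing client and the rbd-mirror daemon:

# on clusterA (librbd journal writer)
ceph config set client rbd_journal_max_payload_bytes 8388608
# on clusterB (rbd-mirror daemon)
ceph config set client rbd_mirror_journal_max_fetch_bytes 33554432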

On Tue, Jun 9, 2020 at 11:41 AM, Zhenshi Zhou  wrote:

> I have just done a test on rbd-mirror. Follow the steps:
> 1. deploy two new clusters, clusterA and clusterB
> 2. configure one-way replication from clusterA to clusterB with rbd-mirror
> 3. write data to rbd_blk on clusterA once every 5 seconds
> 4. get information with 'rbd mirror image status rbd_blk', the "state" is
> "up+replaying"
> 5. demote image on clusterA(just wanna stop syncing and switch client
> connection to clusterB)
>
> The result is:
> 1. I find that the "last_update" from "rbd mirror image status" updates
> every 30 seconds, which means I will lose at most 30s of data
> 2. I stopped syncing on 11:02, while the data from rbd_blk on the clusterB
> is not newer than 10:50.
>
> Did I have the wrong steps in the switching progress?
>
>
>
> Zhenshi Zhou  于2020年6月9日周二 上午8:57写道:
>
>> Well, I'm afraid that the image didn't replay continuously, which means I
>> have some data lost.
>> The "rbd mirror image status" shows the image is replayed and its time is
>> just before I demote
>> the primary image. I lost about 24 hours' data and I'm not sure whether
>> there is an interval
>> between the synchronization.
>>
>> I use version 14.2.9 and I deployed a one direction mirror.
>>
>> Zhenshi Zhou  于2020年6月5日周五 上午10:22写道:
>>
>>> Thank you for the clarification. That's very clear.
>>>
>>> Jason Dillaman  于2020年6月5日周五 上午12:46写道:
>>>
 On Thu, Jun 4, 2020 at 3:43 AM Zhenshi Zhou 
 wrote:
 >
 > My condition is that the primary image being used while rbd-mirror
 sync.
 > I want to get the period between the two times of rbd-mirror transfer
 the
 > increased data.
 > I will search those options you provided, thanks a lot :)

 When using the original (pre-Octopus) journal-based mirroring, once
 the initial sync completes to transfer the bulk of the image data from
 a point-in-time dynamic snapshot, any changes post sync will be
 replayed continuously from the stream of events written to the journal
 on the primary image. The "rbd mirror image status" against the
 non-primary image will provide more details about the current state of
 the journal replay.

 With the Octopus release, we now also support snapshot-based mirroring
 where we transfer any image deltas between two mirroring snapshots.
 These mirroring snapshots are different from user-created snapshots
 and their life-time is managed by RBD mirroring (i.e. they are
 automatically pruned when no longer needed). This version of mirroring
 probably more closely relates to your line of questioning since the
 period of replication is at whatever period you create new mirroring
 snapshots (provided your two clusters can keep up).

 >
 > Eugen Block  于2020年6月4日周四 下午3:28写道:
 >
 > > The initial sync is a full image sync, the rest is based on the
 object
 > > sets created. There are several options to control the mirroring,
 for
 > > example:
 > >
 > > rbd_journal_max_concurrent_object_sets
 > > rbd_mirror_concurrent_image_syncs
 > > rbd_mirror_leader_max_missed_heartbeats
 > >
 > > and many more. I'm not sure I fully understand what you're asking,
 > > maybe you could rephrase your question?
 > >
 > >
 > > Zitat von Zhenshi Zhou :
 > >
 > > > Hi Eugen,
 > > >
 > > > Thanks for the reply. If rbd-mirror constantly synchronize
 changes,
 > > > what frequency to replay once? I don't find any options I can
 config.
 > > >
 > > > Eugen Block  于2020年6月4日周四 下午2:54写道:
 > > >
 > > >> Hi,
 > > >>
 > > >> that's the point of rbd-mirror, to constantly replay changes
 from the
 > > >> primary image to the remote image (if the rbd journal feature is
 > > >> enabled).
 > > >>
 > > >>
 > > >> Zitat von Zhenshi Zhou :
 > > >>
 > > >> > Hi all,
 > > >> >
 > > >> > I'm gonna deploy a rbd-mirror in order to sync image from
 clusterA to
 > > >> > clusterB.
 > > >> > The image will be used while syncing. I'm not sure if the
 rbd-mirror
 > > will
 > > >> > sync image
 > > >> > continuously or not. If not, I will inform clients not to
 write data
 > > in
 > > >> it.
 > > >> >
 > > >> > Thanks. Regards
 > > >> > ___
 > > >> > ceph-users mailing list -- ceph-users@ceph.io
 > > >> > To unsubscribe send an email to ceph-users-le...@ceph.io
 > > >>
 > > >>
 > > >> ___
 > > >> ceph-users

[ceph-users] RadosGW latency on chunked uploads

2020-06-09 Thread Tadas
Hello,

I have a strange issue with radosgw:
When trying to PUT an object with “Transfer-Encoding: chunked”, I see high 
request latencies.
When PUTting the same object non-chunked, latency is much lower and the 
request/s performance is better as well.
Has anyone had the same issue?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rebalancing after modifying CRUSH map

2020-06-09 Thread Frank Schilder
This is done automatically. Every time the crush map changes, objects get moved 
around.

Therefore, a typical procedure is

- make sure ceph is HEALTH_OK
- ceph osd set noout
- ceph osd set norebalance
- edit crush map
- wait for peering to finish, all PGs must be active+clean
- lots of PGs will also be re-mapped
- ceph osd unset norebalance
- ceph osd unset noout

Before doing the last 2 steps, verify that no PGs are incomplete and no objects 
are degraded. Otherwise, fix first.
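
A minimal command sketch of the steps above, assuming the CRUSH map is edited 
offline with crushtool (file names are placeholders):

ceph osd set noout
ceph osd set norebalance
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt, then recompile and inject it
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
# wait for peering; all PGs must be active+clean (many will show as remapped)
ceph status
ceph osd unset norebalance
ceph osd unset noout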

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Brett Randall 
Sent: 09 June 2020 07:42:33
To: ceph-users@ceph.io
Subject: [ceph-users] Rebalancing after modifying CRUSH map

Hi all


We are looking at implementing Ceph/CephFS for a project. Over time, we may 
wish to add additional replicas to our cluster. If we modify a CRUSH map, is 
there a way of then requesting Ceph to re-evaluate the placement of objects 
across the cluster according to the modified CRUSH map?

Brett
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd-mirror sync image continuously or only sync once

2020-06-09 Thread Jason Dillaman
On Mon, Jun 8, 2020 at 11:42 PM Zhenshi Zhou  wrote:
>
> I have just done a test on rbd-mirror. Follow the steps:
> 1. deploy two new clusters, clusterA and clusterB
> 2. configure one-way replication from clusterA to clusterB with rbd-mirror
> 3. write data to rbd_blk on clusterA once every 5 seconds
> 4. get information with 'rbd mirror image status rbd_blk', the "state" is 
> "up+replaying"
> 5. demote image on clusterA(just wanna stop syncing and switch client 
> connection to clusterB)
>
> The result is:
> 1. I find that the "last_update" from "rbd mirror image status" updates every 
> 30 seconds, which means I will lose at most 30s of data

The status updates are throttled but not the data transfer.

> 2. I stopped syncing on 11:02, while the data from rbd_blk on the clusterB is 
> not newer than 10:50.

After demotion, you need to promote on the original non-primary side
(without the --force option). It won't let you do a non-force promote unless
it has all the data copied. It will continue to copy data while the other
side is demoted until it fully catches up.
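
For reference, a minimal sketch of that failover sequence with the standard
rbd CLI (the pool/image spec is a placeholder):

# clusterA: stop writers, then demote
rbd mirror image demote rbd/rbd_blk
# clusterB: wait for the replay to catch up, then promote (no --force)
rbd mirror image status rbd/rbd_blk
rbd mirror image promote rbd/rbd_blk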

>
> Did I have the wrong steps in the switching progress?
>
>
>
> Zhenshi Zhou  于2020年6月9日周二 上午8:57写道:
>>
>> Well, I'm afraid that the image didn't replay continuously, which means I 
>> have some data lost.
>> The "rbd mirror image status" shows the image is replayed and its time is 
>> just before I demote
>> the primary image. I lost about 24 hours' data and I'm not sure whether 
>> there is an interval
>> between the synchronization.
>>
>> I use version 14.2.9 and I deployed a one direction mirror.
>>
>> Zhenshi Zhou  于2020年6月5日周五 上午10:22写道:
>>>
>>> Thank you for the clarification. That's very clear.
>>>
>>> Jason Dillaman  于2020年6月5日周五 上午12:46写道:

 On Thu, Jun 4, 2020 at 3:43 AM Zhenshi Zhou  wrote:
 >
 > My condition is that the primary image being used while rbd-mirror sync.
 > I want to get the period between the two times of rbd-mirror transfer the
 > increased data.
 > I will search those options you provided, thanks a lot :)

 When using the original (pre-Octopus) journal-based mirroring, once
 the initial sync completes to transfer the bulk of the image data from
 a point-in-time dynamic snapshot, any changes post sync will be
 replayed continuously from the stream of events written to the journal
 on the primary image. The "rbd mirror image status" against the
 non-primary image will provide more details about the current state of
 the journal replay.

 With the Octopus release, we now also support snapshot-based mirroring
 where we transfer any image deltas between two mirroring snapshots.
 These mirroring snapshots are different from user-created snapshots
 and their life-time is managed by RBD mirroring (i.e. they are
 automatically pruned when no longer needed). This version of mirroring
 probably more closely relates to your line of questioning since the
 period of replication is at whatever period you create new mirroring
 snapshots (provided your two clusters can keep up).

 >
 > Eugen Block  于2020年6月4日周四 下午3:28写道:
 >
 > > The initial sync is a full image sync, the rest is based on the object
 > > sets created. There are several options to control the mirroring, for
 > > example:
 > >
 > > rbd_journal_max_concurrent_object_sets
 > > rbd_mirror_concurrent_image_syncs
 > > rbd_mirror_leader_max_missed_heartbeats
 > >
 > > and many more. I'm not sure I fully understand what you're asking,
 > > maybe you could rephrase your question?
 > >
 > >
 > > Zitat von Zhenshi Zhou :
 > >
 > > > Hi Eugen,
 > > >
 > > > Thanks for the reply. If rbd-mirror constantly synchronize changes,
 > > > what frequency to replay once? I don't find any options I can config.
 > > >
 > > > Eugen Block  于2020年6月4日周四 下午2:54写道:
 > > >
 > > >> Hi,
 > > >>
 > > >> that's the point of rbd-mirror, to constantly replay changes from 
 > > >> the
 > > >> primary image to the remote image (if the rbd journal feature is
 > > >> enabled).
 > > >>
 > > >>
 > > >> Zitat von Zhenshi Zhou :
 > > >>
 > > >> > Hi all,
 > > >> >
 > > >> > I'm gonna deploy a rbd-mirror in order to sync image from 
 > > >> > clusterA to
 > > >> > clusterB.
 > > >> > The image will be used while syncing. I'm not sure if the 
 > > >> > rbd-mirror
 > > will
 > > >> > sync image
 > > >> > continuously or not. If not, I will inform clients not to write 
 > > >> > data
 > > in
 > > >> it.
 > > >> >
 > > >> > Thanks. Regards
 > > >> > ___
 > > >> > ceph-users mailing list -- ceph-users@ceph.io
 > > >> > To unsubscribe send an email to ceph-users-le...@ceph.io
 > > >>
 > > >>
 > > >> ___
 > > >> 

[ceph-users] Re: rbd-mirror with snapshot, not doing any actaul data sync

2020-06-09 Thread Jason Dillaman
On Mon, Jun 8, 2020 at 6:18 PM Hans van den Bogert  wrote:
>
> Rather unsatisfactory not to know where it really went wrong, but after 
> completely removing all traces of peer settings and auth keys and redoing 
> the peer bootstrap, I did end up with a working sync.
>
> My initial mirror config stemmed from Nautilus and was configured for 
> journaling on a pool. Perhaps transitioning to an image-based snapshot 
> config has some problems? But that's just guessing.

Perhaps -- snapshot-based mirroring is not supported on a pool
configured for journal-based mirroring.
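
A hedged sketch of that transition (pool and image names are placeholders;
disabling pool-level mirroring affects every image in the pool, and the peers
typically need to be re-bootstrapped, as you did):

rbd mirror pool disable replicapool
rbd mirror pool enable replicapool image
rbd mirror image enable replicapool/myimage snapshot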

> Thanks for the follow-up though!
>
> Regards,
>
> Hans
>
> On Mon, Jun 8, 2020, 13:38 Jason Dillaman  wrote:
>>
>> On Sun, Jun 7, 2020 at 8:06 AM Hans van den Bogert  
>> wrote:
>> >
>> > Hi list,
>> >
>> > I've awaited octopus for a along time to be able to use mirror with
>> > snapshotting, since my setup does not allow for journal based
>> > mirroring. (K8s/Rook 1.3.x with ceph 15.2.2)
>> >
>> > However, I seem to be stuck, i've come to the point where on the
>> > cluster on which the (non-active) replicas should reside I get this:
>> >
>> > ```
>> > rbd mirror pool status -p replicapool --verbose
>> >
>> > ...
>> > pvc-f7ca0b55-ed38-4d9f-b306-7db6a0157e2e:
>> >   global_id:   d3a301f2-4f54-4e9e-b251-c55ddbb67dc6
>> >   state:   up+starting_replay
>> >   description: starting replay
>> >   service: a on nldw1-6-26-1
>> >   last_update: 2020-06-07 11:54:54
>> > ...
>> > ```
>> >
>> > That seems good, right? But I don't see any actual data being copied
>> > into the failover cluster.
>> >
>> > Anybody any ideas what to check?
>>
>> Can you look at the log files for "rbd-mirror" daemon? I wonder if it
>> starts and quickly fails.
>>
>> > Also, is it correct, you won't see mirror snapshots with the 'normal'
>> > `rbd snap` commands?
>>
>> Yes, "rbd snap ls" only shows user-created snapshots by default. You
>> can use "rbd snap ls --all" to see all snapshots on an image.
>>
>> > Thanks in advance,
>> >
>> > Hans
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>> >
>>
>>
>> --
>> Jason
>>


-- 
Jason
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd-mirror sync image continuously or only sync once

2020-06-09 Thread Zhenshi Zhou
I did promote the non-primary image; otherwise I couldn't have disabled mirroring for the image.

On Tue, Jun 9, 2020 at 7:19 PM, Jason Dillaman  wrote:

> On Mon, Jun 8, 2020 at 11:42 PM Zhenshi Zhou  wrote:
> >
> > I have just done a test on rbd-mirror. Follow the steps:
> > 1. deploy two new clusters, clusterA and clusterB
> > 2. configure one-way replication from clusterA to clusterB with
> rbd-mirror
> > 3. write data to rbd_blk on clusterA once every 5 seconds
> > 4. get information with 'rbd mirror image status rbd_blk', the "state"
> is "up+replaying"
> > 5. demote image on clusterA(just wanna stop syncing and switch client
> connection to clusterB)
> >
> > The result is:
> > 1. I find that the "last_update" from "rbd mirror image status" updates
> every 30 seconds, which means I will lose at most 30s of data
>
> The status updates are throttled but not the data transfer.
>
> > 2. I stopped syncing on 11:02, while the data from rbd_blk on the
> clusterB is not newer than 10:50.
>
> After demotion, you need to promote on the original non-primary side
> (w/o the --force) option. It won't let you non-force promote unless it
> has all the data copied. It will continue to copy data while the other
> side has been demoted until it fully catches up.
>
> >
> > Did I have the wrong steps in the switching progress?
> >
> >
> >
> > Zhenshi Zhou  于2020年6月9日周二 上午8:57写道:
> >>
> >> Well, I'm afraid that the image didn't replay continuously, which means
> I have some data lost.
> >> The "rbd mirror image status" shows the image is replayed and its time
> is just before I demote
> >> the primary image. I lost about 24 hours' data and I'm not sure whether
> there is an interval
> >> between the synchronization.
> >>
> >> I use version 14.2.9 and I deployed a one direction mirror.
> >>
> >> Zhenshi Zhou  于2020年6月5日周五 上午10:22写道:
> >>>
> >>> Thank you for the clarification. That's very clear.
> >>>
> >>> Jason Dillaman  于2020年6月5日周五 上午12:46写道:
> 
>  On Thu, Jun 4, 2020 at 3:43 AM Zhenshi Zhou 
> wrote:
>  >
>  > My condition is that the primary image being used while rbd-mirror
> sync.
>  > I want to get the period between the two times of rbd-mirror
> transfer the
>  > increased data.
>  > I will search those options you provided, thanks a lot :)
> 
>  When using the original (pre-Octopus) journal-based mirroring, once
>  the initial sync completes to transfer the bulk of the image data from
>  a point-in-time dynamic snapshot, any changes post sync will be
>  replayed continuously from the stream of events written to the journal
>  on the primary image. The "rbd mirror image status" against the
>  non-primary image will provide more details about the current state of
>  the journal replay.
> 
>  With the Octopus release, we now also support snapshot-based mirroring
>  where we transfer any image deltas between two mirroring snapshots.
>  These mirroring snapshots are different from user-created snapshots
>  and their life-time is managed by RBD mirroring (i.e. they are
>  automatically pruned when no longer needed). This version of mirroring
>  probably more closely relates to your line of questioning since the
>  period of replication is at whatever period you create new mirroring
>  snapshots (provided your two clusters can keep up).
> 
>  >
>  > Eugen Block  于2020年6月4日周四 下午3:28写道:
>  >
>  > > The initial sync is a full image sync, the rest is based on the
> object
>  > > sets created. There are several options to control the mirroring,
> for
>  > > example:
>  > >
>  > > rbd_journal_max_concurrent_object_sets
>  > > rbd_mirror_concurrent_image_syncs
>  > > rbd_mirror_leader_max_missed_heartbeats
>  > >
>  > > and many more. I'm not sure I fully understand what you're asking,
>  > > maybe you could rephrase your question?
>  > >
>  > >
>  > > Zitat von Zhenshi Zhou :
>  > >
>  > > > Hi Eugen,
>  > > >
>  > > > Thanks for the reply. If rbd-mirror constantly synchronize
> changes,
>  > > > what frequency to replay once? I don't find any options I can
> config.
>  > > >
>  > > > Eugen Block  于2020年6月4日周四 下午2:54写道:
>  > > >
>  > > >> Hi,
>  > > >>
>  > > >> that's the point of rbd-mirror, to constantly replay changes
> from the
>  > > >> primary image to the remote image (if the rbd journal feature
> is
>  > > >> enabled).
>  > > >>
>  > > >>
>  > > >> Zitat von Zhenshi Zhou :
>  > > >>
>  > > >> > Hi all,
>  > > >> >
>  > > >> > I'm gonna deploy a rbd-mirror in order to sync image from
> clusterA to
>  > > >> > clusterB.
>  > > >> > The image will be used while syncing. I'm not sure if the
> rbd-mirror
>  > > will
>  > > >> > sync image
>  > > >> > continuously or not. If not, I will inform clients not to
> write data
>  > > in
>  > > >> it.
>  > > >> >
>  > > >> > Thanks. Regar

[ceph-users] Re: rbd-mirror sync image continuously or only sync once

2020-06-09 Thread Jason Dillaman
On Tue, Jun 9, 2020 at 7:26 AM Zhenshi Zhou  wrote:
>
> I did promote the non-primary image, or I couldn't disable the image mirror.

OK, that means that 100% of the data was properly transferred, since it
needs to replay previous events before it can get to the demotion event and
replay that, before you could non-force promote. How are you writing to the
original primary image? Are you flushing your data?
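
For instance (a minimal sketch; the mount point on the client is
hypothetical), a write that is guaranteed to be flushed to the RBD image
before demoting would look like:

date >> /mnt/rbd_blk/marker.txt
sync    # or fsync()/O_SYNC writes from the application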

> Jason Dillaman  于2020年6月9日周二 下午7:19写道:
>>
>> On Mon, Jun 8, 2020 at 11:42 PM Zhenshi Zhou  wrote:
>> >
>> > I have just done a test on rbd-mirror. Follow the steps:
>> > 1. deploy two new clusters, clusterA and clusterB
>> > 2. configure one-way replication from clusterA to clusterB with rbd-mirror
>> > 3. write data to rbd_blk on clusterA once every 5 seconds
>> > 4. get information with 'rbd mirror image status rbd_blk', the "state" is 
>> > "up+replaying"
>> > 5. demote image on clusterA(just wanna stop syncing and switch client 
>> > connection to clusterB)
>> >
>> > The result is:
>> > 1. I find that the "last_update" from "rbd mirror image status" updates 
>> > every 30 seconds, which means I will lose at most 30s of data
>>
>> The status updates are throttled but not the data transfer.
>>
>> > 2. I stopped syncing on 11:02, while the data from rbd_blk on the clusterB 
>> > is not newer than 10:50.
>>
>> After demotion, you need to promote on the original non-primary side
>> (w/o the --force) option. It won't let you non-force promote unless it
>> has all the data copied. It will continue to copy data while the other
>> side has been demoted until it fully catches up.
>>
>> >
>> > Did I have the wrong steps in the switching progress?
>> >
>> >
>> >
>> > Zhenshi Zhou  于2020年6月9日周二 上午8:57写道:
>> >>
>> >> Well, I'm afraid that the image didn't replay continuously, which means I 
>> >> have some data lost.
>> >> The "rbd mirror image status" shows the image is replayed and its time is 
>> >> just before I demote
>> >> the primary image. I lost about 24 hours' data and I'm not sure whether 
>> >> there is an interval
>> >> between the synchronization.
>> >>
>> >> I use version 14.2.9 and I deployed a one direction mirror.
>> >>
>> >> Zhenshi Zhou  于2020年6月5日周五 上午10:22写道:
>> >>>
>> >>> Thank you for the clarification. That's very clear.
>> >>>
>> >>> Jason Dillaman  于2020年6月5日周五 上午12:46写道:
>> 
>>  On Thu, Jun 4, 2020 at 3:43 AM Zhenshi Zhou  wrote:
>>  >
>>  > My condition is that the primary image being used while rbd-mirror 
>>  > sync.
>>  > I want to get the period between the two times of rbd-mirror transfer 
>>  > the
>>  > increased data.
>>  > I will search those options you provided, thanks a lot :)
>> 
>>  When using the original (pre-Octopus) journal-based mirroring, once
>>  the initial sync completes to transfer the bulk of the image data from
>>  a point-in-time dynamic snapshot, any changes post sync will be
>>  replayed continuously from the stream of events written to the journal
>>  on the primary image. The "rbd mirror image status" against the
>>  non-primary image will provide more details about the current state of
>>  the journal replay.
>> 
>>  With the Octopus release, we now also support snapshot-based mirroring
>>  where we transfer any image deltas between two mirroring snapshots.
>>  These mirroring snapshots are different from user-created snapshots
>>  and their life-time is managed by RBD mirroring (i.e. they are
>>  automatically pruned when no longer needed). This version of mirroring
>>  probably more closely relates to your line of questioning since the
>>  period of replication is at whatever period you create new mirroring
>>  snapshots (provided your two clusters can keep up).
>> 
>>  >
>>  > Eugen Block  于2020年6月4日周四 下午3:28写道:
>>  >
>>  > > The initial sync is a full image sync, the rest is based on the 
>>  > > object
>>  > > sets created. There are several options to control the mirroring, 
>>  > > for
>>  > > example:
>>  > >
>>  > > rbd_journal_max_concurrent_object_sets
>>  > > rbd_mirror_concurrent_image_syncs
>>  > > rbd_mirror_leader_max_missed_heartbeats
>>  > >
>>  > > and many more. I'm not sure I fully understand what you're asking,
>>  > > maybe you could rephrase your question?
>>  > >
>>  > >
>>  > > Zitat von Zhenshi Zhou :
>>  > >
>>  > > > Hi Eugen,
>>  > > >
>>  > > > Thanks for the reply. If rbd-mirror constantly synchronize 
>>  > > > changes,
>>  > > > what frequency to replay once? I don't find any options I can 
>>  > > > config.
>>  > > >
>>  > > > Eugen Block  于2020年6月4日周四 下午2:54写道:
>>  > > >
>>  > > >> Hi,
>>  > > >>
>>  > > >> that's the point of rbd-mirror, to constantly replay changes 
>>  > > >> from the
>>  > > >> primary image to the remote image (if the rbd journal feature is
>>  > > >> enable

[ceph-users] Re: rbd-mirror sync image continuously or only sync once

2020-06-09 Thread Zhenshi Zhou
It reported an error when I first promoted the non-primary image, but the
command executed successfully after a while, without '--force'.

error:
"rbd: error promoting image to primary2020-06-09 19:56:30.662 7f27e17fa700
-1 librbd::mirror::PromoteRequest: 0x558fa971fd20 handle_get_info: image is
still primary within a remote cluster
2020-06-09 19:56:30.662 7f2804362b00 -1 librbd::api::Mirror: image_promote:
failed to promote image"

Besides that, I wrote data by appending the output of `date` to the end of
a file, and flushed data with 'echo 3 > /proc/sys/vm/drop_caches'.

On Tue, Jun 9, 2020 at 7:48 PM, Jason Dillaman  wrote:

> On Tue, Jun 9, 2020 at 7:26 AM Zhenshi Zhou  wrote:
> >
> > I did promote the non-primary image, or I couldn't disable the image
> mirror.
>
> OK, that means that 100% of the data was properly transferred since it
> needs to replay previous events before it can get to the demotion
> event, replay that, so that you could non-force promote. How are you
> writing to the original primary image? Are you flushing your data?
>
> > Jason Dillaman  于2020年6月9日周二 下午7:19写道:
> >>
> >> On Mon, Jun 8, 2020 at 11:42 PM Zhenshi Zhou 
> wrote:
> >> >
> >> > I have just done a test on rbd-mirror. Follow the steps:
> >> > 1. deploy two new clusters, clusterA and clusterB
> >> > 2. configure one-way replication from clusterA to clusterB with
> rbd-mirror
> >> > 3. write data to rbd_blk on clusterA once every 5 seconds
> >> > 4. get information with 'rbd mirror image status rbd_blk', the
> "state" is "up+replaying"
> >> > 5. demote image on clusterA(just wanna stop syncing and switch client
> connection to clusterB)
> >> >
> >> > The result is:
> >> > 1. I find that the "last_update" from "rbd mirror image status"
> updates every 30 seconds, which means I will lose at most 30s of data
> >>
> >> The status updates are throttled but not the data transfer.
> >>
> >> > 2. I stopped syncing on 11:02, while the data from rbd_blk on the
> clusterB is not newer than 10:50.
> >>
> >> After demotion, you need to promote on the original non-primary side
> >> (w/o the --force) option. It won't let you non-force promote unless it
> >> has all the data copied. It will continue to copy data while the other
> >> side has been demoted until it fully catches up.
> >>
> >> >
> >> > Did I have the wrong steps in the switching progress?
> >> >
> >> >
> >> >
> >> > Zhenshi Zhou  于2020年6月9日周二 上午8:57写道:
> >> >>
> >> >> Well, I'm afraid that the image didn't replay continuously, which
> means I have some data lost.
> >> >> The "rbd mirror image status" shows the image is replayed and its
> time is just before I demote
> >> >> the primary image. I lost about 24 hours' data and I'm not sure
> whether there is an interval
> >> >> between the synchronization.
> >> >>
> >> >> I use version 14.2.9 and I deployed a one direction mirror.
> >> >>
> >> >> Zhenshi Zhou  于2020年6月5日周五 上午10:22写道:
> >> >>>
> >> >>> Thank you for the clarification. That's very clear.
> >> >>>
> >> >>> Jason Dillaman  于2020年6月5日周五 上午12:46写道:
> >> 
> >>  On Thu, Jun 4, 2020 at 3:43 AM Zhenshi Zhou 
> wrote:
> >>  >
> >>  > My condition is that the primary image being used while
> rbd-mirror sync.
> >>  > I want to get the period between the two times of rbd-mirror
> transfer the
> >>  > increased data.
> >>  > I will search those options you provided, thanks a lot :)
> >> 
> >>  When using the original (pre-Octopus) journal-based mirroring, once
> >>  the initial sync completes to transfer the bulk of the image data
> from
> >>  a point-in-time dynamic snapshot, any changes post sync will be
> >>  replayed continuously from the stream of events written to the
> journal
> >>  on the primary image. The "rbd mirror image status" against the
> >>  non-primary image will provide more details about the current
> state of
> >>  the journal replay.
> >> 
> >>  With the Octopus release, we now also support snapshot-based
> mirroring
> >>  where we transfer any image deltas between two mirroring snapshots.
> >>  These mirroring snapshots are different from user-created snapshots
> >>  and their life-time is managed by RBD mirroring (i.e. they are
> >>  automatically pruned when no longer needed). This version of
> mirroring
> >>  probably more closely relates to your line of questioning since the
> >>  period of replication is at whatever period you create new
> mirroring
> >>  snapshots (provided your two clusters can keep up).
> >> 
> >>  >
> >>  > Eugen Block  于2020年6月4日周四 下午3:28写道:
> >>  >
> >>  > > The initial sync is a full image sync, the rest is based on
> the object
> >>  > > sets created. There are several options to control the
> mirroring, for
> >>  > > example:
> >>  > >
> >>  > > rbd_journal_max_concurrent_object_sets
> >>  > > rbd_mirror_concurrent_image_syncs
> >>  > > rbd_mirror_leader_max_missed_heartbeats
> >>  > >
> >>  > > and many more. I'

[ceph-users] IO500 Revised Call For Submissions Mid-2020 List

2020-06-09 Thread committee

New Deadline: 13 July 2020 AoE

NOTE: Given the short timeframe from the original announcement and 
complexities with some of the changes for this list, the deadline has 
been pushed out to give the community more time to participate. The BoF 
announcing the winners will be online 23 July 2020.


Announcement:
The IO500 [1] is now accepting and encouraging submissions for the
upcoming 6th IO500 list. Once again, we are also accepting submissions
to the 10 Node Challenge to encourage the submission of small scale
results. The new ranked lists will be announced via live-stream at a
virtual session. We hope to see many new results.

The benchmark suite is designed to be easy to run and the community has
multiple active support channels to help with any questions. Please note
that submissions of all sizes are welcome; the site has customizable
sorting so it is possible to submit on a small system and still get a
very good per-client score for example. Additionally, the list is about
much more than just the raw rank; all submissions help the community by
collecting and publishing a wider corpus of data. More details below.

Following the success of the Top500 in collecting and analyzing
historical trends in supercomputer technology and evolution, the IO500
[1] was created in 2017, published its first list at SC17, and has grown
exponentially since then. The need for such an initiative has long been
known within High-Performance Computing; however, defining appropriate
benchmarks had long been challenging. Despite this challenge, the
community, after long and spirited discussion, finally reached consensus
on a suite of benchmarks and a metric for resolving the scores into a
single ranking.

The multi-fold goals of the benchmark suite are as follows:

* Maximizing simplicity in running the benchmark suite
* Encouraging optimization and documentation of tuning parameters for
performance
* Allowing submitters to highlight their "hero run" performance numbers
* Forcing submitters to simultaneously report performance for
challenging IO patterns.

Specifically, the benchmark suite includes a hero-run of both IOR and
mdtest configured however possible to maximize performance and establish
an upper-bound for performance. It also includes an IOR and mdtest run
with highly constrained parameters forcing a difficult usage pattern in
an attempt to determine a lower-bound. Finally, it includes a namespace
search as this has been determined to be a highly sought-after feature
in HPC storage systems that has historically not been well-measured.
Submitters are encouraged to share their tuning insights for
publication.

The goals of the community are also multi-fold:

* Gather historical data for the sake of analysis and to aid
predictions of storage futures
* Collect tuning data to share valuable performance optimizations
across the community
* Encourage vendors and designers to optimize for workloads beyond
"hero runs"
* Establish bounded expectations for users, procurers, and
administrators

10 NODE I/O CHALLENGE

The 10 Node Challenge is conducted using the regular IO500 benchmark,
however, with the rule that exactly 10 client nodes must be used to run
the benchmark. You may use any shared storage with, e.g., any number of
servers. When submitting for the IO500 list, you can opt in to
"Participate in the 10 compute node challenge only", in which case we will
not include the results in the ranked list. Other 10-node submissions will
be included in the full list and in the ranked list. We will announce the
result in a separate derived list and in the full list, but not on the
ranked IO500 list at https://io500.org/.

This information and rules for ISC20 submissions are available here:
https://www.vi4io.org/io500/rules/submission

Thanks,

The IO500 Committee

Links:
--
[1] http://io500.org/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Maximum size of data in crush_choose_firstn Ceph CRUSH source code

2020-06-09 Thread Bobby
Hi all,

I have a question regarding a function called *crush_choose_firstn* in the
Ceph source code, namely in *mapper.c*. This function takes the following
pointer parameters:

- const struct crush_map *map,
- struct crush_work *work,
- const struct crush_bucket *bucket,
- int *out,
- const __u32 *weight,
- int *out2,
- const struct crush_choose_arg *choose_args

What is the maximum size of the data involved here? I mean, what is the
upper bound?

BR
Bobby
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RadosGW latency on chunked uploads

2020-06-09 Thread Robin H. Johnson
On Tue, Jun 09, 2020 at 12:59:10PM +0300, Tadas wrote:
> Hello,
> 
> I have strange issues with radosgw:
> When trying to PUT object with “transfer-encoding: chunked”, I can see high 
> request latencies.
> When trying to PUT the same object as non-chunked – latency is much lower, 
> and also request/s performance is better.
> Perhaps anyone had the same issue?
What is your latency to the RGW?

There's one downside to chunked encoding that I observed with CivetWeb
when I implemented chunked transfer encoding for the Bucket Listing.

Specifically, CivetWeb did not stuff the socket with all available
content, and instead only trickled out entries, waiting for each TCP
window ACK before the next segment was sent.

If the Bucket Listing took a long time to complete within RGW, the
time to the first results was hugely improved, but the time for the full
response MAY be worse if the latency was high, due to having more back &
forth in TCP ACKs.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RadosGW latency on chunked uploads

2020-06-09 Thread Tadas

Hello,
We see around 75-100 ms while doing 600 chunked PUTs,
versus 40-45 ms while doing 1k normal PUTs.
(Even the number of PUTs we can achieve drops with chunked PUTs.)

We tried civetweb and beast. Nothing changes.

-Original Message- 
From: Robin H. Johnson 
Sent: Tuesday, June 9, 2020 8:51 PM 
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: RadosGW latency on chunked uploads 


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RadosGW latency on chunked uploads

2020-06-09 Thread Robin H. Johnson
On Tue, Jun 09, 2020 at 09:07:49PM +0300, Tadas wrote:
> Hello,
> we face like 75-100 ms while doing 600 chunked PUT's.
> while 40-45ms while doing 1k normal PUT's.
> (Even amount of PUT's lowers on chunked PUT way)
> 
> We tried civetweb and beast. Nothing changes.
How close is your test running to the RGWs?

Does it get noticeably worse if you inject artificial latency into the network?

E.g.
https://bencane.com/2012/07/16/tc-adding-simulated-network-latency-to-your-linux-server/
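
For example (the interface name eth0 is an assumption):

tc qdisc add dev eth0 root netem delay 50ms
# and to remove it again after the test:
tc qdisc del dev eth0 root netem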

If you can run the test without SSL, then tcpdump should let you see if
your client is trying to stuff the max amount of data into the pipeline
or waiting for an ACK each time.
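
A minimal capture sketch, assuming the default radosgw frontend port 7480 and
plain HTTP:

tcpdump -i eth0 -nn -w chunked-put.pcap 'tcp port 7480'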

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mds behind on trimming - replay until memory exhausted

2020-06-09 Thread Francois Legrand

Hi,
Actually, I left the MDS managing the damaged filesystem as it is, because 
the files can still be read (despite the warnings and errors). I restarted 
the rsyncs to transfer everything to the new filesystem (thus onto 
different PGs, because it's a different CephFS with different pools), but 
without deleting the old files, to avoid definitively killing the old MDS 
and the old FS. The number of segments is now more or less stable (very 
high, ~123611, but not increasing too much).
I guess we will have enough space to copy the remaining data (it will be 
tight, but I think it will pass). Once everything has been transferred and 
checked, I will destroy the old FS and the damaged pool.

F.

On 09/06/2020 at 19:50, Frank Schilder wrote:

Looks like an answer to your other thread is taking its time.

Is it a possible option for you to

- copy all readable files using this PG to some other storage,
- remove or clean up the broken PG and
- copy the files back in?

This might lead to a healthy cluster. I don't know a proper procedure though. 
Somehow the ceph fs must play along as files using this will also use other PGs 
and get partly broken.

Have you found other options?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Francois Legrand
Sent: 08 June 2020 16:38:18
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory 
exhausted

I already had some discussion on the list about this problem. But I
should ask again.
We really lost some objects and there are not enought shards to
reconstruct them (it's an erasure coding data pool)... so it cannot be
fixed anymore and we know we have data loss ! I did not marked the PG
out because there are still some parts (objects) which are still present
and we hope to be able to copy them and save a few bytes more ! It would
be great to be able to flush only broken objects, but I don't know how
to do that, even if it's possible !
I thus run some cephfs-data-scan pg_files to identify the files with
data on this pg and the I run a grep -q -m 1 "." "/path_to_damaged_file"
to identify the ones which are really empty (we tested different way to
do this and it seems that's the fastest).
F.
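
A rough sketch of that procedure (the PG id, the scan root and the client
mount point are all placeholders here):

cephfs-data-scan pg_files / 6.1a > files_on_pg.txt
# check each listed file through a client mount; no match means no readable data
while read -r f; do
    grep -q -m 1 "." "/mnt/cephfs/$f" || echo "empty: $f"
done < files_on_pg.txt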


On 08/06/2020 at 16:07, Frank Schilder wrote:

OK, now we are talking. It is very well possible that trimming will not start 
until this operation is completed.

If there are enough shards/copies to recover the lost objects, you should try a 
pg repair first. If you did loose too many replicas, there are ways to flush 
this PG out of the system. You will loose data this way. I don't know how to 
repair or flush only broken objects out of a PG, but would hope that this is 
possible.

Before you do anything destructive, open a new thread in this list specifically 
for how to repair/remove this PG with the least possible damage.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Francois Legrand
Sent: 08 June 2020 16:00:28
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory 
exhausted

There is no recovery going on, but indeed we have a pg damaged (with
some lost objects due to a major crash few weeks ago)... and there are
some shards of this pg on osd 27 !
That's also why we are migrating all the data out of this FS !
It's certainly related and I guess that  it's trying to remove some
datas that are already lost and it get stuck ! I don't know if there is
a way to tell ceph to forget about these ops ! I guess no.
I thus think that there is not that much to do apart from reading as
much data as we can to save as much as possible.
F.

On 08/06/2020 at 15:48, Frank Schilder wrote:

That's strange. Maybe there is another problem. Do you have any other health 
warnings that might be related? Is there some recovery/rebalancing going on?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Francois Legrand
Sent: 08 June 2020 15:27:59
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory 
exhausted

Thanks again for the hint !
Indeed, I did a
ceph daemon  mds.lpnceph-mds02.in2p3.fr objecter_requests
and it seems that osd 27 is more or less stuck with op of age 34987.5
(while others osd have ages < 1).
I tryed a ceph osd down 27 which resulted in reseting the age but I can
notice that age for osd.27 ops is rising again.
I think I will restart it (btw our osd servers and mds are different
machines).
F.

On 08/06/2020 at 15:01, Frank Schilder wrote:

Hi Francois,

this sounds great. At least its operational. I guess it is still using a lot of 
swap while trying to replay operations.

I would disconnect cleanly all clients if you didn't do so already, even any 
read-only clients. Any extra load will just slo

[ceph-users] Octopus OSDs dropping out of cluster: _check_auth_rotating possible clock skew, rotating keys expired way too early

2020-06-09 Thread Wido den Hollander
Hi,

On a recently deployed Octopus (15.2.2) cluster (240 OSDs) we are seeing
OSDs randomly drop out of the cluster.

Usually it's 2 to 4 OSDs spread out over different nodes. Each node has
16 OSDs and not all the failing OSDs are on the same node.

The OSDs are marked as down, and all they keep printing in their logs is:

monclient: _check_auth_rotating possible clock skew, rotating keys
expired way too early (before 2020-06-04T07:57:17.706529-0400)

Looking at their status through the admin socket:

{
"cluster_fsid": "68653193-9b84-478d-bc39-1a811dd50836",
"osd_fsid": "87231b5d-ae5f-4901-93c5-18034381e5ec",
"whoami": 206,
"state": "active",
"oldest_map": 73697,
"newest_map": 75795,
"num_pgs": 19
}

The message brought me back to a ticket I created myself 2 years ago:
https://tracker.ceph.com/issues/23460

The first thing I checked was NTP/time, and I double- and triple-checked it.
All the clocks on the cluster are in sync; nothing wrong there.
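
For reference, the kind of check meant here (assuming chrony is used;
substitute the ntpq/ntpstat equivalents otherwise):

timedatectl | grep -i synchronized
chronyc tracking | grep 'System time'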

Again, it's not all the OSDs on a node failing. Just 1 or 2 dropping out.

Restarting them brings them back right away and then within 24h some
other OSDs will drop out.

Has anybody seen this behavior with Octopus as well?

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mds behind on trimming - replay until memory exhausted

2020-06-09 Thread Frank Schilder
Looks like an answer to your other thread is taking its time.

Is it a possible option for you to

- copy all readable files using this PG to some other storage,
- remove or clean up the broken PG and
- copy the files back in?

This might lead to a healthy cluster. I don't know a proper procedure though. 
Somehow the ceph fs must play along as files using this will also use other PGs 
and get partly broken.

Have you found other options?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Francois Legrand 
Sent: 08 June 2020 16:38:18
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory 
exhausted

I already had some discussion on the list about this problem. But I
should ask again.
We really lost some objects and there are not enought shards to
reconstruct them (it's an erasure coding data pool)... so it cannot be
fixed anymore and we know we have data loss ! I did not marked the PG
out because there are still some parts (objects) which are still present
and we hope to be able to copy them and save a few bytes more ! It would
be great to be able to flush only broken objects, but I don't know how
to do that, even if it's possible !
I thus run some cephfs-data-scan pg_files to identify the files with
data on this pg and the I run a grep -q -m 1 "." "/path_to_damaged_file"
to identify the ones which are really empty (we tested different way to
do this and it seems that's the fastest).
F.


On 08/06/2020 at 16:07, Frank Schilder wrote:
> OK, now we are talking. It is very well possible that trimming will not start 
> until this operation is completed.
>
> If there are enough shards/copies to recover the lost objects, you should try 
> a pg repair first. If you did loose too many replicas, there are ways to 
> flush this PG out of the system. You will loose data this way. I don't know 
> how to repair or flush only broken objects out of a PG, but would hope that 
> this is possible.
>
> Before you do anything destructive, open a new thread in this list 
> specifically for how to repair/remove this PG with the least possible damage.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Francois Legrand 
> Sent: 08 June 2020 16:00:28
> To: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory 
> exhausted
>
> There is no recovery going on, but indeed we have a pg damaged (with
> some lost objects due to a major crash few weeks ago)... and there are
> some shards of this pg on osd 27 !
> That's also why we are migrating all the data out of this FS !
> It's certainly related and I guess that  it's trying to remove some
> datas that are already lost and it get stuck ! I don't know if there is
> a way to tell ceph to forget about these ops ! I guess no.
> I thus think that there is not that much to do apart from reading as
> much data as we can to save as much as possible.
> F.
>
> Le 08/06/2020 à 15:48, Frank Schilder a écrit :
>> That's strange. Maybe there is another problem. Do you have any other health 
>> warnings that might be related? Is there some recovery/rebalancing going on?
>>
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: Francois Legrand 
>> Sent: 08 June 2020 15:27:59
>> To: Frank Schilder; ceph-users
>> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory 
>> exhausted
>>
>> Thanks again for the hint !
>> Indeed, I did a
>> ceph daemon  mds.lpnceph-mds02.in2p3.fr objecter_requests
>> and it seems that osd 27 is more or less stuck with op of age 34987.5
>> (while others osd have ages < 1).
>> I tryed a ceph osd down 27 which resulted in reseting the age but I can
>> notice that age for osd.27 ops is rising again.
>> I think I will restart it (btw our osd servers and mds are different
>> machines).
>> F.
>>
>> Le 08/06/2020 à 15:01, Frank Schilder a écrit :
>>> Hi Francois,
>>>
>>> this sounds great. At least its operational. I guess it is still using a 
>>> lot of swap while trying to replay operations.
>>>
>>> I would disconnect cleanly all clients if you didn't do so already, even 
>>> any read-only clients. Any extra load will just slow down recovery. My best 
>>> guess is, that the MDS is replaying some operations, which is very slow due 
>>> to swap. While doing so, the segments to trim will probably keep increasing 
>>> for a while until it can start trimming.
>>>
>>> The slow meta-data IO is an operation hanging in some OSD. You should check 
>>> which OSD it is (ceph health detail) and check if you can see the operation 
>>> in the OSDs OPS queue. I would expect this OSD to have a really long OPS 
>>> queue. I have seen meta-data operations hang for a long time. In case this 
>>> OSD runs on th

[ceph-users] Re: Rebalancing after modifying CRUSH map

2020-06-09 Thread Brett Randall
Thanks Janne, this is fantastic to know.

Brett

--- original message ---
On June 9, 2020, 4:21 PM GMT+10 icepic...@gmail.com wrote:


On Tue, 9 Jun 2020 at 07:43, Brett Randall  wrote:


>> Hi all

>> We are looking at implementing Ceph/CephFS for a project. Over time, we may 
>> wish to add additional replicas to our cluster. If we modify a CRUSH map, is 
>> there a way of then requesting Ceph to re-evaluate the placement of objects 
>> across the cluster according to the modified CRUSH map?





If you edit the crush map (including just adding a disk or a new host) then all 
the pools whose crushrules are affected by the change will more or less 
immediately try to move around data in order to fit the new crush map. If you 
ask the cluster to move in some impossible way, it will refuse and claim the 
PGs are misplaced/remapped but will still continue to serve data as usual, 
until you either make it possible to place after the new rules, or you revert 
the new rules.
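
For example, a dry-run sketch to preview how a modified map would place a few 
sample PGs before injecting it (rule id, replica count and sample range are 
placeholders):

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --show-mappings --rule 0 --num-rep 3 --min-x 0 --max-x 9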



-- 

May the most significant bit of your life be positive.
--- end of original message ---
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD Performance / High IOWaits.

2020-06-09 Thread Eugen Block

Hi,

IIUC those are standalone OSDs (no separate RocksDB on faster devices,
I assume).
Have you checked OSD saturation (e.g. with iostat) on the nodes? I'd expect
"slow requests" in the logs if the disks were saturated, but that would
still be my first guess.
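
For example, watching the extended device statistics on an OSD node every 5
seconds; %util close to 100% and growing await values point at saturated
disks:

iostat -x 5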



Zitat von jameslip...@protonmail.com:


Greetings,

I'm using ceph (14.2.2); in conjunction with Proxmox. Currently I'm  
just doing tests and ran into an issue relating to high I/O waits.  
Just to give a little bit of a background, specifically relating to  
my current ceph configurations; we have 6 nodes, each consisting of  
2 osds (each node has 2x Intel SSDSC2KG019T8) OSD Type is bluestore.  
Global configurations (at least as shown on the proxmox interface)  
is as follows:


[global]
 auth_client_required = 
 auth_cluster_required = 
 auth_service_required = 
 cluster_network = 10.125.0.0/24
 fsid = f64d2a67-98c3-4dbc-abfd-906ea7aaf314
 mon_allow_pool_delete = true
	 mon_host = 10.125.0.101 10.125.0.102 10.125.0.103 10.125.0.105  
10.125.0.106 10.125.0.104

 osd_pool_default_min_size = 2
 osd_pool_default_size = 3
 public_network = 10.125.0.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring



If I'm missing any relevant information relating to my ceph setup  
(I'm still learning this), please let me know.


Each node consists of 2x Xeon E5-2660 v3. Where I ran into high I/O  
waits is when running a VM. VM is a mysql replication server (using  
8 cores), and is performing mostly writes. When trying to narrow  
down, it was pointing to disk writes. The only thing I'm seeing in  
the ceph logs are the following:


2020-06-08 02:43:01.062082 mgr.node01 (mgr.2914449) 8009571 :  
cluster [DBG] pgmap v8009574: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 2.4 MiB/s wr, 274 op/s
2020-06-08 02:43:03.063137 mgr.node01 (mgr.2914449) 8009572 :  
cluster [DBG] pgmap v8009575: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 0 B/s rd, 3.0 MiB/s wr,  
380 op/s
2020-06-08 02:43:05.064125 mgr.node01 (mgr.2914449) 8009573 :  
cluster [DBG] pgmap v8009576: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 0 B/s rd, 2.9 MiB/s wr,  
332 op/s
2020-06-08 02:43:07.065373 mgr.node01 (mgr.2914449) 8009574 :  
cluster [DBG] pgmap v8009577: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 0 B/s rd, 2.7 MiB/s wr,  
313 op/s
2020-06-08 02:43:09.066210 mgr.node01 (mgr.2914449) 8009575 :  
cluster [DBG] pgmap v8009578: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 341 B/s rd, 2.9 MiB/s wr,  
350 op/s
2020-06-08 02:43:11.066913 mgr.node01 (mgr.2914449) 8009576 :  
cluster [DBG] pgmap v8009579: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 341 B/s rd, 3.1 MiB/s wr,  
346 op/s
2020-06-08 02:43:13.067926 mgr.node01 (mgr.2914449) 8009577 :  
cluster [DBG] pgmap v8009580: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 341 B/s rd, 3.5 MiB/s wr,  
408 op/s
2020-06-08 02:43:15.068834 mgr.node01 (mgr.2914449) 8009578 :  
cluster [DBG] pgmap v8009581: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 341 B/s rd, 3.0 MiB/s wr,  
320 op/s
2020-06-08 02:43:17.069627 mgr.node01 (mgr.2914449) 8009579 :  
cluster [DBG] pgmap v8009582: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 341 B/s rd, 2.5 MiB/s wr,  
285 op/s
2020-06-08 02:43:19.070507 mgr.node01 (mgr.2914449) 8009580 :  
cluster [DBG] pgmap v8009583: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 341 B/s rd, 3.0 MiB/s wr,  
349 op/s
2020-06-08 02:43:21.071241 mgr.node01 (mgr.2914449) 8009581 :  
cluster [DBG] pgmap v8009584: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 0 B/s rd, 2.8 MiB/s wr,  
319 op/s
2020-06-08 02:43:23.072286 mgr.node01 (mgr.2914449) 8009582 :  
cluster [DBG] pgmap v8009585: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 2.7 MiB/s wr, 329 op/s
2020-06-08 02:43:25.073369 mgr.node01 (mgr.2914449) 8009583 :  
cluster [DBG] pgmap v8009586: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 2.8 MiB/s wr, 304 op/s
2020-06-08 02:43:27.074315 mgr.node01 (mgr.2914449) 8009584 :  
cluster [DBG] pgmap v8009587: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 2.2 MiB/s wr, 262 op/s
2020-06-08 02:43:29.075284 mgr.node01 (mgr.2914449) 8009585 :  
cluster [DBG] pgmap v8009588: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 682 B/s rd, 2.9 MiB/s wr,  
342 op/s
2020-06-08 02:43:31.076180 mgr.node01 (mgr.2914449) 8009586 :  
cluster [DBG] pgmap v8009589: 512 pgs: 512 active+clean; 246 GiB  
data, 712 GiB used, 20 TiB / 21 TiB avail; 682 B/s rd,