Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Stefan Hajnoczi
On Tue, Nov 05, 2019 at 11:02:44AM +0100, Dietmar Maurer wrote:
> Example: Backup from ceph disk (rbd_cache=false) to local disk:
> 
> backup_calculate_cluster_size returns 64K (correct for my local .raw image)
> 
> Then the backup job starts to read 64K blocks from ceph.
> 
> But ceph always reads 4M blocks, so this is incredibly slow and produces
> way too much network traffic.
> 
> Why does backup_calculate_cluster_size not consider the block size of
> the source disk?
> 
> cluster_size = MAX(block_size_source, block_size_target)

CCing block maintainers so they see your email and you get a response
more quickly.

Stefan




Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Max Reitz
On 06.11.19 09:32, Stefan Hajnoczi wrote:
> On Tue, Nov 05, 2019 at 11:02:44AM +0100, Dietmar Maurer wrote:
>> Example: Backup from ceph disk (rbd_cache=false) to local disk:
>>
>> backup_calculate_cluster_size returns 64K (correct for my local .raw image)
>>
>> Then the backup job starts to read 64K blocks from ceph.
>>
>> But ceph always reads 4M blocks, so this is incredibly slow and produces
>> way too much network traffic.
>>
>> Why does backup_calculate_cluster_size not consider the block size of
>> the source disk?
>>
>> cluster_size = MAX(block_size_source, block_size_target)

So Ceph always transmits 4 MB over the network, no matter what is
actually needed?  That sounds, well, interesting.

backup_calculate_cluster_size() doesn’t consider the source’s cluster size because
to my knowledge there is no other medium that behaves this way.  So I
suppose the assumption was always that the block size of the source
doesn’t matter, because a partial read is always possible (without
having to read everything).
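
(For reference, the current behaviour boils down to roughly the following —
a simplified, standalone sketch paraphrasing block/backup.c from memory, not
the literal code: only the target is queried, with a 64 kB floor.)

/* Simplified standalone model of the current behaviour (an assumption
 * paraphrasing block/backup.c, not the literal code): the cluster size
 * is derived from the *target* only, floored at 64 KiB; the source is
 * never queried. */
#include <inttypes.h>
#include <stdio.h>

#define BACKUP_CLUSTER_SIZE_DEFAULT (1 << 16)   /* 64 KiB */

static int64_t current_cluster_size(int64_t target_cluster_size)
{
    /* Targets that report no cluster size (e.g. a raw file) fall back
     * to the 64 KiB default. */
    if (target_cluster_size <= 0) {
        return BACKUP_CLUSTER_SIZE_DEFAULT;
    }
    return target_cluster_size > BACKUP_CLUSTER_SIZE_DEFAULT
           ? target_cluster_size : BACKUP_CLUSTER_SIZE_DEFAULT;
}

int main(void)
{
    /* rbd source (4 MiB objects) -> local raw target: the job still
     * issues 64 KiB reads against ceph. */
    printf("%" PRId64 "\n", current_cluster_size(0));
    return 0;
}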


What would make sense to me is to increase the buffer size in general.
I don’t think we need to copy clusters at a time, and
0e2402452f1f2042923a5 has indeed increased the copy size to 1 MB for
backup writes that are triggered by guest writes.  We haven’t yet
increased the copy size for background writes, though.  We can do that,
of course.  (And probably should.)

The thing is, it just seems unnecessary to me to take the source cluster
size into account in general.  It seems weird that a medium only allows
4 MB reads, because, well, guests aren’t going to take that into account.

Max





Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Wolfgang Bumiller
On Wed, Nov 06, 2019 at 10:37:04AM +0100, Max Reitz wrote:
> On 06.11.19 09:32, Stefan Hajnoczi wrote:
> > On Tue, Nov 05, 2019 at 11:02:44AM +0100, Dietmar Maurer wrote:
> >> Example: Backup from ceph disk (rbd_cache=false) to local disk:
> >>
> >> backup_calculate_cluster_size returns 64K (correct for my local .raw image)
> >>
> >> Then the backup job starts to read 64K blocks from ceph.
> >>
> >> But ceph always reads 4M blocks, so this is incredibly slow and produces
> >> way too much network traffic.
> >>
> >> Why does backup_calculate_cluster_size not consider the block size of
> >> the source disk?
> >>
> >> cluster_size = MAX(block_size_source, block_size_target)
> 
> So Ceph always transmits 4 MB over the network, no matter what is
> actually needed?  That sounds, well, interesting.

Or at least it generates that much I/O - in the end, it can slow down
the backup by up to a multi-digit factor...

> backup_calculate_cluster_size() doesn’t consider the source’s cluster size because
> to my knowledge there is no other medium that behaves this way.  So I
> suppose the assumption was always that the block size of the source
> doesn’t matter, because a partial read is always possible (without
> having to read everything).

Unless you enable qemu-side caching, this only works as long as the
block/cluster size of the source does not exceed that of the target.

> What would make sense to me is to increase the buffer size in general.
> I don’t think we need to copy clusters at a time, and
> 0e2402452f1f2042923a5 has indeed increased the copy size to 1 MB for
> backup writes that are triggered by guest writes.  We haven’t yet
> increased the copy size for background writes, though.  We can do that,
> of course.  (And probably should.)
> 
> The thing is, it just seems unnecessary to me to take the source cluster
> size into account in general.  It seems weird that a medium only allows
> 4 MB reads, because, well, guests aren’t going to take that into account.

But guests usually have a page cache, which is why in many setups qemu
(and thereby the backup process) often doesn't use one.




Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Max Reitz
On 06.11.19 11:18, Dietmar Maurer wrote:
>> The thing is, it just seems unnecessary to me to take the source cluster
>> size into account in general.  It seems weird that a medium only allows
>> 4 MB reads, because, well, guests aren’t going to take that into account.
> 
> Maybe it is strange, but it is quite obvious that there is an optimal cluster
> size for each storage type (4M in case of ceph)...

Sure, but one can usually read sub-cluster ranges; at least, if
the cluster size is larger than 4 kB.  (For example, it’s perfectly fine
to read any bit of data from a qcow2 file with whatever cluster size it
has.  The same applies to filesystems.  The only limitation is what the
storage itself allows (with O_DIRECT), but that alignment is generally
not greater than 4 kB.)

As I said, I wonder how that even works when you attach such a volume to
a VM and let the guest read from it.  Surely it won’t issue just 4 MB
requests, so the network overhead must be tremendous?

Max





Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Max Reitz
On 06.11.19 11:34, Wolfgang Bumiller wrote:
> On Wed, Nov 06, 2019 at 10:37:04AM +0100, Max Reitz wrote:
>> On 06.11.19 09:32, Stefan Hajnoczi wrote:
>>> On Tue, Nov 05, 2019 at 11:02:44AM +0100, Dietmar Maurer wrote:
>>>> Example: Backup from ceph disk (rbd_cache=false) to local disk:
>>>>
>>>> backup_calculate_cluster_size returns 64K (correct for my local .raw image)
>>>>
>>>> Then the backup job starts to read 64K blocks from ceph.
>>>>
>>>> But ceph always reads 4M blocks, so this is incredibly slow and produces
>>>> way too much network traffic.
>>>>
>>>> Why does backup_calculate_cluster_size not consider the block size of
>>>> the source disk?
>>>>
>>>> cluster_size = MAX(block_size_source, block_size_target)
>>
>> So Ceph always transmits 4 MB over the network, no matter what is
>> actually needed?  That sounds, well, interesting.
> 
> Or at least it generates that much I/O - in the end, it can slow down
> the backup by up to a multi-digit factor...

Oh, so I understand ceph internally resolves the 4 MB block and then
transmits the subcluster range.  That makes sense.

>> backup_calculate_cluster_size() doesn’t consider the source’s cluster size because
>> to my knowledge there is no other medium that behaves this way.  So I
>> suppose the assumption was always that the block size of the source
>> doesn’t matter, because a partial read is always possible (without
>> having to read everything).
> 
> Unless you enable qemu-side caching, this only works as long as the
> block/cluster size of the source does not exceed that of the target.
> 
>> What would make sense to me is to increase the buffer size in general.
>> I don’t think we need to copy clusters at a time, and
>> 0e2402452f1f2042923a5 has indeed increased the copy size to 1 MB for
>> backup writes that are triggered by guest writes.  We haven’t yet
>> increased the copy size for background writes, though.  We can do that,
>> of course.  (And probably should.)
>>
>> The thing is, it just seems unnecessary to me to take the source cluster
>> size into account in general.  It seems weird that a medium only allows
>> 4 MB reads, because, well, guests aren’t going to take that into account.
> 
> But guests usually have a page cache, which is why in many setups qemu
> (and thereby the backup process) often doesn't use one.

But this still doesn’t make sense to me.  Linux doesn’t issue 4 MB
requests to pre-fill the page cache, does it?

And if it issues a smaller request, there is no way for a guest device
to tell it “OK, here’s your data, but note we have a whole 4 MB chunk
around it, maybe you’d like to take that as well...?”

I understand wanting to increase the backup buffer size, but I don’t
quite understand why we’d want it to increase to the source cluster size
when the guest also has no idea what the source cluster size is.

Max





Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Max Reitz
On 06.11.19 12:18, Dietmar Maurer wrote:
>> And if it issues a smaller request, there is no way for a guest device
>> to tell it “OK, here’s your data, but note we have a whole 4 MB chunk
>> around it, maybe you’d like to take that as well...?”
>>
>> I understand wanting to increase the backup buffer size, but I don’t
>> quite understand why we’d want it to increase to the source cluster size
>> when the guest also has no idea what the source cluster size is.
> 
> Because it is more efficient.

For rbd.

Max





Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Max Reitz
On 06.11.19 12:22, Max Reitz wrote:
> On 06.11.19 12:18, Dietmar Maurer wrote:
>>> And if it issues a smaller request, there is no way for a guest device
>>> to tell it “OK, here’s your data, but note we have a whole 4 MB chunk
>>> around it, maybe you’d like to take that as well...?”
>>>
>>> I understand wanting to increase the backup buffer size, but I don’t
>>> quite understand why we’d want it to increase to the source cluster size
>>> when the guest also has no idea what the source cluster size is.
>>
>> Because it is more efficient.
> 
> For rbd.

Let me elaborate: Yes, a cluster size generally means that it is most
“efficient” to access the storage at that size.  But there’s a tradeoff.
 At some point, reading the data takes sufficiently long that reading a
bit of metadata doesn’t matter anymore (usually, that is).

There is a bit of a problem with making the backup copy size rather
large, and that is the fact that backup’s copy-before-write causes guest
writes to stall.  So if the guest just writes a bit of data, a 4 MB
buffer size may mean that in the background it will have to wait for 4
MB of data to be copied.[1]

Hm.  OTOH, we have the same problem already with the target’s cluster
size, which can of course be 4 MB as well.  But I can imagine it to
actually be important for the target, because otherwise there might be
read-modify-write cycles.

But for the source, I still don’t quite understand why rbd has such a
problem with small read requests.  I don’t doubt that it has (as you
explained), but again, how is it then even possible to use rbd as the
backend for a guest that has no idea of this requirement?  Does Linux
really prefill the page cache with 4 MB of data for each read?

Max


[1] I suppose what we could do is decouple the copy buffer size from the
bitmap granularity, but that would be more work than just a MAX() in
backup_calculate_cluster_size().





Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Max Reitz
On 06.11.19 14:09, Dietmar Maurer wrote:
>> Let me elaborate: Yes, a cluster size generally means that it is most
>> “efficient” to access the storage at that size.  But there’s a tradeoff.
>>  At some point, reading the data takes sufficiently long that reading a
>> bit of metadata doesn’t matter anymore (usually, that is).
> 
> Any network storage suffers from long network latencies, so it always
> matters if you do more IOs than necessary.

Yes, exactly, that’s why I’m saying it makes sense to me to increase the
buffer size from the measly 64 kB that we currently have.  I just don’t
see the point of increasing it exactly to the source cluster size.

>> There is a bit of a problem with making the backup copy size rather
>> large, and that is the fact that backup’s copy-before-write causes guest
>> writes to stall. So if the guest just writes a bit of data, a 4 MB
>> buffer size may mean that in the background it will have to wait for 4
>> MB of data to be copied.[1]
> 
> We have been using this in production for several years now, and it is not a problem.
> (Ceph storage is mostly on 10G (or faster) network equipment).

So you mean for cases where backup already chooses a 4 MB buffer size
because the target has that cluster size?

>> Hm.  OTOH, we have the same problem already with the target’s cluster
>> size, which can of course be 4 MB as well.  But I can imagine it to
>> actually be important for the target, because otherwise there might be
>> read-modify-write cycles.
>>
>> But for the source, I still don’t quite understand why rbd has such a
>> problem with small read requests.  I don’t doubt that it has (as you
>> explained), but again, how is it then even possible to use rbd as the
>> backend for a guest that has no idea of this requirement?  Does Linux
>> really prefill the page cache with 4 MB of data for each read?
> 
> No idea. I just observed that upstream qemu backups with ceph are 
> quite unusable this way.

Hm, OK.

Max





Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Dietmar Maurer


> On 6 November 2019 14:17 Max Reitz  wrote:
> 
>  
> On 06.11.19 14:09, Dietmar Maurer wrote:
> >> Let me elaborate: Yes, a cluster size generally means that it is most
> >> “efficient” to access the storage at that size.  But there’s a tradeoff.
> >>  At some point, reading the data takes sufficiently long that reading a
> >> bit of metadata doesn’t matter anymore (usually, that is).
> > 
> > Any network storage suffers from long network latencies, so it always
> > matters if you do more IOs than necessary.
> 
> Yes, exactly, that’s why I’m saying it makes sense to me to increase the
> buffer size from the measly 64 kB that we currently have.  I just don’t
> see the point of increasing it exactly to the source cluster size.
> 
> >> There is a bit of a problem with making the backup copy size rather
> >> large, and that is the fact that backup’s copy-before-write causes guest
> >> writes to stall. So if the guest just writes a bit of data, a 4 MB
> >> buffer size may mean that in the background it will have to wait for 4
> >> MB of data to be copied.[1]
> > 
> > We have been using this in production for several years now, and it is not a problem.
> > (Ceph storage is mostly on 10G (or faster) network equipment).
> 
> So you mean for cases where backup already chooses a 4 MB buffer size
> because the target has that cluster size?

To make it clear: backups from Ceph as the source are slow.

That is why we use a patched qemu version, which uses:

cluster_size = Max_Block_Size(source, target)

(I guess this only triggers for ceph)
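
Roughly, the patch amounts to something like this (a hypothetical standalone
sketch of the arithmetic only, not our actual patch):

/* Hypothetical sketch of the patched calculation (not the actual
 * Proxmox patch): start from the target-derived value and additionally
 * take the source's cluster size into account. */
#include <stdint.h>

#define BACKUP_CLUSTER_SIZE_DEFAULT (1 << 16)   /* 64 KiB */
#define MAX(a, b) ((a) > (b) ? (a) : (b))

int64_t patched_cluster_size(int64_t source_cluster_size,
                             int64_t target_cluster_size)
{
    int64_t cluster_size = MAX(BACKUP_CLUSTER_SIZE_DEFAULT,
                               target_cluster_size);

    /* With rbd (4 MiB objects) as the source this yields 4 MiB even
     * when the target is a local raw file. */
    return MAX(cluster_size, source_cluster_size);
}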




Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Dietmar Maurer
> The thing is, it just seems unnecessary to me to take the source cluster
> size into account in general.  It seems weird that a medium only allows
> 4 MB reads, because, well, guests aren’t going to take that into account.

Maybe it is strange, but it is quite obvious that there is an optimal cluster
size for each storage type (4M in case of ceph)...




Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Dietmar Maurer
> And if it issues a smaller request, there is no way for a guest device
> to tell it “OK, here’s your data, but note we have a whole 4 MB chunk
> around it, maybe you’d like to take that as well...?”
> 
> I understand wanting to increase the backup buffer size, but I don’t
> quite understand why we’d want it to increase to the source cluster size
> when the guest also has no idea what the source cluster size is.

Because it is more efficient.




Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Dietmar Maurer
> Let me elaborate: Yes, a cluster size generally means that it is most
> “efficient” to access the storage at that size.  But there’s a tradeoff.
>  At some point, reading the data takes sufficiently long that reading a
> bit of metadata doesn’t matter anymore (usually, that is).

Any network storage suffers from long network latencies, so it always
matters if you do more IOs than necessary.

> There is a bit of a problem with making the backup copy size rather
> large, and that is the fact that backup’s copy-before-write causes guest
> writes to stall. So if the guest just writes a bit of data, a 4 MB
> buffer size may mean that in the background it will have to wait for 4
> MB of data to be copied.[1]

We have been using this in production for several years now, and it is not a problem.
(Ceph storage is mostly on 10G (or faster) network equipment).

> Hm.  OTOH, we have the same problem already with the target’s cluster
> size, which can of course be 4 MB as well.  But I can imagine it to
> actually be important for the target, because otherwise there might be
> read-modify-write cycles.
> 
> But for the source, I still don’t quite understand why rbd has such a
> problem with small read requests.  I don’t doubt that it has (as you
> explained), but again, how is it then even possible to use rbd as the
> backend for a guest that has no idea of this requirement?  Does Linux
> really prefill the page cache with 4 MB of data for each read?

No idea. I just observed that upstream qemu backups with ceph are 
quite unusable this way.




Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Max Reitz
On 06.11.19 14:34, Dietmar Maurer wrote:
> 
>> On 6 November 2019 14:17 Max Reitz  wrote:
>>
>>  
>> On 06.11.19 14:09, Dietmar Maurer wrote:
>>>> Let me elaborate: Yes, a cluster size generally means that it is most
>>>> “efficient” to access the storage at that size.  But there’s a tradeoff.
>>>> At some point, reading the data takes sufficiently long that reading a
>>>> bit of metadata doesn’t matter anymore (usually, that is).
>>>
>>> Any network storage suffers from long network latencies, so it always
>>> matters if you do more IOs than necessary.
>>
>> Yes, exactly, that’s why I’m saying it makes sense to me to increase the
>> buffer size from the measly 64 kB that we currently have.  I just don’t
>> see the point of increasing it exactly to the source cluster size.
>>
>>>> There is a bit of a problem with making the backup copy size rather
>>>> large, and that is the fact that backup’s copy-before-write causes guest
>>>> writes to stall. So if the guest just writes a bit of data, a 4 MB
>>>> buffer size may mean that in the background it will have to wait for 4
>>>> MB of data to be copied.[1]
>>>
>>> We have been using this in production for several years now, and it is not a problem.
>>> (Ceph storage is mostly on 10G (or faster) network equipment).
>>
>> So you mean for cases where backup already chooses a 4 MB buffer size
>> because the target has that cluster size?
> 
> To make it clear: backups from Ceph as the source are slow.

Yep, but if the target were another ceph instance, the backup buffer
size would be chosen to be 4 MB (AFAIU), so I was wondering whether you
are referring to this effect, or to...

> That is why we use a patched qemu version, which uses:
> 
> cluster_size = Max_Block_Size(source, target)

...this.

The main problem with the stall I mentioned is that I think one of the
main use cases of backup is having a fast source and a slow (off-site)
target.  In such cases, I suppose it becomes annoying if some guest
writes (which were fast before the backup started) take a long time
because the backup needs to copy quite a bit of data to off-site storage.

(And blindly taking the source cluster size would mean that such things
could happen if you use local qcow2 files with 2 MB clusters.)


So I’d prefer decoupling the backup buffer size from the bitmap
granularity, and then setting the buffer size to maybe the MAX of the
source and target cluster sizes.  But I don’t know when I can get around
to doing that.

And then probably also cap it at 4 MB or 8 MB, because that happens to
be what you need, but I’d prefer for it not to use tons of memory.  (The
mirror job uses 1 MB per request, for up to 16 parallel requests; and
the backup copy-before-write implementation currently (on master) copies
1 MB at a time (per concurrent request), and the whole memory usage of
backup is limited at 128 MB.)

(OTOH, the minimum should probably be 1 MB.)
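
Just to spell out the clamping I mean, a throwaway sketch (hypothetical
helper, not a patch):

/* Sketch of the clamping described above (illustration only, not a
 * patch): follow the larger of the two cluster sizes, but never go
 * below 1 MiB and never above 8 MiB. */
#include <stdint.h>

#define MiB (INT64_C(1) << 20)
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

int64_t backup_buffer_size(int64_t source_cluster_size,
                           int64_t target_cluster_size)
{
    int64_t size = MAX(source_cluster_size, target_cluster_size);

    return MIN(MAX(size, 1 * MiB), 8 * MiB);
}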

Max





Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Vladimir Sementsov-Ogievskiy
06.11.2019 16:52, Max Reitz wrote:
> On 06.11.19 14:34, Dietmar Maurer wrote:
>>
>>> On 6 November 2019 14:17 Max Reitz  wrote:
>>>
>>>   
>>> On 06.11.19 14:09, Dietmar Maurer wrote:
>>>>> Let me elaborate: Yes, a cluster size generally means that it is most
>>>>> “efficient” to access the storage at that size.  But there’s a tradeoff.
>>>>> At some point, reading the data takes sufficiently long that reading a
>>>>> bit of metadata doesn’t matter anymore (usually, that is).
>>>>
>>>> Any network storage suffers from long network latencies, so it always
>>>> matters if you do more IOs than necessary.
>>>
>>> Yes, exactly, that’s why I’m saying it makes sense to me to increase the
>>> buffer size from the measly 64 kB that we currently have.  I just don’t
>>> see the point of increasing it exactly to the source cluster size.
>>>
>>>>> There is a bit of a problem with making the backup copy size rather
>>>>> large, and that is the fact that backup’s copy-before-write causes guest
>>>>> writes to stall. So if the guest just writes a bit of data, a 4 MB
>>>>> buffer size may mean that in the background it will have to wait for 4
>>>>> MB of data to be copied.[1]
>>>>
>>>> We have been using this in production for several years now, and it is not a problem.
>>>> (Ceph storage is mostly on 10G (or faster) network equipment).
>>>
>>> So you mean for cases where backup already chooses a 4 MB buffer size
>>> because the target has that cluster size?
>>
>> To make it clear: backups from Ceph as the source are slow.
> 
> Yep, but if the target were another ceph instance, the backup buffer
> size would be chosen to be 4 MB (AFAIU), so I was wondering whether you
> are referring to this effect, or to...
> 
>> That is why we use a patched qemu version, which uses:
>>
>> cluster_size = Max_Block_Size(source, target)
> 
> ...this.
> 
> The main problem with the stall I mentioned is that I think one of the
> main use cases of backup is having a fast source and a slow (off-site)
> target.  In such cases, I suppose it becomes annoying if some guest
> writes (which were fast before the backup started) take a long time
> because the backup needs to copy quite a bit of data to off-site storage.
> 
> (And blindly taking the source cluster size would mean that such things
> could happen if you use local qcow2 files with 2 MB clusters.)
> 
> 
> So I’d prefer decoupling the backup buffer size from the bitmap
> granularity, and then setting the buffer size to maybe the MAX of the
> source and target cluster sizes.  But I don’t know when I can get around
> to doing that.

Note that the problem is not only in copy-before-write operations: if we
have a big in-flight backup request from the backup job itself, all new
incoming guest writes to this area will have to wait.

> 
> And then probably also cap it at 4 MB or 8 MB, because that happens to
> be what you need, but I’d prefer for it not to use tons of memory.  (The
> mirror job uses 1 MB per request, for up to 16 parallel requests; and
> the backup copy-before-write implementation currently (on master) copies
> 1 MB at a time (per concurrent request), and the whole memory usage of
> backup is limited at 128 MB.)
> 
> (OTOH, the minimum should probably be 1 MB.)
> 

Hmm, I am preparing a patch set for backup which includes increasing the
copied chunk size... and somehow it leads to performance degradation on my
HDD.


===


What about the following solution: add an empty qcow2 with cluster_size = 4M
(ohh, 2M is the maximum unfortunately) on top of ceph, enable copy-on-read on
this node, and start the backup from it?  The qcow2 node would act as a local
cache, which would solve both the problem of unaligned reads from ceph and the
copy-before-write time.
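
For illustration, something along these lines (untested sketch; node names,
the rbd pool/image and the image size are placeholders, and "..." stands for
the rest of the command line):

# untested sketch -- node names, pool/image and size are placeholders;
# the cache image is an empty qcow2 with 2M clusters created up front
qemu-img create -f qcow2 -o cluster_size=2M cache.qcow2 100G

qemu-system-x86_64 ... \
  -blockdev driver=rbd,node-name=rbd0,pool=rbd,image=vm-100-disk-0 \
  -blockdev driver=qcow2,node-name=cache0,backing=rbd0,file.driver=file,file.filename=cache.qcow2 \
  -blockdev driver=copy-on-read,node-name=cor0,file=cache0
# ... and then point the backup job's source at node 'cor0'.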

-- 
Best regards,
Vladimir