Re: [DRBD-user] Proxmox with Linstor: Online migration / disk move problem

2021-12-02 Thread Roland Kammerer
On Thu, Nov 25, 2021 at 11:23:35AM +0100, Roland Kammerer wrote:
> I reproduced it (1G alpine image from local-lvm), let's see what I can
> find out, currently I don't need input from your side.

Almost forgot to report back "my" findings:

tl;tr: "seems like qemu does not like moving from a smaller to a bigger disk
here.."

So please use offline migration for now. More details and links to forum
posts and Proxmox bug reports are here:

https://lists.proxmox.com/pipermail/pve-devel/2021-November/051103.html
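
For reference, the offline workaround as CLI commands. This is just a
sketch: the VM ID, disk slot, and storage ID are made-up examples, so
adjust them to your setup.

  qm stop 116                            # disk move has to happen offline
  qm move_disk 116 scsi1 linstor-local   # copy the disk onto LINSTOR
  qm start 116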

Regards, rck


Re: [DRBD-user] Proxmox with Linstor: Online migration / disk move problem

2021-11-25 Thread G. Milo
Just in case it helps, I did some additional tests, and the issue for me
seems to be narrowed down to the following...

Migration (online) from local LVM/ZFS/DIR to ZFS-backed LINSTOR storage
always succeeds, so the problem appears to be isolated to migrating from
local storage to ThinLVM-based LINSTOR storage.
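
A quick way to compare what the two backends actually allocate, in exact
bytes (pool and volume names here are hypothetical, adjust to yours):

  # thin-LVM backed LINSTOR volumes:
  lvs --units b --nosuffix -o lv_name,lv_size linstor_vg

  # ZFS backed LINSTOR volumes:
  zfs get -Hp volsize rpool/linstor/vm-100-disk-1_00000

LVM rounds volumes up to extent boundaries (4 MiB by default), so a
thin-LVM backed target can end up a bit bigger than the source, which
may not happen the same way with ZFS.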

On Thu, 25 Nov 2021 at 10:23, Roland Kammerer wrote:

> On Thu, Nov 25, 2021 at 08:57:45AM +0100, Roland Kammerer wrote:
> > On Wed, Nov 24, 2021 at 02:00:38PM +, G. Milo wrote:
> > > I was able to reproduce this issue on PVE 6.4 as well with the latest
> > > packages installed. Never used this combination before, so I'm not
> > > sure if it is something that started happening recently after updating
> > > PVE or LINSTOR packages. The task is cancelled almost immediately,
> > > without starting the migration process at all, and the new linstor
> > > resource is removed instantly as well.
> > >
> > > https://pastebin.com/i4yuKYyp
> >
> > okay... so what do we have:
> > - it can happen from local lvm (Łukasz) and local zfs (Milo)
> > - it can happen with about 32G (Łukasz) and smaller 11G (Milo)
> >
> > Milo, as you seem to be able to reproduce it immediately, can you try
> > smaller volumes, like 2G? Does it happen with those as well?
> >
> > Does it need to be a running VM, or can it happen if the VM is turned
> > off as well?
> >
> > I will try to reproduce that later today/tomorrow, all information that
> > narrows that down a bit might help.
>
> I reproduced it (1G alpine image from local-lvm), let's see what I can
> find out, currently I don't need input from your side.
>
> Thanks, rck


Re: [DRBD-user] Proxmox with Linstor: Online migration / disk move problem

2021-11-25 Thread Roland Kammerer
On Thu, Nov 25, 2021 at 08:57:45AM +0100, Roland Kammerer wrote:
> On Wed, Nov 24, 2021 at 02:00:38PM +, G. Milo wrote:
> > I was able to reproduce this issue on PVE 6.4 as well with the latest
> > packages installed. Never used this combination before, so I'm not sure if
> > it is something that started happening recently after updating PVE or
> > LINSTOR packages. The task is cancelled almost immediately, without
> > starting the migration process at all, and the new linstor resource is
> > removed instantly as well.
> > 
> > https://pastebin.com/i4yuKYyp
> 
> okay... so what do we have:
> - it can happen from local lvm (Łukasz) and local zfs (Milo)
> - it can happen with about 32G (Łukasz) and smaller 11G (Milo)
> 
> Milo, as you seem to be able to reproduce it immediately, can you try
> smaller volumes, like 2G? Does it happen with those as well?
> 
> Does it need to be a running VM, or can it happen if the VM is turned
> off as well?
> 
> I will try to reproduce that later today/tomorrow, all information that
> narrows that down a bit might help.

I reproduced it (1G alpine image from local-lvm), let's see what I can
find out, currently I don't need input from your side.

Thanks, rck


Re: [DRBD-user] Proxmox with Linstor: Online migration / disk move problem

2021-11-24 Thread Roland Kammerer
On Wed, Nov 24, 2021 at 02:00:38PM +, G. Milo wrote:
> I was able to reproduce this issue on PVE 6.4 as well with the latest
> packages installed. Never used this combination before, so I'm not sure if
> it is something that started happening recently after updating PVE or
> LINSTOR packages. The task is cancelled almost immediately, without
> starting the migration process at all, and the new linstor resource is
> removed instantly as well.
> 
> https://pastebin.com/i4yuKYyp

okay... so what do we have:
- it can happen from local lvm (Łukasz) and local zfs (Milo)
- it can happen with about 32G (Łukasz) and smaller 11G (Milo)

Milo, as you seem to be able to reproduce it immediately, can you try
smaller volumes, like 2G? Does it happen with those as well?

Does it need to be a running VM, or can it happen if the VM is turned
off as well?

I will try to reproduce that later today/tomorrow, all information that
narrows that down a bit might help.

Thanks, rck


Re: [DRBD-user] Proxmox with Linstor: Online migration / disk move problem

2021-11-24 Thread G. Milo
I was able to reproduce this issue on PVE 6.4 as well with the latest
packages installed. Never used this combination before, so I'm not sure if
it is something that started happening recently after updating PVE or
LINSTOR packages. The task is cancelled almost immediately, without
starting the migration process at all, and the new linstor resource is
removed instantly as well.

https://pastebin.com/i4yuKYyp


Re: [DRBD-user] Proxmox with Linstor: Online migration / disk move problem

2021-11-22 Thread Roland Kammerer
On Mon, Nov 22, 2021 at 03:28:10PM +0100, Łukasz Wąsikowski wrote:
> Hi,
> 
> I'm trying to migrate VM storage to Linstor SDS and have some odd troubles.
> All nodes are running Proxmox VE 7.1:
> 
> pve-manager/7.1-5/6fe299a0 (running kernel: 5.13.19-1-pve)
> 
> Linstor storage is, for now, on one host. When I create new VM on linstor it
> works. When I try to migrate VM from another host (and another storage) to
> Linstor it fails:
> 
> 2021-11-22 13:06:53 starting migration of VM 116 to node 'proxmox-ve3'
> (192.168.8.203)
> 2021-11-22 13:06:53 found local disk 'local-lvm:vm-116-disk-0' (in current
> VM config)
> 2021-11-22 13:06:53 starting VM 116 on remote node 'proxmox-ve3'
> 2021-11-22 13:07:01 volume 'local-lvm:vm-116-disk-0' is
> 'linstor-local:vm-116-disk-1' on the target
> 2021-11-22 13:07:01 start remote tunnel
> 2021-11-22 13:07:03 ssh tunnel ver 1
> 2021-11-22 13:07:03 starting storage migration
> 2021-11-22 13:07:03 scsi1: start migration to
> nbd:unix:/run/qemu-server/116_nbd.migrate:exportname=drive-scsi1
> drive mirror is starting for drive-scsi1 with bandwidth limit: 51200 KB/s
> drive-scsi1: Cancelling block job
> drive-scsi1: Done.
> 2021-11-22 13:07:03 ERROR: online migrate failure - block job (mirror)
> error: drive-scsi1: 'mirror' has been cancelled
> 2021-11-22 13:07:03 aborting phase 2 - cleanup resources
> 2021-11-22 13:07:03 migrate_cancel
> 2021-11-22 13:07:08 ERROR: migration finished with problems (duration
> 00:00:16)
> TASK ERROR: migration problems
> 
> Linstor volumes are created during migration, no errors in its logs. I
> don't know why Proxmox is cancelling this job.
> 
> When I try to move disk from NFS to Linstor (online) it fails:
> 
> create full clone of drive scsi0 (nfs-backup:129/vm-129-disk-0.qcow2)
> 
> NOTICE
> Trying to create diskful resource (vm-129-disk-1) on (proxmox-ve3).
> drive mirror is starting for drive-scsi0 with bandwidth limit: 51200 KB/s
> drive-scsi0: Cancelling block job
> drive-scsi0: Done.
> TASK ERROR: storage migration failed: block job (mirror) error: drive-scsi0:
> 'mirror' has been cancelled
> 
> 
> To move storage to Linstor I first have to move it to NFS (online), turn
> off the VM, and move the VM storage offline to Linstor. The bizarre thing
> is that once I do this, I can move this particular VM's storage from
> Linstor to NFS online and from NFS to Linstor online. I can also migrate
> the VM online, from Linstor, directly to another node and another storage
> without problems.
> 
> I've set up a test cluster to reproduce this problem and couldn't -
> online migration to Linstor storage just worked. I don't know why it's
> not working on the main cluster - any hints on how to debug it?

Hi Łukasz,

I have heard of that once before, but never experienced it myself, and so
far no customers have complained, so I did not dive into it.

If you can reproduce it, that would be highly appreciated. To me it
looks like the plugin and LINSTOR basically did their job, but then
something else happens. These are just random thoughts that might be
complete nonsense:

- maybe some size rounding error and the resulting DRBD device is just a
  tiny bit too small. If you can reproduce it, I would check the sizes of
  source/destination. If it only fails at the end, it should at least
  start writing data. So does it take some time till it fails? Do you see
  that some data was written at the beginning of the DRBD block device
  that matches the source? But maybe there is already a size check at the
  beginning and it fails fast, who knows. Maybe try with a VM that has
  exactly the same size as the failing one in production.
- some race where the DRBD device isn't actually ready before the
  migration wants to write data. Maybe there is more time before a disk
  gets used when a VM is created vs. when existing data is written to a
  freshly created device during migration.
- check dmesg to see what happened on DRBD level
- start grepping for the error msgs in pve/pve-storage to see when and
  why these errors happen. Find out what tool/function gets called, then
  manually call that tool several times in some "linstor spawn &&
  $magic_tool" fashion to trigger a race (if there is one); a rough
  sketch of that idea follows below.

HTH, rck
___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user