[DRBD-user] Out-of-sync blocks with VMware Workstation

Lionel Sausin Tue, 12 Nov 2013 01:47:16 -0800

Dear DRBD users and developers,

We have 2 clusters with several "simple" resources (single master, single resource per connection). Each resource contains either an LXC container or a VMware virtual machine (VMware Workstation v9.0.2).

We run "drbdadm verify all" every weekend, and we noticed that the resources hosting VMware machines often have out-of-sync blocks, while the ones hosting LXCs never have any.
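To see which resources a verify run flagged, one can read the `oos:` counter (out-of-sync storage, in KiB) that DRBD 8.3/8.4 reports for each minor in `/proc/drbd` after `drbdadm verify` completes. The following is a minimal sketch that assumes the usual `/proc/drbd` layout: each minor's status line begins with ` N: cs:`, and a statistics line that ends with `oos:` follows it. The function name `out_of_sync_kib` is hypothetical.

```python
import re
from typing import Dict, Optional

# Matches a minor's status line, e.g. " 0: cs:Connected ro:Primary/Secondary ..."
MINOR_RE = re.compile(r"^\s*(\d+):\s+cs:")
# Matches the out-of-sync counter (KiB) on the statistics line, e.g. "... wo:f oos:0"
OOS_RE = re.compile(r"\boos:(\d+)")


def out_of_sync_kib(proc_drbd: str) -> Dict[int, int]:
    """Map each DRBD minor number to its out-of-sync amount in KiB."""
    result: Dict[int, int] = {}
    current: Optional[int] = None
    for line in proc_drbd.splitlines():
        m = MINOR_RE.match(line)
        if m:
            # Remember which minor the following statistics line belongs to.
            current = int(m.group(1))
            continue
        m = OOS_RE.search(line)
        if m and current is not None:
            result[current] = int(m.group(1))
    return result
```

On a node, `out_of_sync_kib(open("/proc/drbd").read())` gives one entry per configured minor. Any non-zero value after the weekly verify indicates a resource that needs a disconnect/connect cycle to resync the flagged blocks.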

We've seen this on both clusters (config quoted below).

I was wondering what could be causing these out-of-sync blocks.
Could VMware be modifying data while it is in flight?
Is there a way to make sure?

Thanks in advance to anyone who can shed some light on this issue.
Lionel Sausin.


---
Config on the oldest cluster:

Ubuntu 10.04, kernel 2.6.32 and DRBD 8.3.13. Its resources use ext4 mounted with -o nobarrier. The storage is a hardware RAID10 on SSDs.


global {
        usage-count yes;
}

common {

        protocol C;

        # Actions to take in the face of special events
        handlers {
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
                split-brain "/usr/lib/drbd/notify-split-brain.sh <<<cut>>>";
                out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh <<<cut>>>";
                before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 5 -- -c 16k";
                after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
        }

        startup {
        }

        net {
                # Restrict access to the resources with a shared secret
                cram-hmac-alg md5;
                shared-secret <<<cut>>>;
                
                # Congestion management lets writes flow without disconnecting
                # on-congestion pull-ahead;
                # congestion-fill 1M;
                
                # Go StandAlone if the peer is unreachable for too long
                ko-count 10;
                # Allow 6 seconds (timeout is in tenths of a second) for the other node's reply before we drop connections
                timeout 60;
        }

        syncer {
                # Compress the dirty-bitmaps
                use-rle;

                # Use checksumming to allow online verification
                # sha1 has a lower chance of hash collisions but is CPU-hungry (Noa's CPU can only process up to 60MB/s)
                verify-alg md5;
                
                # Resync checksumming while verifying used to lead to a deadlock, fixed in v8.3.11
                csums-alg md5;

                # Adaptive syncer rate: let DRBD decide the best sync speed
                #   initial sync rate
                rate 50M;
                #   size of the rate adaptation window
                c-plan-ahead 20;
                #   min/max rate
                #   The network will allow only up to ~110MB/s, but verify and identical-block resyncs use very little network bandwidth
                c-max-rate 800M;
                #   quantity of sync data to keep in the buffers (affects the length of the wait queue)
                c-fill-target 100k;

                # Limit the bandwidth available for resync on the primary node when DRBD detects application I/O
                c-min-rate 8M;

                al-extents 1023;
        }
}
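Simple arithmetic is enough to check two of the settings above. `timeout` is given in tenths of a second, which makes `timeout 60` equal to the 6 seconds stated in the comment. The drbd.conf manual says a peer is expelled if it fails to complete a single write within `ko-count` times `timeout`, which works out to roughly 60 seconds with this configuration. The comment on `c-max-rate` depends on the fact that verify and checksum-based resync exchange digests instead of full blocks. The sketch below estimates that digest payload. It simplifies in two ways: it assumes one digest per 4 KiB block, and it ignores packet headers. The function names are hypothetical.

```python
# Digest sizes in bytes for the algorithms used in these configs.
DIGEST_BYTES = {"md5": 16, "sha1": 20}


def ko_window_seconds(timeout_tenths: int, ko_count: int) -> float:
    """Time a peer may take to complete one write before being expelled."""
    # DRBD's net timeout is configured in units of 0.1 s.
    return timeout_tenths / 10 * ko_count


def digest_traffic_mb_s(verify_rate_mb_s: float, alg: str, block_bytes: int = 4096) -> float:
    """Approximate digest payload sent over the network during verify/csums resync.

    Assumes one digest per block and ignores protocol headers, so the real
    traffic is somewhat higher than this estimate.
    """
    return verify_rate_mb_s * DIGEST_BYTES[alg] / block_bytes
```

Under these assumptions, `ko_window_seconds(60, 10)` comes to 60 seconds. Verifying with md5 at the 800 MB/s `c-max-rate` ceiling puts only about 3 MB/s of digests on the wire, well under the ~110 MB/s link.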

Typical resource config:

resource openerp {
  device    /dev/drbd_openerp minor 0;
  meta-disk internal;
  on NodeA {
    address   10.100.1.2:7788;
    disk      /dev/fast_vol/openerp;
  }
  on NodeB {
    address   10.100.1.3:7788;
    disk      /dev/fast_vol/openerp;
  }
}

---
Config on the newest cluster:

Ubuntu 12.04, kernel 3.8 (raring stack) and DRBD 8.4.2. Its resources use ext4 with default options, and were created with -b 4096 -E stride=64,stripe-width=192. The storage is an SSD on the primary node and a hardware RAID5 on the other side.
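The `-E stride=64,stripe-width=192` options follow the usual ext4 RAID alignment rule. `stride` is the RAID chunk size divided by the filesystem block size, and `stripe-width` is `stride` multiplied by the number of data-bearing disks. The post does not give the RAID geometry. For illustration only, the values are consistent with a 256 KiB chunk and three data disks (for example, a four-disk RAID5), but that is an assumption. The helper name below is hypothetical.

```python
from typing import Tuple


def ext4_raid_alignment(chunk_bytes: int, block_bytes: int, data_disks: int) -> Tuple[int, int]:
    """Return (stride, stripe_width) in filesystem blocks for mkfs.ext4 -E."""
    if chunk_bytes % block_bytes:
        raise ValueError("RAID chunk size must be a multiple of the fs block size")
    stride = chunk_bytes // block_bytes      # fs blocks per RAID chunk
    stripe_width = stride * data_disks       # fs blocks per full data stripe
    return stride, stripe_width
```

With the assumed geometry, `ext4_raid_alignment(256 * 1024, 4096, 3)` gives `(64, 192)`. This matches the `-b 4096 -E stride=64,stripe-width=192` options quoted above.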


global_common.conf:

global {
    usage-count yes;
}

common {
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        split-brain "/usr/lib/drbd/notify-split-brain.sh <<<cut>>>";
        out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh <<<cut>>>";
        before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
        after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
    }

    startup {
    }

    options {
    }

    disk {
    }

    net {
        # Peer authentication
        cram-hmac-alg sha1;
        shared-secret <<<cut>>>;

        # Skip sync when checksum matches
        csums-alg sha1;

        # Enable online verify
        verify-alg md5;
    }
}

Typical resource config:

resource web {
    device /dev/drbd_web minor 4;
    # Master
    on vmhost7 {
        address 10.100.0.14:7804;
        disk /dev/data/web;
        meta-disk internal;
    }
    # Slave
    on stockagec {
        address 10.100.0.13:7804;
        disk /dev/data1/web;
        meta-disk internal;
    }
}

_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user