On Nov 29, 2006, at 13:24, [EMAIL PROTECTED] wrote:


I suspect a lack of an MBR could cause some BIOS implementations to
barf ..

Why?

Zeroed disks don't have that issue either.


you're right - I was thinking that a lack of an MBR with a GPT could be causing problems, but actually it looks like we do write a protective MBR in efi_write() - so it's either going to be the GPT header at LBA1 or backwards compatibility with the version 1.00 spec that the BIOS vendors aren't dealing with correctly. Proprietary BIOS RAID signatures do sound quite plausible as a common cause for problems.

Digging a little deeper, I'm thinking some of our EFI code might be a little old ..

in efi_partition.h we've got the following defined for dk_gpt and dk_part:
    /* Solaris library abstraction for EFI partitons */
    typedef struct dk_part  {
            diskaddr_t      p_start;        /* starting LBA */
            diskaddr_t      p_size;         /* size in blocks */
            struct uuid     p_guid;         /* partion type GUID */
            ushort_t        p_tag;          /* converted to part'n type GUID */
            ushort_t        p_flag;         /* attributes */
            char            p_name[EFI_PART_NAME_LEN]; /* partition name */
            struct uuid     p_uguid;        /* unique partition GUID */
            uint_t          p_resv[8];      /* future use - set to zero */
    } dk_part_t;

    /* Solaris library abstraction for an EFI GPT */
    #define EFI_VERSION102          0x00010002
    #define EFI_VERSION100          0x00010000
    #define EFI_VERSION_CURRENT     EFI_VERSION100
    typedef struct dk_gpt {
            uint_t          efi_version;    /* set to EFI_VERSION_CURRENT */
            uint_t          efi_nparts;     /* number of partitions below */
            uint_t          efi_part_size;  /* size of each partition entry */
                                            /* efi_part_size is unused */
            uint_t          efi_lbasize;    /* size of block in bytes */
            diskaddr_t      efi_last_lba;   /* last block on the disk */
            diskaddr_t      efi_first_u_lba; /* first block after labels */
            diskaddr_t      efi_last_u_lba; /* last block before backup labels */
            struct uuid     efi_disk_uguid; /* unique disk GUID */
            uint_t          efi_flags;
            uint_t          efi_reserved[15]; /* future use - set to zero */
            struct dk_part  efi_parts[1];   /* array of partitions */
    } dk_gpt_t;

which looks like we're using the EFI Version 1.00 spec. Looking at cmd/zpool/zpool_vdev.c, we call efi_write(), which does the label and writes the PMBR at LBA0 (first 512B block) and the EFI header at LBA1, and should reserve the next 16KB (LBA2 through LBA33) for the partition table entries .. [now we really should be using EFI version 1.10 with the -001 addendum (which is what 1.02 morphed into about 5 years back) or version 2.0 in the UEFI space .. but that's a separate discussion, as the address boundaries haven't really changed for device labels.]
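
To make the front-of-disk arithmetic concrete, here is a rough sketch of how LBA34 falls out. It assumes the usual GPT defaults of 128 partition entries at 128 bytes each; the macro and function names are made up for illustration and are not from libefi:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed GPT defaults: 128 entries of 128 bytes each (16KB total). */
#define GPT_ENTRY_COUNT 128u
#define GPT_ENTRY_SIZE  128u

/*
 * Hypothetical helper (not a libefi function): first LBA that is free
 * for partitions, given the block size in bytes.
 */
static uint64_t
gpt_first_usable_lba(uint32_t lbasize)
{
	assert(lbasize != 0);

	uint64_t table_bytes = (uint64_t)GPT_ENTRY_COUNT * GPT_ENTRY_SIZE;
	/* Round the entry array up to whole blocks. */
	uint64_t table_blocks = (table_bytes + lbasize - 1) / lbasize;

	/* LBA0 = protective MBR, LBA1 = GPT header, entries start at LBA2. */
	return (2 + table_blocks);
}
```

With 512B blocks the entry array takes 32 blocks (LBA2 through LBA33), so the first usable block is LBA34, which matches efi_first_u_lba in the discussion here.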

in uts/common/fs/zfs/vdev_label.c we define the zfs boot block
    /*
     * Initialize boot block header.
     */
    vb = zio_buf_alloc(sizeof (vdev_boot_header_t));
    bzero(vb, sizeof (vdev_boot_header_t));
    vb->vb_magic = VDEV_BOOT_MAGIC;
    vb->vb_version = VDEV_BOOT_VERSION;
    vb->vb_offset = VDEV_BOOT_OFFSET;
    vb->vb_size = VDEV_BOOT_SIZE;

which gets written down at the 8KB boundary after we start usable space from LBA34:
    vtoc->efi_parts[0].p_start = vtoc->efi_first_u_lba;

[note: 17KB isn't typically well aligned for most logical volumes .. it would probably be better to start writing data at LBA1024 so we stay well aligned for logical volumes with stripe widths up to 512KB and avoid the R/M/W misalignment that can occur there .. currently with a 256KB vdev label, I believe we start the data portion out on LBA546 which seems like a problem]
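
The alignment concern can be checked with a little arithmetic. This is only a sketch: the helper names are hypothetical, and the LBA546 figure is the estimate from the note above (LBA34 plus 512 blocks for a 256KB label), not something verified against the code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: is the byte offset of this LBA a multiple of the stripe width? */
static int
lba_is_aligned(uint64_t lba, uint32_t lbasize, uint64_t stripe_bytes)
{
	assert(lbasize != 0 && stripe_bytes != 0);
	return ((lba * lbasize) % stripe_bytes == 0);
}

/* Hypothetical helper: next LBA at or after 'lba' that sits on a stripe boundary. */
static uint64_t
lba_round_up(uint64_t lba, uint32_t lbasize, uint64_t stripe_bytes)
{
	assert(lbasize != 0 && stripe_bytes != 0);
	/* The stripe width must be a whole number of blocks. */
	assert(stripe_bytes % lbasize == 0);

	uint64_t off = lba * lbasize;
	uint64_t aligned = ((off + stripe_bytes - 1) / stripe_bytes) * stripe_bytes;

	return (aligned / lbasize);
}
```

With 512B blocks, LBA34 is byte offset 17408 (17KB) and LBA546 is byte offset 279552 (273KB), neither of which is a multiple of a 64KB or 512KB stripe. Rounding either one up to a 512KB boundary gives LBA1024.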

and then we apparently store a backup vtoc right before the backup partition table entries and backup GPT:
    vtoc->efi_parts[0].p_size = vtoc->efi_last_u_lba + 1 -
        vtoc->efi_first_u_lba - resv;

this next bit is interesting since we should probably define a GUID for ZFS partitions that points to the ZFS vdev label instead of using V_USR
    /*
     * Why we use V_USR: V_BACKUP confuses users, and is considered
     * disposable by some EFI utilities (since EFI doesn't have a backup
     * slice).  V_UNASSIGNED is supposed to be used only for zero size
     * partitions, and efi_write() will fail if we use it.  V_ROOT, V_BOOT,
     * etc. were all pretty specific.  V_USR is as close to reality as we
     * can get, in the absence of V_OTHER.
     */
    vtoc->efi_parts[0].p_tag = V_USR;
    (void) strcpy(vtoc->efi_parts[0].p_name, "zfs");

and here we define the backup vdev label on the last usable LBA before our standard 8MB(?) reservation (16384 blocks) at the end of the disk and do the efi_write():
    vtoc->efi_parts[8].p_start = vtoc->efi_last_u_lba + 1 - resv;
    vtoc->efi_parts[8].p_size = resv;
    vtoc->efi_parts[8].p_tag = V_RESERVED;

    if (efi_write(fd, vtoc) != 0)
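
Putting the p_start / p_size assignments together, the two slices should tile the usable range exactly. A minimal sketch of that arithmetic follows; struct slice and layout_slices() are hypothetical stand-ins for illustration, not the dk_part / zpool_vdev.c code itself:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two dk_part fields used here. */
struct slice {
	uint64_t start;	/* starting LBA */
	uint64_t size;	/* size in blocks */
};

/*
 * Mirror the assignments quoted above: slice 0 takes the usable range
 * minus 'resv' blocks, and slice 8 takes the last 'resv' usable blocks.
 */
static void
layout_slices(uint64_t first_u_lba, uint64_t last_u_lba, uint64_t resv,
    struct slice *data, struct slice *reserved)
{
	assert(last_u_lba >= first_u_lba);
	assert(resv < last_u_lba + 1 - first_u_lba);

	data->start = first_u_lba;
	data->size = last_u_lba + 1 - first_u_lba - resv;

	reserved->start = last_u_lba + 1 - resv;
	reserved->size = resv;

	/* The data slice must end exactly where the reserved slice begins. */
	assert(data->start + data->size == reserved->start);
}
```

For example, with first usable LBA 34, last usable LBA 999966 and resv of 16384, slice 0 covers 983549 blocks starting at LBA34 and slice 8 starts at LBA983583.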

I'm thinking we should really define a GUID for ZFS and maybe do some better provisioning at the front end of the disk to be better aligned for full stripe write conditions .. with EFI we could use LBA34 through LBA1023 for vdev labels and other stuff, so that data writes start out aligned at LBA1024. There also looks to be an error(?) in the EFI reservation bits at the tail end of the disk, since I thought the EFI spec only needed 16KB for the backup partition entries and 512B for the backup GPT header .. not 16384 * 512B blocks .. for what it's worth, that's also been in the format utility for a while now, so I could be missing something on the methodology for the 8MB reservation at the tail end of the disk.
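
For comparison, here is the tail-end arithmetic spelled out. It assumes the same GPT defaults as the spec minimum (128 entries of 128 bytes plus one header block); the helper names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a block count to bytes. */
static uint64_t
blocks_to_bytes(uint64_t blocks, uint32_t lbasize)
{
	return (blocks * lbasize);
}

/*
 * Hypothetical helper: blocks needed at the end of the disk for the
 * backup partition entries plus the backup GPT header, assuming
 * 128 entries of 128 bytes each.
 */
static uint64_t
gpt_backup_blocks(uint32_t lbasize)
{
	assert(lbasize != 0);

	uint64_t table_bytes = 128u * 128u;	/* 16KB entry array */
	uint64_t table_blocks = (table_bytes + lbasize - 1) / lbasize;

	return (table_blocks + 1);		/* plus one header block */
}
```

With 512B blocks the backup GPT needs 33 blocks (16896 bytes), while the 16384-block reservation works out to 8MB, so the two differ by a factor of roughly 500.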

.je