Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
This information is Copyright 2009 Sun Microsystems

1. Introduction
    1.1. Project/Component Working Name:
         bd - generic block device driver
    1.2. Name of Document Author/Supplier:
         Author:  Garrett D'Amore
    1.3. Date of This Document:
         29 November, 2009

4. Technical Description

Background
----------

There are a number of storage devices which express a simple block
oriented architecture, but which are not truly SCSI devices.  Examples
of such devices are various flash media (e.g. SDcard, CF, and Memory
Stick) and, more recently, storage adapters like the DDRdrive X1
(www.ddrdrive.com).

These devices are not natively SCSI, and do not understand the SCSI
command set on their own.  As part of PSARC 2007/654, we introduced a
translation layer (blk2scsa) which processes SCSI packets and allows
these devices to be presented on a logical SCSI bus so that they can be
supported by sd(7D).

While this approach has so far met with some success, experience has
shown that it adds significant complexity to the system, with
consequent impacts on performance, diagnosability, and maintainability.
Creating a SCSI packet (done by sd(7D)) only to parse it in software
later in the HBA is inefficient.

We therefore would like to introduce a new block device driver (bd), to
be used instead of blk2scsa, in order to simplify the system and
improve overall performance with fewer total lines of supporting code.

Because we might in the future like to offer support for some of these
storage adapters on Solaris 10, we are requesting Patch binding,
although we have no specific plans to backport at this time.

Architecture
------------

The "bd" driver will be used as a block-oriented device driver for
devices that need general block device support.  Adapter device drivers
will depend on this driver (-N drv/bd) and, using functions supplied by
it (described below in "Block DDI"), act as nexus drivers with bd
leaves.

bd itself supports labeling by importing the cmlb common labeling code,
so these devices can support all of the same labeling conventions as
magnetic SCSI disk devices.  bd also supports the necessary dkio(7I)
ioctls.
Additionally, bd exports a new controller type in the dk_cinfo
structure (used with DKIOCINFO), DKC_BD (#defined to value 24 in our
current prototype, although this may change if the value is used by
another project before we integrate).  This new controller type is used
to enable the use of a new plugin for libsmedia, sm_bd.so.1, which
provides basic functionality for bd targets.

bd has support for breaking large transfers up into smaller ones using
partial DMA, or even for PIO style devices, so that adapter drivers
need not concern themselves with this particular complexity.

bd manages DMA mapping (if required) on behalf of the adapter driver,
providing a scatter/gather list of DMA cookies to the driver as part of
each job (if the adapter driver supports DMA).

Assumptions and Limitations
---------------------------

bd targets may be hotpluggable, and may be removable.  There is no
support in this integration for door lock, media load, or ejection
mechanisms.

bd targets are assumed to have linear addressability, and a fixed
512-byte block size.  (Adapter drivers that require a different native
block size may use read-modify-write if necessary.)

bd is not optimized for rotating media.  (sd, ssd, and cmdk are better
suited to such media.)

bd supports devices with an arbitrarily deep queue size, although the
queue size itself is a fixed value determined at device registration.
This allows for a very simple flow control model.

bd provides no reprioritization.  Jobs are submitted to the adapter
device driver in the order received.  However, they may be completed by
the adapter driver in any order that is convenient for the adapter.

bd lacks support for request cancellation.  Once a job is submitted, it
either completes or fails.

bd lacks support for configurable timeouts.  Once a job is submitted,
it stays in the queue until it is serviced by the adapter driver.  The
adapter driver may elect to use a watchdog mechanism to provide
timeouts at its own discretion.
The adapter driver is responsible for choosing an appropriate value for
the timeout, if any is used.

bd lacks any support in this integration for write cache management.
If the adapter has a write cache, the adapter driver is wholly
responsible for managing it "reasonably" and safely.

bd assumes that if an adapter supports multiple bd targets, a simple
integer index is sufficient to address each one.

bd assumes the adapter driver will manage suspend/resume safely with
respect to job submission.  bd takes no special actions on suspend or
resume; that is up to the adapter driver to manage.

bd assumes that adapter devices are able to manage their own power
without the need for help from the framework.  Since current bd media
have negligible startup costs (no spin-up time), this is easy enough.
(Nothing prevents a driver with a higher startup cost from making use
of the power(9E) framework to reduce thrashing on spin-up or
spin-down.)

The current prototype has an API for supporting hosting of crash dumps,
but does not yet implement the dump(9E) support required.  We may not
get to doing this before integration.

Consumers
---------

Initially, the "bd" prototype will deliver with a separate driver for
the DDRdrive X1 solid state storage device.  That driver will be
discussed in a separate PSARC case of its own which will depend on this
case.

As part of our prototype, we have also converted the SDcard memory card
support to use bd instead of blk2scsa, which could potentially allow
blk2scsa itself to be EOF'd.  This effort will be discussed in a
separate case as well, which will depend on this one.

Block DDI
---------

The following describes the DDI used by block device drivers.

The following header must be included by all bd adapter drivers.

    #include <sys/bd.h>

The following type is exposed to adapter drivers, and represents an
opaque handle for a bd_target device.

    typedef struct bd_handle *bd_handle_t;	/* opaque */

Adapter driver entry points are supplied via the following structure:

    typedef struct bd_ops {
        int     o_version;
        void    (*o_drive_info)(void *, bd_drive_t *);
        int     (*o_media_info)(void *, bd_media_t *);
        int     (*o_read)(void *, bd_xfer_t *);
        int     (*o_write)(void *, bd_xfer_t *);
        int     (*o_dump)(void *, bd_xfer_t *);
    } bd_ops_t;

This structure is supplied by the adapter during handle allocation (see
bd_alloc_handle() below).

The o_version field must be set by the adapter to BD_OPS_VERSION_0, and
may be used to support versioning of the DDI in the future.

The o_drive_info() entry point describes the logical drive.  The first
argument (void *) is a pointer to the driver soft state supplied at
handle allocation time.  The second argument is a pointer to a
structure with the following definition:

    struct bd_drive {
        uint32_t        d_qsize;
        uint32_t        d_maxxfer;
        uint64_t        d_wwn;
        boolean_t       d_removable;
        boolean_t       d_hotpluggable;
    };

The d_qsize indicates the depth of the job request queue.  The
d_maxxfer, if non-zero, represents the largest transfer that can be
processed by the device.  The d_wwn, if non-zero, represents a SCSI-3
style WWN for use in creating a devid.  The remaining elements describe
the capabilities of the device.

The o_media_info() entry point describes the current media in the
drive.  (Drives with non-removable media will always return the same
values here.)  The media description is as follows:

    typedef struct bd_media {
        uint64_t        m_nblks;
        boolean_t       m_readonly;
    } bd_media_t;

The m_nblks is the total number of addressable blocks for the media,
and the m_readonly indicates non-writable media if true.

The o_read() and o_write() entry points are used to handle a read or
write transfer.  The o_read or o_write function returns either 0 on
success, or an errno.  If 0 is returned, then the adapter is
responsible for calling bd_xfer_done() asynchronously when the transfer
is finished (whether successfully or not).
The adapter driver MUST NOT call bd_xfer_done() on a request for which
the entry point returned an errno.  The adapter driver MUST NOT call
bd_xfer_done() synchronously from this function (else recursive lock
entry will result).

The bd_xfer_t type has the following public members:

    typedef struct bd_xfer {
        daddr_t                 x_blkno;
        size_t                  x_nblks;
        ddi_dma_handle_t        x_dmah;
        ddi_dma_cookie_t        x_dmac;
        unsigned                x_ndmac;
        caddr_t                 x_kaddr;
    } bd_xfer_t;

The x_blkno is the logical block address that the transfer starts at,
and the x_nblks member is the total number of blocks to be transferred
(it will always be a positive value).

If the device supports DMA, the x_dmah, x_dmac, and x_ndmac describe
the DMA transfer.  If x_ndmac is larger than 1, then the adapter driver
must use ddi_dma_nextcookie(9F) to obtain the DMA cookie for the next
entry in the scatter/gather list.

If the adapter device does not support DMA, the x_kaddr is the kernel
virtual address for the transfer.

The o_dump() entry point is used to write blocks to a disk
synchronously, in support of dump(9E).  It may not block or use
interrupts, and may not call bd_xfer_done().  Instead, it simply
returns 0 or an errno when the transfer is complete.

The following functions may be called by the adapter driver:

    bd_handle_t bd_alloc_handle(dev_info_t *dip, unsigned addr,
        void *private, bd_ops_t *ops, ddi_dma_attr_t *attr);

Allocates a handle for a target bd device.  The dip is for the adapter
device (parent).  The addr is the address or index of the bd target on
the adapter.  (Adapters that only support single targets should
probably supply 0 here.)  The private is a pointer to driver state that
is supplied to the entry points in the ops vector.  The attr describes
the DMA capabilities of the adapter (for transfers).  If the adapter
driver does not use DMA, then NULL may be supplied for attr.  This
function may be called in user or kernel context only.

    void bd_free_handle(bd_handle_t handle);

Frees a previously allocated handle.
Note that it is an error to free a handle that is attached.  May be
called in user or kernel context only.

    int bd_attach_handle(bd_handle_t handle);

Attaches a handle, creating a node in the device tree for the bd target
device, and attaching its driver.  Returns DDI_SUCCESS on success or
DDI_FAILURE on failure.  May be called in user or kernel context only.

    int bd_detach_handle(bd_handle_t handle);

Detaches a handle from the system, normally as part of a DDI_DETACH
operation or as part of a hotplug operation.  Returns DDI_SUCCESS on
success or DDI_FAILURE on failure.  May be called in user or kernel
context only.

    void bd_state_change(bd_handle_t handle);

Indicates a state change (media removal or insertion) occurred for the
given handle.  Only useful for removable media.  May be called in
kernel, user, or interrupt context.  Caller must not hold any locks.

    void bd_xfer_done(bd_xfer_t *xfer, int result);

Called by the adapter when the named transfer (xfer) is complete.  The
result is 0 for a successful transfer, or an errno if the transfer
failed.  May be called in kernel, user, or interrupt context.  Caller
must not hold any locks.

    void bd_mod_init(struct dev_ops *);
    void bd_mod_fini(struct dev_ops *);

Called by the adapter driver to configure its dev_ops structure during
_init(9E), or to deconfigure it during _fini(9E).

Imported Interfaces
-------------------

Interface       Stability               Comments

cmlb            Consolidation Private   Disk labelling support.
                                        Includes the misc/cmlb module
                                        and the cmlb API.
nexus NDI       Consolidation Private   Needed for nexus device support.
libsmedia       Consolidation Private   Generic storage media library;
                                        we import interfaces to supply
                                        a plugin.

Exported Interfaces
-------------------

Interface       Stability               Comments

bd(7D)          Committed               bd device driver.
dkio(7I)        Committed               Standard ioctls for disks.
DKC_BD          Committed               New dkio controller type.
Block DDI       Consolidation Private   Used by adapter drivers.
                                        See Block DDI above.
sm_bd.so.1      Consolidation Private   libsmedia plugin.  Both 32 and
                                        64-bit versions.

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
               ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open
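
The completion rules for o_read() and o_write() given under Block DDI
can be illustrated with a small user-space sketch.  Nothing below is
the real <sys/bd.h>: the bd_xfer_t here is a cut-down stand-in, the
x_done and x_result fields, the toy_softc_t state, and the toy_read()
and toy_intr() functions are all hypothetical names invented for the
illustration, and bd_xfer_done() is a stub that only records its
argument.  It is meant to show the shape of the contract (reject with
an errno and never complete, or accept with 0 and complete later), not
how an actual adapter driver is written.

```c
/*
 * User-space sketch of the o_read()/bd_xfer_done() contract.
 * All types are simplified stand-ins, NOT the real <sys/bd.h>.
 */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for bd_xfer_t; x_done and x_result are hypothetical. */
typedef struct bd_xfer {
    int64_t x_blkno;    /* starting logical block */
    size_t  x_nblks;    /* number of blocks, always positive */
    int     x_done;     /* hypothetical: times completed */
    int     x_result;   /* hypothetical: recorded result */
} bd_xfer_t;

/* Stub for the framework's bd_xfer_done(); just records the result. */
static void
bd_xfer_done(bd_xfer_t *xfer, int result)
{
    assert(xfer->x_done == 0);  /* a job is completed at most once */
    xfer->x_done++;
    xfer->x_result = result;
}

/* Hypothetical adapter soft state with a single pending slot. */
typedef struct toy_softc {
    uint64_t   sc_nblks;    /* media size in blocks */
    bd_xfer_t *sc_pending;  /* accepted but not yet completed */
} toy_softc_t;

/* o_read(): validate and queue the job; never complete it here. */
static int
toy_read(void *arg, bd_xfer_t *xfer)
{
    toy_softc_t *sc = arg;

    if (xfer->x_blkno < 0 ||
        (uint64_t)xfer->x_blkno + xfer->x_nblks > sc->sc_nblks)
        return (EINVAL);    /* rejected: bd_xfer_done() is not called */
    if (sc->sc_pending != NULL)
        return (EBUSY);     /* defensive: the single slot is occupied */

    sc->sc_pending = xfer;
    return (0);             /* accepted: completion happens later */
}

/* Stands in for the interrupt handler that finishes the pending job. */
static void
toy_intr(toy_softc_t *sc)
{
    bd_xfer_t *xfer = sc->sc_pending;

    if (xfer == NULL)
        return;
    sc->sc_pending = NULL;
    bd_xfer_done(xfer, 0);  /* asynchronous completion */
}
```

In this sketch a caller would pass a toy_softc_t as the private
argument, call toy_read() to submit, and later call toy_intr() to
simulate the device finishing; the job is completed only on the second
step, never from inside toy_read() itself.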
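
The Architecture section states that bd breaks large transfers into
smaller ones so that adapter drivers need not.  The arithmetic behind
such a split might look like the following sketch.  It is illustrative
only: split_xfer() and chunk_t are hypothetical names, the sketch
assumes d_maxxfer is expressed in bytes (this document does not state
the unit), and it ignores the partial DMA machinery that the real
driver would use.

```c
/*
 * Sketch of dividing one request into device-sized pieces.
 * split_xfer() and chunk_t are hypothetical; they are not part of
 * the Block DDI.  Assumes the transfer limit is given in bytes.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BD_BLKSIZE 512u     /* fixed block size assumed by bd */

typedef struct chunk {
    uint64_t c_blkno;       /* first block of this piece */
    size_t   c_nblks;       /* number of blocks in this piece */
} chunk_t;

/*
 * Describe the pieces covering blocks [blkno, blkno + nblks).
 * maxxfer is the per-transfer limit in bytes; 0 means no limit.
 * At most max_out entries of out[] are written.  Returns the number
 * of pieces required, which may exceed max_out.
 */
static size_t
split_xfer(uint64_t blkno, size_t nblks, uint32_t maxxfer,
    chunk_t *out, size_t max_out)
{
    size_t per = (maxxfer == 0) ? nblks : maxxfer / BD_BLKSIZE;
    size_t n = 0;

    assert(nblks > 0);      /* a transfer is never empty */
    assert(per > 0);        /* limit must hold at least one block */

    while (nblks > 0) {
        size_t len = (nblks < per) ? nblks : per;

        if (n < max_out) {
            out[n].c_blkno = blkno;
            out[n].c_nblks = len;
        }
        n++;
        blkno += len;
        nblks -= len;
    }
    return (n);
}
```

With a 2048-byte limit, for example, a 10-block request starting at
block 100 is described as three pieces of 4, 4, and 2 blocks; with a
limit of 0 the request is passed through as a single piece.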