On Thu, 04 Oct 2012 19:23:13 -0600 Shuah Khan <shuah.k...@hp.com> wrote:
> A recent dma mapping error analysis effort showed that a large percentage > of dma_map_single() and dma_map_page() returns are not checked for mapping > errors. > > Reference: > http://linuxdriverproject.org/mediawiki/index.php/DMA_Mapping_Error_Analysis > > Adding support for tracking dma mapping and unmapping errors to help assess > the following: > > When do dma mapping errors get detected? > How often do these errors occur? > Why don't we see failures related to missing dma mapping error checks? > Are they silent failures? > > Patch v4: Addresses extra tab review comments from Patch v3. > Patch v3: Addresses review and design comments from Patch v2. > Patch v2: Addressed design issues from Patch v1. > > Enhance dma-debug infrastructure to track dma mapping, and unmapping errors. > > map_errors: (system wide counter) > Total number of dma mapping errors returned by the dma mapping interfaces, > in response to mapping requests from all devices in the system. > map_errors_not_checked: (system wide counter) > Total number of dma mapping errors devices failed to check before using > the returned address. > unmap_errors: (system wide counter) > Total number of times devices tried to unmap or free an invalid dma > address. > map_err_type: (new field added to dma_debug_entry structure) > New field to maintain dma mapping error check status. This error type > is applicable to the dma map page and dma map single entries tracked by > dma-debug api. This status indicates whether or not a good mapping is > checked by the device before its use. dma_map_single() and dma_map_page() > could fail to create a mapping in some cases, and drivers are expected to > call dma_mapping_error() to check for errors. Please note that this is not > counter. > > Enhancements to dma-debug api are made to add new debugfs interfaces to > report total dma errors, dma errors that are not checked, and unmap errors > for the entire system. Please note that these are system wide counters for > all devices in the system. > > The following new dma-debug interface is added: > > debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr) > Sets dma map error checked status for the dma map entry if one is > found. Decrements the system wide dma_map_errors_not_checked counter > that is incremented by debug_dma_map_page() when it checks for > mapping error before adding it to the dma debug entry table. > > New dma-debug internal interface: > check_mapping_error(struct device *dev, dma_addr_t dma_addr) > Calling dma_mapping_error() from dma-debug api will result in dma mapping > check when it shouldn't. This internal routine checks dma mapping error > without any debug checks. > > The following existing dma-debug api are changed to support this feature: > debug_dma_map_page() > Increments dma_map_errors and dma_map_errors_not_checked errors totals > for the system, dma-debug api keeps track of, when dma_addr is invalid. > This routine now calls internal check_mapping_error() interface to > avoid doing dma mapping debug checks from dma-debug internal mapping > error checks. > check_unmap() > This is an existing internal routines that checks for various mapping > errors. Changed to increment system wide dma_unmap_errors, when a > device requests an invalid address to be unmapped. This routine now > calls internal check_mapping_error() interface to avoid doing dma > mapping debug checks from dma-debug internal mapping error checks. > > Changed arch/x86/include/asm/dma-mapping.h to call debug_dma_mapping_error() > to validate these new interfaces on x86_64. Other architectures will be > changed in a subsequent patch. > > Tested: Intel iommu and swiotlb (iommu=soft) on x86-64 with > CONFIG_DMA_API_DEBUG enabled and disabled. Still seems overly complicated to me, but whatev. I think the way to handle this is pretty simple: set a flag in the dma entry when someone runs dma_mapping_error() and, if that flag wasn't set at unmap time, emit a loud warning. >From my reading of the code, this patch indeed does that, along with a bunch of other (unnecessary?) stuff. But boy, the changelog conceals this information well! > Documentation/DMA-API.txt | 13 +++++ > arch/x86/include/asm/dma-mapping.h | 1 + > include/linux/dma-debug.h | 7 +++ > lib/dma-debug.c | 110 > ++++++++++++++++++++++++++++++++++-- Please, go through Documentation/DMA-API-HOWTO.txt with a toothcomb and ensure that all this is appropriately covered and that the examples are completed for error checking. > > ... > > +static inline int check_mapping_error(struct device *dev, dma_addr_t > dma_addr) > +{ > + const struct dma_map_ops *ops = get_dma_ops(dev); > + if (ops->mapping_error) > + return ops->mapping_error(dev, dma_addr); > + > + return (dma_addr == DMA_ERROR_CODE); > +} I'm not a fan of functions called check_foo() because I never know whether their return value means "foo is true" or "foo is false". It doesn't help that the return value here is undocumented. Names such as has_mapping_error() or mapping_has_error() will remove this question, thereby making the call sites more readable. > static void check_unmap(struct dma_debug_entry *ref) > { > struct dma_debug_entry *entry; > struct hash_bucket *bucket; > unsigned long flags; > > - if (dma_mapping_error(ref->dev, ref->dev_addr)) { > + if (unlikely(check_mapping_error(ref->dev, ref->dev_addr))) { > + unmap_errors += 1; > err_printk(ref->dev, NULL, "DMA-API: device driver tries " > "to free an invalid DMA memory address\n"); > return; > @@ -915,6 +975,15 @@ static void check_unmap(struct dma_debug_entry *ref) > dir2name[ref->direction]); > } > > + if (entry->map_err_type == MAP_ERR_NOT_CHECKED) { > + err_printk(ref->dev, entry, > + "DMA-API: device driver failed to check map error" > + "[device address=0x%016llx] [size=%llu bytes] " > + "[mapped as %s]", > + ref->dev_addr, ref->size, > + type2name[entry->type]); > + } It's important that this warning be associated with a backtrace so we can identify the offending call site in the usual fashion. err_printk() does include a backtrace under some circumstances, but those circumstances hurt my brain. Is it guaranteed that we'll have that backtrace? If not, I'd suggest making it so. > hash_bucket_del(entry); > dma_entry_free(entry); > > > ... > > +void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr) > +{ > + struct dma_debug_entry ref; > + struct dma_debug_entry *entry; > + struct hash_bucket *bucket; > + unsigned long flags; > + > + if (unlikely(global_disable)) > + return; > + > + ref.dev = dev; > + ref.dev_addr = dma_addr; > + bucket = get_hash_bucket(&ref, &flags); > + entry = bucket_find_exact(bucket, &ref); > + > + if (!entry) { > + /* very likley dma-api didn't call debug_dma_map_page() or > + debug_dma_map_page() detected mapping error */ > + if (map_errors_not_checked) > + map_errors_not_checked -= 1; > + goto out; > + } > + > + entry->map_err_type = MAP_ERR_CHECKED; > +out: > + put_hash_bucket(bucket, &flags); > +} > +EXPORT_SYMBOL(debug_dma_mapping_error); Well, it's a global, exported-to-modules symbol. Some formal documentation is appropriate for such things. > > ... > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/