Hi all,

I've been thinking about how to surface GDAL errors in a better way for Python programmers. I'm pretty sure that the approaches I'm looking at generalize to GDAL's Python bindings and other language bindings. As well, I'm wondering if we can't improve GDAL's internal error handling in some core code. I'd love some feedback on my reasoning and design ideas from any angle. For example, I know that there is some prior art in Thomas Bonfort's Go modules, and expect there is some in rgdal. Please let me know what you think.
I'm an author of the Rasterio project for Python. This project has a number of problems handling GDAL errors. Generally, rasterio only checks the GDAL error context after a GDAL function returns and so can only see the last error that was set. Deeper errors may leak out to stderr, but a Python programmer using rasterio can't do anything about them using Python language features like try/except.

This is a flaw in rasterio and stems from some naive analysis on my part about how errors are handled internally in GDAL. I assumed that functions in GDAL core and driver code consistently handle errors set by the functions they call and then set an error that describes exactly what a caller can do in the case of failure.

Consider OGRXLSDataSource::Open at https://github.com/OSGeo/gdal/blob/35c07b18316b4b6d238f6d60b82c31e25662ad27/ogr/ogrsf_frmts/xls/ogrxlsdatasource.cpp#L116-L118. The code resets the error context, pushes GDAL's silencing handler so that no other handlers (like GDAL's default which prints to stderr) receive error events, calls CPLRecode, and then executes more statements if CPLRecode set an error. This looks to me like GDAL's equivalent of what might be written in Python as

    try:
        CPLRecode(...)
    except:
        CPLGenerateTemporaryFilename(...)
        ...

In many ways, GDAL's error system is not unlike Python's at the C level. Python extension code that fails is supposed to set an error and return a particular value. When callers get that return value, they are to check for a set error and should either return with an error-indicating value (leaving the set error in place), or they can handle the error by clearing it and continuing, maybe setting a new error if recovery isn't possible. OGRXLSDataSource::Open does this. A rasterio user doesn't need to see farther into OGRXLSDataSource::Open than the last error set. GDAL and Python error reporting and handling are well aligned.

I see different behavior when rasterio calls GDALDatasetRasterIOEx to read data from a GeoTIFF.
The silencing handler is not used, so error events are printed to stderr, but callers set new errors on top of the previous ones. A rasterio user sees the deeper causes of I/O failure in their logs, but can't react to them in their programs without extra work to parse errors out of log messages.

Specifically, here's a snippet of errors printed to stderr that was provided by a rasterio user recently. These result from a call to GDALDatasetRasterIOEx.

ERROR 1: TIFFFillTile:No space for data buffer at scanline 4294967295
ERROR 1: TIFFReadEncodedTile() failed.
ERROR 1: /home/ubuntu/Documents/CDL_tiffs/2015_30m_cdls.tif, band 1: IReadBlock failed at X offset 189, Y offset 60: TIFFReadEncodedTile() failed.

"IReadBlock failed" is the last error set before GDALDatasetRasterIOEx returns and is the only one that rasterio can currently surface as a Python exception. It's specific about the block address at which a problem occurred, but vague about the nature of the root problem. Was it a codec error? Was it a memory allocation error? In this case it's a memory allocation error. The user found that they could retry data reads and get results the next time, presumably after their program's memory footprint shrinks sufficiently. What if we could surface enough error detail to a user that they could determine whether they could retry a read or not?

In https://github.com/rasterio/rasterio/pull/2526/files#diff-a263c7288922a4c1ffd8318c15dfd3332babeb13edc7023662cb8cd7d69643b5R219 I am testing a hypothesis that the three consecutive, related errors above might be usefully surfaced to a Python programmer in a chain of exceptions. I've written a thing that records GDAL error events (intercepting them before they go to stderr), links them together, and then raises the last one. A Python programmer can catch RasterioIOError (what is raised in the "IReadBlock failed" case) and in handling that exception can follow the chain.
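To make the record, link, and raise idea concrete, here is a minimal pure-Python sketch. It is not the code in the pull request and it does not call GDAL at all: the failing read is simulated, and every name in it (RecordedGDALError, ErrorRecorder, ReadError, simulated_read, causes, is_retryable) is hypothetical. The retry test by message substring is only an illustrative assumption based on the "No space for data buffer" message quoted above.

```python
CE_FAILURE = 3        # value of GDAL's CE_Failure error class
CPLE_APP_DEFINED = 1  # value of GDAL's CPLE_AppDefined error number


class RecordedGDALError(Exception):
    """Hypothetical stand-in for one recorded GDAL error event."""

    def __init__(self, errclass, errno, message):
        super().__init__(message)
        self.errclass = errclass
        self.errno = errno


class ReadError(Exception):
    """Hypothetical stand-in for the exception a binding raises to users."""


class ErrorRecorder:
    """Collects error events in the order they are emitted."""

    def __init__(self):
        self.events = []

    def handler(self, errclass, errno, message):
        # A real binding would install a callback like this as the GDAL
        # error handler so events are captured instead of printed.
        self.events.append(RecordedGDALError(errclass, errno, message))

    def chain(self):
        """Link events oldest-first via __cause__ and return the last one."""
        last = None
        for exc in self.events:
            exc.__cause__ = last
            last = exc
        return last


def simulated_read(recorder):
    """Simulated failing read that emits the three errors quoted above."""
    for message in (
        "TIFFFillTile:No space for data buffer at scanline 4294967295",
        "TIFFReadEncodedTile() failed.",
        "band 1: IReadBlock failed at X offset 189, Y offset 60: "
        "TIFFReadEncodedTile() failed.",
    ):
        recorder.handler(CE_FAILURE, CPLE_APP_DEFINED, message)
    return CE_FAILURE


def read():
    """Run the simulated read and raise a chained exception on failure."""
    recorder = ErrorRecorder()
    if simulated_read(recorder) == CE_FAILURE:
        last = recorder.chain()
        raise ReadError("Read or write failed. {}".format(last)) from last


def causes(exc):
    """Yield exc, then each exception reachable through __cause__."""
    while exc is not None:
        yield exc
        exc = exc.__cause__


def is_retryable(exc):
    """Illustrative heuristic: treat the buffer allocation message as transient."""
    return any("No space for data buffer" in str(e) for e in causes(exc))


try:
    read()
except ReadError as err:
    for link in causes(err):
        print(type(link).__name__, "-", link)
    print("retry?", is_retryable(err))
```

The caller only needs to catch the outermost exception. Walking __cause__ is what lets a program decide whether to retry without parsing log output.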
At the very least, my experiment will show "CPLE_AppDefinedError: TIFFFillTile:No space for data buffer at scanline 4294967295" in Python tracebacks, which could be a big help for rasterio users who are debugging. Information that would otherwise be only in their logs would now be in the traceback. For example, here is the traceback we can get when trying to read a deliberately corrupted COG:

(venv) seangillies@PF3675VY:~/projects/rasterio$ rio insp tests/data/corrupt.tif
Rasterio 1.4dev Interactive Inspector (Python 3.8.10)
Type "src.meta", "src.read(1)", or "help(src)" for more information.
>>> src.read()
rasterio._err.CPLE_AppDefinedError: TIFFFillTile:Read error at row 512, col 0, tile 3; got 38232 bytes, expected 47086

The above exception was the direct cause of the following exception:

rasterio._err.CPLE_AppDefinedError: TIFFReadEncodedTile() failed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "rasterio/_io.pyx", line 934, in rasterio._io.DatasetReaderBase._read
    io_multi_band(self._hds, 0, xoff, yoff, width, height, out, indexes_arr, resampling=resampling)
  File "rasterio/_io.pyx", line 166, in rasterio._io.io_multi_band
    with stack_errors():
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "rasterio/_err.pyx", line 245, in stack_errors
    raise last
rasterio._err.CPLE_AppDefinedError: /home/seangillies/projects/rasterio/tests/data/corrupt.tif, band 1: IReadBlock failed at X offset 1, Y offset 1: TIFFReadEncodedTile() failed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "rasterio/_io.pyx", line 610, in rasterio._io.DatasetReaderBase.read
    out = self._read(indexes, out, window, dtype, resampling=resampling)
  File "rasterio/_io.pyx", line 937, in rasterio._io.DatasetReaderBase._read
    raise RasterioIOError("Read or write failed. {}".format(cplerr)) from cplerr
rasterio.errors.RasterioIOError: Read or write failed. /home/seangillies/projects/rasterio/tests/data/corrupt.tif, band 1: IReadBlock failed at X offset 1, Y offset 1: TIFFReadEncodedTile() failed.

I think this could make communication in the Rasterio issue tracker much more productive. More information about the causes of an error is right there in the traceback instead of being split between traceback and stderr (or other log stream). It could at least eliminate one round of asking for more error detail in a bug report. And there's the ability to catch an exception and go up the chain in code, potentially very powerful when you need it.

The effectiveness of this error recorder and chainer could depend on how many different styles of error handling exist in GDAL. I've pointed out two kinds above. In OGRXLSDataSource::Open, we have error handling that actively prevents error events from being emitted until the function gives up on trying to handle errors. In IReadBlock, there doesn't seem to be any such error handling involving the GDAL error context. I believe we've seen cases of GDAL functions that set errors while returning a success error code. It's possible that some functions return a failed error code while not setting any error. Lots of different cases could make the error recording and chaining approach fruitless.

Are there other styles or paradigms in use? Are there GDAL modules that will challenge the assumptions that I'm making as I write my error recorder? If you know of any, I'd love to hear about them.

Here are a few links for reference:

* Exception handling in Python's C API: https://docs.python.org/3/c-api/exceptions.html#exception-handling (I feel like GDAL could use some documentation like this).
* Python exception chaining: https://docs.python.org/3/tutorial/errors.html#exception-chaining
* On the difference between an exception raised while handling and "raise from": https://blog.ram.rachum.com/post/621791438475296768/improving-python-exception-chaining-with

--
Sean Gillies
_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev