Hi,

Background: I'm the developer of the TorchGeo library
(https://github.com/microsoft/torchgeo). TorchGeo is a machine learning
library that relies heavily on GDAL (via rasterio/fiona) for satellite
imagery I/O.

One of our primary concerns is ensuring that we can load data from disk fast
enough to keep the GPU busy during model training. Of course, satellite
imagery is often distributed in large files that make this challenging. We
use various tricks to optimize performance (COGs, windowed reading, caching,
compression, parallel workers, etc.). In our initial paper
(https://arxiv.org/abs/2111.08872), we chose to create our own ad hoc I/O
benchmarking dataset composed of 100 Landsat scenes and 1 CDL map. See
Figure 3 for the results and Appendix A for the experiment details.
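
As an illustration of the windowed-reading trick, here is a minimal sketch
using rasterio (the file name and chip size are placeholders, not TorchGeo's
actual implementation):

    import rasterio
    from rasterio.windows import Window

    # With a tiled COG, rasterio/GDAL only fetches the blocks that
    # intersect the requested window instead of decoding the whole scene.
    with rasterio.open("scene.tif") as src:  # hypothetical file
        window = Window(col_off=0, row_off=0, width=256, height=256)
        chip = src.read(window=window)  # shape: (bands, 256, 256)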

Question: is there an official dataset that the GDAL developers use to
benchmark GDAL itself? For example, if someone changes how GDAL handles
certain I/O operations, I assume the developers benchmark the change to see
whether I/O got faster or slower. I'm envisioning experiments similar to
those in
https://kokoalberti.com/articles/geotiff-compression-optimization-guide/
across various file formats, compression levels, block sizes, etc.
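
For concreteness, here is a minimal sketch of the kind of experiment I have
in mind, assuming rasterio; the source file and the creation options being
compared are placeholders, not an established benchmark:

    import time
    import rasterio
    from rasterio.shutil import copy as rio_copy

    SRC = "scene.tif"  # hypothetical source scene
    PROFILES = [  # creation options to compare
        {"compress": "deflate", "tiled": True,
         "blockxsize": 256, "blockysize": 256},
        {"compress": "lzw", "tiled": True,
         "blockxsize": 512, "blockysize": 512},
    ]

    for i, opts in enumerate(PROFILES):
        dst = f"variant_{i}.tif"
        rio_copy(SRC, dst, driver="GTiff", **opts)  # re-encode the scene
        start = time.perf_counter()
        with rasterio.open(dst) as f:
            f.read()  # time a full read of the re-encoded file
        print(dst, opts, f"{time.perf_counter() - start:.3f}s")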

If such a dataset doesn't yet exist, I would be interested in creating one
and publishing a paper on how it can be used to guide the development of
libraries like GDAL and TorchGeo.

Dr. Adam J. Stewart
Technical University of Munich
School of Engineering and Design
Data Science in Earth Observation
