Michael,
I don't think this would be a frmts/raw driver, but rather a
/vsikerchunk virtual file system that you would combine with the Zarr
driver. You would open a dataset with "/vsikerchunk/{path/to.json}",
and the ZARR driver would then issue a ReadDir() operation on
/vsikerchunk/{path/to.json}, which would return the top-level keys of
the JSON. Then the Zarr driver would issue an Open() operation on
"/vsikerchunk/{path/to.json}/.zmetadata", and so on. The Zarr driver
could remain essentially unmodified. I believe this is essentially how
the Python implementation works when combining the Kerchunk-specific
part with the Python Zarr module (except that it passes file system
objects rather than strings).
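To illustrate the lookup such a layer would perform, here is a minimal Python sketch against a Kerchunk v0-style reference mapping (the file name, keys, and byte ranges below are made up for the example):

```python
import base64

# A minimal Kerchunk v0 reference set: each key is a Zarr object name,
# and each value is either inline content (possibly base64-encoded) or
# a [url, offset, length] triple pointing into an existing file.
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "temp/.zarray": '{"shape": [10, 10], "chunks": [10, 10]}',  # abridged
    "temp/0.0": ["data/file1.nc", 4096, 1024],
}

def read_key(refs, key):
    """Resolve one Zarr key: return inline bytes, or read the
    (offset, length) byte range from the referenced file."""
    val = refs[key]
    if isinstance(val, str):
        if val.startswith("base64:"):
            return base64.b64decode(val[len("base64:"):])
        return val.encode()
    url, offset, length = val
    with open(url, "rb") as f:
        f.seek(offset)
        return f.read(length)

# ReadDir() analog: list the top-level entries of the reference set.
print(sorted({k.split("/")[0] for k in refs}))
```

A /vsikerchunk implementation would essentially do the same resolution, but with VSI file handles underneath so that the referenced files can themselves live on /vsicurl/, /vsis3/, etc.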
Where things get less pretty is for big datasets, where that JSON file
can become so large that parsing it and holding it in memory becomes an
annoyance. They have apparently come to using a hierarchy of Parquet
files to store the references to the blocks:
https://fsspec.github.io/kerchunk/spec.html#parquet-references . That's
becoming a bit messy, but should be implementable.
There are also subtleties in Kerchunk v1, such as Jinja substitution
and key generators, all tricks to decrease the size of the JSON, that
would complicate an implementation.
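For illustration, a v1 reference file can factor a repeated URL prefix out into a "templates" section. The sketch below (made-up bucket name; plain string replacement standing in for full Jinja rendering, which a real reader would need) shows the kind of expansion an implementation would have to perform:

```python
import json

# Simplified Kerchunk v1 document: "templates" hold values that
# references pull in via "{{name}}" placeholders in their URLs.
doc = json.loads("""
{
  "version": 1,
  "templates": {"u": "s3://bucket/path"},
  "refs": {
    "temp/0.0": ["{{u}}/file1.nc", 4096, 1024],
    "temp/0.1": ["{{u}}/file2.nc", 4096, 1024]
  }
}
""")

def expand(doc):
    """Expand template placeholders in every reference URL. Kerchunk
    specifies Jinja; simple substitution covers only this basic case."""
    out = {}
    for key, val in doc["refs"].items():
        if isinstance(val, list):
            url = val[0]
            for name, repl in doc["templates"].items():
                url = url.replace("{{%s}}" % name, repl)
            val = [url] + val[1:]
        out[key] = val
    return out

print(expand(doc)["temp/0.0"])
```

A complete implementation would also have to handle the "gen" section, which produces whole families of keys from one parameterized rule rather than listing them one by one.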
On Kerchunk itself, I don't have any experience, but I feel there might
be limitations to what it can handle due to the underlying raster
formats. For example, if you have a GeoTIFF file using JPEG compression,
with the quantization tables stored in the TIFF JpegTables tag
(i.e. shared by all tiles), which is the formulation GDAL would use by
default on creation, then I don't see how Kerchunk can deal with that,
since a tile would then be two distinct chunks in the file, and the
recombination is slightly more complicated than just appending them
together before passing them to a JPEG codec. Similarly, if you wanted
to Kerchunk a GeoPackage raster, you couldn't, because a single tile in
SQLite3 generally spans multiple SQLite3 pages (of size 4096), with a
few "header" bytes at the beginning of each tile. For GRIB2, there are
certainly limitations for some formulations, because some GRIB2 array
encodings are really particular; it must work only with the simplest
raw encodings.
Kerchunk can potentially do virtual tiling, but I believe that all tiles
must have the same dimensions, and their internal tiling must be a
multiple of those dimensions, so that a Zarr-compatible representation
of them can be created.
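The alignment constraint can be stated concretely: a source tile can only become a Zarr chunk if its offset in the virtual array falls exactly on the chunk grid. A hypothetical validity check (the function and variable names are made up):

```python
def tile_to_chunk_key(var, y0, x0, chunk_shape):
    """Map a source tile at array offset (y0, x0) to a Zarr chunk key.
    The offset must be a multiple of the (uniform) chunk shape,
    otherwise no Zarr-compatible representation exists."""
    cy, cx = chunk_shape
    if y0 % cy or x0 % cx:
        raise ValueError("tile does not align with the Zarr chunk grid")
    return "%s/%d.%d" % (var, y0 // cy, x0 // cx)

print(tile_to_chunk_key("temp", 512, 768, (256, 256)))  # temp/2.3
```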
And obviously one strong assumption of Kerchunk is that the files
referenced by a Kerchunk index are immutable. If, for some reason,
tiles are moved internally because of updates, chaos will arise due to
(offset, size) tuples being out of sync.
Even
On 24/07/2024 at 00:37, Michael Sumner via gdal-dev wrote:
Hi, is there any effort or thought into something like Python's
kerchunk in GDAL? (my summary of kerchunk is below)
https://github.com/fsspec/kerchunk
I'll be exploring the python outputs in detail and looking for hooks
into where we might bring some of this tighter into GDAL. This would
work nicely inside the GTI driver, for example. But a
*kerchunk-driver*? That would be in the family of raw/ drivers; my
skillset won't have much to offer, but I'm going to explore with some
simpler examples. It could even bring old HDF4 files into the fold,
I think.
It's a bit weird from a GDAL perspective to map the chunks in a format
for which we have a driver, but there's definitely performance
advantages and convenience for virtualizing huge disparate collections
(even the simplest time-series-of-files in netcdf is nicely abstracted
here for xarray, a super-charged VRT for xarray).
Interested in any thoughts, feedback, pointers to related efforts ...
thanks!
(my take on) A description of kerchunk:
kerchunk replaces the actual binary blobs on file in a Zarr with JSON
references to a file/uri/object and the byte start and end values; in
this way kerchunk brings formats like hdf/netcdf/grib into the fold of
"cloud readiness" by completely separating metadata from the actual
storage. The information about those chunks (compression, type,
orientation, etc.) is stored in JSON also.
(A Zarr is a multidimensional version of single-zoom-level image
tiling: imagine every image tile as a potentially n-dimensional child
block of a larger array. The blobs are stored like one zoom of a
z/y/x tile server, in a [[[v/]w/]y/]x way (with a position for each
dimension of the array: 1, 2, 3, 4, or n, and z is not special), and
with more general encoding possibilities than tif/png/jpeg provide.)
This scheme is extremely general, literally a virtualized array-like
abstraction on any storage, and with kerchunk you can transcend many
legacy issues with actual formats.
Cheers, Mike
--
Michael Sumner
Research Software Engineer
Australian Antarctic Division
Hobart, Australia
e-mail: mdsum...@gmail.com
_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev
--
http://www.spatialys.com
My software is free, but my time generally not.