Re: [gdal-dev] [BULK] Re: [EXTERNAL] Re: GTiff bit shuffle compression feature request

2023-12-08 Thread Rahkonen Jukka via gdal-dev
Hi,

Could Zarr be used as a SOZip-ed archive? See https://gdal.org/programs/sozip.html

-Jukka Rahkonen-


Re: [gdal-dev] [BULK] Re: [EXTERNAL] Re: GTiff bit shuffle compression feature request

2023-12-08 Thread Meyer, Jesse R. (GSFC-618.0)[SCIENCE SYSTEMS AND APPLICATIONS INC] via gdal-dev
Unfortunately Zarr has a design choice that won’t work for us: blocks are 
individual files on a file system.  Our datasets are massive, and this would 
exhaust our inode allocations.  While we could archive the folder into a zip 
archive, that adds a step for anyone who wants to work with the data.  
Curiously, this sparse-friendly representation seems totally baked into the 
format and there’s no way to opt out.  I’m not quite ready to share 
compression ratio findings, but initial results are consistent with my 
expectations.  Bitshuffle is effective for our data and works very well with 
another Zarr feature, “DELTA_DTYPE”, which _can_ be a form of lossless 
compression if the max delta is known.
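
To illustrate why a delta filter can be lossless when the maximum delta is 
known, here is a minimal Python sketch.  It is not the Zarr or GDAL 
implementation; the names `delta_encode` and `delta_decode` are hypothetical, 
and storing the deltas as signed 8-bit integers is an assumption made only for 
this example.

```python
import struct


def delta_encode(values, fmt="b"):
    """Return the first value and the successive differences packed as `fmt`.

    `fmt` is a struct type code; "b" (signed 8-bit) is an assumption for
    this sketch.  struct.pack raises struct.error if a delta does not fit,
    which is the "max delta must be known" condition.
    """
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], struct.pack(f"<{len(deltas)}{fmt}", *deltas)


def delta_decode(first, packed, fmt="b"):
    """Rebuild the original values by accumulating the stored deltas."""
    count = len(packed) // struct.calcsize(fmt)
    values = [first]
    for delta in struct.unpack(f"<{count}{fmt}", packed):
        values.append(values[-1] + delta)
    return values


if __name__ == "__main__":
    samples = [1000, 1003, 1001, 1010, 990]  # hypothetical 16-bit pixel values
    first, packed = delta_encode(samples)
    print(len(packed), delta_decode(first, packed) == samples)
```

If any difference falls outside the chosen delta type, the packing step fails 
rather than silently wrapping, so the round trip stays exact only while the 
bound on the delta holds.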

I understand not wanting to make third-party TIFF compliance worse than it 
already is from the GDAL project perspective.  That risk would be minimized if 
the functionality were added to libtiff proper, so any project that depends on 
libtiff could benefit from the enhancement.

From: gdal-dev on behalf of "Meyer, Jesse R. (GSFC-618.0)[SCIENCE SYSTEMS AND 
APPLICATIONS INC] via gdal-dev"
Reply-To: "Meyer, Jesse R. (GSFC-618.0)[SCIENCE SYSTEMS AND APPLICATIONS INC]"
Date: Friday, December 8, 2023 at 12:40 PM
To: Even Rouault, gdallists
Subject: [BULK] Re: [gdal-dev] [EXTERNAL] Re: GTiff bit shuffle compression 
feature request

Thanks for the suggestion Even, we’ll see how effective Zarr is for our 
datasets.

Jesse

From: Even Rouault
Date: Friday, December 8, 2023 at 12:20 PM
To: "Meyer, Jesse R. (GSFC-618.0)[SCIENCE SYSTEMS AND APPLICATIONS INC]", 
gdallists
Subject: [EXTERNAL] Re: [gdal-dev] GTiff bit shuffle compression feature request


Jesse,

This would break interoperability with other TIFF readers... Even adding a new 
TIFF tag to advertise that bit shuffling is applied would probably not be a 
sufficient guard, as existing readers wouldn't read it and would just display 
garbage, which is worse than not being able to open the file at all. The only 
way I can think of to do that safely would be to use new values for 
the Compression tag, which isn't pretty either.

You should probably try Zarr which has such capability with the Blosc codec. Cf 
https://gdal.org/drivers/raster/zarr.html : BLOSC_SHUFFLE
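
As a rough sketch of trying this from Python, the snippet below passes the 
BLOSC_SHUFFLE creation option from the driver page above to `gdal.Translate`.  
The helper names and file paths are hypothetical, and it assumes a GDAL build 
with the Zarr driver and Blosc support; BLOSC_SHUFFLE=BIT is one of the values 
documented there.

```python
def zarr_bitshuffle_options():
    """Creation options asking the Zarr driver for Blosc with bit shuffling."""
    return ["COMPRESS=BLOSC", "BLOSC_SHUFFLE=BIT"]


def convert_to_zarr(src_path, dst_path):
    """Convert a raster to Zarr (hypothetical helper, needs the GDAL bindings)."""
    # Imported lazily so the options helper works without GDAL installed.
    from osgeo import gdal

    gdal.UseExceptions()
    return gdal.Translate(
        dst_path,
        src_path,
        format="ZARR",
        creationOptions=zarr_bitshuffle_options(),
    )


if __name__ == "__main__":
    # "input.tif" and "output.zarr" are placeholder paths for this sketch.
    convert_to_zarr("input.tif", "output.zarr")
```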

I'm curious, however, to know what typical compression gain you get with that.

Even


On 08/12/2023 at 18:06, Meyer, Jesse R. (GSFC-618.0)[SCIENCE SYSTEMS AND 
APPLICATIONS INC] via gdal-dev wrote:
Hi,

When using horizontal differencing to reduce the numerical range of band data, 
the upper bytes in the produced stream are typically 0, which suits LZ’s 
byte-based compression model.  But the least significant bytes can still have 
many of their high-order bits equal to 0.  Unless the whole byte is repeated, 
however, LZ compressors can’t do much to exploit that pattern.  For data with 
temporal and/or spatial coherence, ‘shuffling’ is another effective strategy 
to losslessly rearrange the data stream into a form that favors LZ-style 
compressors, and it builds on the gains already provided by the PREDICTOR 
functionality.

The notion is to arrange the bit stream so that the Nth “shuffled” byte 
contains the Nth bit from each byte in the sequence.  The sequence length is 
usually determined by the data type bit length.

For example (for brevity, assume bytes are 4 bits long)

Byte 1,  Byte 2, Byte 3, Byte 4
0001, 0011, 0111, 0001

They all share the top 0 bit and the bottom 1 bit,

“Shuffled”
0000, 0010, 0110, 1111

The algorithm is pretty simple to implement, and can be SIMD accelerated for 
high performance.
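
A minimal, unoptimized Python sketch of the idea, using the 4-bit example 
above.  The function name `bitshuffle` is hypothetical, and this is a plain 
bit-matrix transpose for illustration, not the blocked, SIMD-accelerated 
layout a real codec would use.

```python
def bitshuffle(values, width):
    """Transpose a bit matrix.

    `values` are integers of `width` bits each.  Output word n collects
    bit n (counting from the most significant bit) of every input value,
    so the result has `width` words of len(values) bits each.
    """
    planes = []
    for n in range(width):
        shift = width - 1 - n  # position of bit n, most significant first
        plane = 0
        for v in values:
            plane = (plane << 1) | ((v >> shift) & 1)
        planes.append(plane)
    return planes


if __name__ == "__main__":
    data = [0b0001, 0b0011, 0b0111, 0b0001]
    shuffled = bitshuffle(data, 4)
    print([format(p, "04b") for p in shuffled])
    # A transpose is its own inverse: shuffle again, using the number of
    # original values as the new width, to get the input back.
    print(bitshuffle(shuffled, len(data)) == data)
```

Because the shared top and bottom bits end up in their own words, the shuffled 
stream has runs of identical bytes that an LZ-style compressor can pick up.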

While we are specifically users of the GTiff format, such a strategy could be 
employed generically for most raster and even vector formats.

Best,
Jesse


___
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev

--
http://www.spatialys.com
My software is free, but my time generally not.