Background

I have been looking into using Parquet files for storing and working with automotive data. One interesting thing about automotive data is that most communication happens on the CAN bus, where bandwidth is extremely limited. To encode "physical" values in a very space-efficient way, we use linear conversion formulas of the form "phys = (raw * factor) + offset". This imposes implicit range and resolution limits, but that is often just fine when we are representing a physical property.
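To make the formula concrete, here is a minimal Python sketch of the conversion in both directions. The function names `raw_to_phys` and `phys_to_raw`, the clamping behaviour, and the factor/offset values are illustrative assumptions, not taken from any CAN tooling or from Arrow.

```python
from array import array


def raw_to_phys(raw, factor, offset):
    """Decode raw integers into physical values: phys = raw * factor + offset."""
    return [r * factor + offset for r in raw]


def phys_to_raw(phys, factor, offset, bits=8):
    """Encode physical values as unsigned integers of the given width.

    Values outside the representable range are clamped (an assumption;
    a real encoder might instead raise or use a reserved "invalid" code).
    """
    max_raw = (1 << bits) - 1
    encoded = []
    for p in phys:
        r = round((p - offset) / factor)
        encoded.append(min(max(r, 0), max_raw))
    return encoded


# Hypothetical one-byte signal: resolution 0.5, range -10.0 .. 117.5
FACTOR, OFFSET = 0.5, -10.0

raw = array("B", [0, 100, 255])          # unsigned 8-bit storage
phys = raw_to_phys(raw, FACTOR, OFFSET)  # [-10.0, 40.0, 117.5]
roundtrip = phys_to_raw(phys, FACTOR, OFFSET)
```

The range and resolution follow directly from the parameters: with 8 bits the smallest value is `offset`, the largest is `255 * factor + offset`, and the step between adjacent values is `factor`.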
Example 1: We have a throttle that can be anywhere from 0-100% and we want to fit that value into 1 byte. So we would use a formula like:

    phys = (raw * 0.39215) + 0

Example 2: We want to record the ambient temperature of the vehicle. A resolution of 1 degree is fine. Also, temperatures below -40 and above 215 degrees C are not particularly useful, as they are very rare and out of scope for a useful temperature:

    phys = (raw * 1.0) - 40

So far, I have been converting the raw data into floating point data before writing to Arrow format, to make it easier for the analysts to use the data. This of course means that I am converting to a less efficient format, and I am also losing inherent information about the raw signal. I would rather be able to store the raw data in an appropriately sized unsigned integer and automatically convert to floating point when using the data, similar to dictionary encoding.

Discussion

- How would people generally deal with this situation using the Arrow format?
- Is this something that other people are interested in?
- If this were to be added to the spec, what would be the best way to do it?

While I am coming from an automotive perspective, I think there are many other areas of applicability (reading sensor data through an ADC, industrial automation and monitoring, etc.).

I could see this working as either a new primitive type (similar to decimal), or as an extension where we simply put the factor and offset as standard metadata fields.

Best regards,
Elliot Morrison-Reed