If the DB layer above Arrow supports it, I would define a (non-stored) calculated column. Given celsius_percent between 0 and 1, I would define fahrenheit as (32 + celsius_percent * 180). A good query optimizer would convert the condition 'where fahrenheit > 122' into 'where celsius_percent > 0.5'.
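To make the rewrite concrete, here is a rough sketch of how an optimizer could invert a linear calculated column to push a comparison down to the stored column. It is not taken from any particular engine: LinearColumn and push_down are hypothetical names, and I am assuming celsius_percent is a 0-to-1 fraction of the 0 to 100 degree C span, which makes the scale factor 180.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearColumn:
    """Hypothetical calculated column: value = offset + factor * base."""
    base: str
    factor: float
    offset: float


# Mirror image of each comparison, needed when the factor is negative.
_FLIP = {">": "<", "<": ">", ">=": "<=", "<=": ">="}


def push_down(col, op, threshold):
    """Rewrite `col op threshold` as an equivalent predicate on col.base."""
    if col.factor == 0:
        raise ValueError("factor must be non-zero to invert the formula")
    # Solve offset + factor * base = threshold for base.
    base_threshold = (threshold - col.offset) / col.factor
    # Dividing by a negative factor reverses the inequality.
    base_op = op if col.factor > 0 else _FLIP[op]
    return col.base, base_op, base_threshold


# Assumed definition: fahrenheit = 32 + celsius_percent * 180
fahrenheit = LinearColumn(base="celsius_percent", factor=180.0, offset=32.0)
predicate = push_down(fahrenheit, ">", 122.0)
print(predicate)  # ('celsius_percent', '>', 0.5)
```

The same inversion applies to any column of the form offset + factor * base, as long as the factor is non-zero.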
Extension types would work too, but calculations seem simpler.

On Sat, Jan 6, 2024 at 5:22 AM Andrew Lamb <al...@influxdata.com> wrote:
>
> Hi Elliot,
>
> Given your description, I agree extension types sound like they may be a
> good idea, similar to geoarrow[1] for Geospatial data where there is extra
> metadata[2] needed to interpret underlying types (e.g. factor and offset)
>
> Andrew
>
> [1] https://github.com/geoarrow/geoarrow
> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html#geoarrow
>
> On Sat, Jan 6, 2024 at 3:20 AM Morrison-Reed Elliot (BEG/PJ-EDS-NA)
> <elliot.morrison-r...@us.bosch.com.invalid> wrote:
>
> > Background
> >
> > I have been looking into using parquet files for storing and working
> > with automotive data. One interesting thing about automotive data is
> > that most communication happens on the CAN bus where we have extremely
> > limited bandwidth. In order to encode "physical" values in a very space
> > efficient way, we use linear conversion formulas that look like
> > "phys = (raw * factor) + offset". This gives implicit range and
> > resolution limits, but that is often just fine when we are representing
> > a physical property.
> >
> > Example 1:
> >
> > We have a throttle that can be anywhere from 0-100% and we want to fit
> > that value into 1 byte. So we would use a formula like:
> >
> > phys = (raw * 0.39215) + 0
> >
> > Example 2:
> >
> > We want to record ambient temperature of the vehicle. Resolution of 1
> > degree is fine. Also, temperatures below -40 and above 215 degrees C are
> > not particularly useful as they are very rare and out of scope for a
> > useful temperature.
> >
> > phys = (raw * 1.0) - 40
> >
> > So far, I have been converting the raw data into floating point data
> > before writing to arrow format to make it easier for the analysts to use
> > the data. This of course means that I am converting to a less efficient
> > format and I am also losing inherent information about the raw signal.
> > I would rather be able to store the raw data in an appropriately sized
> > unsigned integer and automatically convert to floating point when using
> > the data, similar to dictionary encoding.
> >
> > Discussion
> >
> > - How would people generally deal with this situation using the arrow
> >   format?
> > - Is this something that other people are interested in?
> > - If this were to be added to the spec, what would be the best way to
> >   do it?
> >
> > While I am coming from an automotive perspective, I think there are many
> > other areas of applicability (reading sensor data through an ADC,
> > industrial automation and monitoring, etc.)
> >
> > I could see this working as either a new primitive type (similar to
> > decimal), or as an extension where we simply put the factor and offset
> > as standard metadata fields.
> >
> > Best regards,
> > Elliot Morrison-Reed
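
For reference, the "phys = (raw * factor) + offset" scheme from the original message can be sketched in a few lines of plain Python. This only illustrates the arithmetic and the implicit one-byte range. It is not an Arrow API, and LinearSignal is a hypothetical name; in Arrow the factor and offset would presumably live in extension-type or field metadata, as discussed above.

```python
from array import array
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearSignal:
    """Hypothetical holder for the factor/offset of one raw signal."""
    factor: float
    offset: float

    def decode(self, raw):
        """Convert raw integer samples to physical floating point values."""
        return [r * self.factor + self.offset for r in raw]

    def encode(self, phys):
        """Convert physical values to one-byte raw samples.

        array("B") holds unsigned bytes, so a value outside the implicit
        0..255 raw range raises OverflowError instead of wrapping.
        """
        return array("B", (round((p - self.offset) / self.factor) for p in phys))


# Example 2 from the message: ambient temperature, 1 degree resolution.
temperature = LinearSignal(factor=1.0, offset=-40.0)
raw = array("B", [0, 65, 255])
print(temperature.decode(raw))  # [-40.0, 25.0, 215.0]

# Example 1 from the message: throttle position in one byte.
throttle = LinearSignal(factor=0.39215, offset=0.0)
full_scale = throttle.decode(array("B", [255]))[0]  # just under 100 percent
```

Storing the one-byte array and decoding on read keeps the compact representation while still giving analysts floating point values.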