Thanks for the hint. After reading through the geoarrow spec, I think I agree that this is probably the best approach.
As far as I can tell, all that is required is a standardized set of metadata tags and some well-implemented compute functions that can easily project the raw values to their physical interpretation.

Where I am struggling a little is understanding at what level those compute functions should be implemented. As far as I can tell, when I load a dictionary-encoded Arrow array into a pandas DataFrame or run a query using DataFusion, the user can then operate as if they were working directly with a string array. Is that implemented in the Arrow libraries, or does each "application" (pandas, DataFusion, etc.) have its own implementation?

Best regards,
Elliot Morrison-Reed

-----Original Message-----
From: Andrew Lamb <al...@influxdata.com>
Sent: Saturday, January 6, 2024 8:22 AM
To: dev@arrow.apache.org
Subject: Re: [DISCUSS] Linear Formula Types

Hi Elliot,

Given your description, I agree extension types sound like they may be a good idea, similar to geoarrow[1] for Geospatial data where there is extra metadata[2] needed to interpret underlying types (e.g. factor and offset)

Andrew

[1] https://github.com/geoarrow/geoarrow
[2] https://arrow.apache.org/docs/format/CanonicalExtensions.html#geoarrow

On Sat, Jan 6, 2024 at 3:20 AM Morrison-Reed Elliot (BEG/PJ-EDS-NA) <elliot.morrison-r...@us.bosch.com.invalid> wrote:

> Background
>
> I have been looking into using parquet files for storing and working
> with automotive data. One interesting thing about automotive data is
> that most communication happens on the CAN bus where we have extremely
> limited bandwidth.
> In order to encode "physical" values in a very space efficient way, we
> use linear conversion formulas that look like "phys = (raw * factor) +
> offset".
> This gives implicit range and resolution limits, but that is often
> just fine when we are representing a physical property.
>
> Example 1:
>
> We have a throttle that can be anywhere from 0-100% and we want to fit
> that value into 1 byte.
> So we would use a formula like:
>
> phys = (raw * 0.39215) + 0
>
> Example 2:
>
> We want to record ambient temperature of the vehicle. Resolution of 1
> degree is fine. Also, temperatures below -40 and above 215 degrees C
> are not particularly useful as they are very rare and out of scope for
> a useful temperature.
>
> phys = (raw * 1.0) - 40
>
> So far, I have been converting the raw data into floating point data
> before writing to arrow format to make it easier for the analysts to
> use the data. This of course means that I am converting to a less
> efficient format and I am also losing inherent information about the
> raw signal. I would rather be able to store the raw data in an
> appropriately sized unsigned integer and automatically convert to
> floating point when using the data, similar to dictionary encoding.
>
> Discussion
>
> - How would people generally deal with this situation using the arrow
> format?
> - Is this something that other people are interested in?
> - If this were to be added to the spec, what would be the best way to
> do it?
>
> While I am coming from an automotive perspective, I think there are
> many other areas of applicability (reading sensor data through an ADC,
> industrial automation and monitoring, etc.)
>
> I could see this working as either a new primitive type (similar to
> decimal), or as an extension where we simply put the factor and offset
> as standard metadata fields.
>
> Best regards,
> Elliot Morrison-Reed
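
For reference, here is a minimal sketch of the linear conversion described in the quoted message, written in plain Python with no Arrow dependency. The `LinearFormula` class and its method names are hypothetical and only for illustration. In an actual extension type the factor and offset would presumably be carried as metadata on the field, which this sketch does not attempt to model.

```python
from array import array
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearFormula:
    """Hypothetical holder for the factor and offset of one signal."""

    factor: float
    offset: float

    def to_physical(self, raw):
        # phys = (raw * factor) + offset, as in the quoted message
        return [r * self.factor + self.offset for r in raw]

    def to_raw(self, phys, typecode="B"):
        # Inverse mapping, rounded to the nearest raw step.
        # array("B", ...) stores unsigned bytes and raises OverflowError
        # when a value falls outside the range implied by the formula.
        return array(
            typecode,
            (round((p - self.offset) / self.factor) for p in phys),
        )


throttle = LinearFormula(factor=0.39215, offset=0.0)    # Example 1
ambient_temp = LinearFormula(factor=1.0, offset=-40.0)  # Example 2

raw_temp = ambient_temp.to_raw([-40.0, 21.0, 215.0])
print(list(raw_temp))                      # [0, 61, 255]
print(ambient_temp.to_physical(raw_temp))  # [-40.0, 21.0, 215.0]
```

The raw values fit in one byte each, and the physical view is only computed when it is needed, which is the behaviour asked for above by analogy with dictionary encoding.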