Re: [DISCUSS] Linear Formula Types

Julian Hyde Sun, 07 Jan 2024 14:26:36 -0800

If the DB layer above Arrow supports it, I would define a (non-stored)
calculated column. Given celsius_percent between 0 and 1 (a fraction of
100 degrees C), I would define fahrenheit as
(32 + celsius_percent * 180). A good query optimizer would convert the
condition 'where fahrenheit > 122' into 'where celsius_percent > 0.5'.


Extension types would work too, but calculations seem simpler.
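
To illustrate the rewrite an optimizer could do, here is a minimal sketch in
plain Python applied to the linear formula from the original message
(phys = raw * factor + offset). The function names are hypothetical and not
part of Arrow or any query engine; the sketch only shows the algebra of
moving the condition from the calculated column onto the stored one.

```python
import math


def rewrite_greater_than(threshold, factor, offset):
    """Rewrite `raw * factor + offset > threshold` as a comparison on raw.

    Returns (operator, bound). A negative factor flips the comparison.
    """
    if factor == 0:
        raise ValueError("factor must be non-zero")
    bound = (threshold - offset) / factor
    return (">" if factor > 0 else "<"), bound


def smallest_matching_raw(threshold, factor, offset):
    """For integer raw values and a positive factor, return the smallest
    raw value whose physical value exceeds the threshold.

    Note: the bound is computed in floating point, so a threshold that
    lands exactly on a raw step may need extra care in a real system.
    """
    op, bound = rewrite_greater_than(threshold, factor, offset)
    if op != ">":
        raise ValueError("only positive factors are handled in this sketch")
    return math.floor(bound) + 1


# Example 2 from the original message: phys = raw * 1.0 - 40
print(rewrite_greater_than(50.0, 1.0, -40.0))     # ('>', 90.0)

# Example 1 from the original message: phys = raw * 0.39215 + 0
print(smallest_matching_raw(50.0, 0.39215, 0.0))  # 128
```

So 'where phys > 50' on the temperature signal becomes 'where raw > 90', and
the filter can run on the stored integers without converting every row.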

On Sat, Jan 6, 2024 at 5:22 AM Andrew Lamb <[email protected]> wrote:
>
> Hi Elliot,
>
> Given your description, I agree extension types sound like they may be a
> good idea, similar to geoarrow[1] for geospatial data, where extra
> metadata[2] is needed to interpret the underlying types (in your case,
> the factor and offset).
>
> Andrew
>
> [1] https://github.com/geoarrow/geoarrow
> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html#geoarrow
>
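
As a rough sketch of what "extra metadata" could look like here: Arrow marks
an extension type on a field with the custom metadata keys
ARROW:extension:name and ARROW:extension:metadata. The extension name
"example.linear_formula" and the JSON layout below are assumptions made up
for illustration, not an existing or proposed canonical extension. The code
uses only the Python standard library and builds the key/value pairs by
hand rather than calling any Arrow API.

```python
import json

# Hypothetical extension name, chosen only for this illustration.
EXTENSION_NAME = "example.linear_formula"


def serialize_params(factor, offset):
    """Pack the formula parameters into a byte string (JSON, by assumption)."""
    return json.dumps({"factor": factor, "offset": offset}).encode("utf-8")


def deserialize_params(blob):
    """Recover (factor, offset) from the serialized metadata."""
    params = json.loads(blob.decode("utf-8"))
    return float(params["factor"]), float(params["offset"])


# Field-level key/value metadata as it might accompany a uint8 storage column.
field_metadata = {
    b"ARROW:extension:name": EXTENSION_NAME.encode("utf-8"),
    b"ARROW:extension:metadata": serialize_params(0.39215, 0.0),
}

factor, offset = deserialize_params(field_metadata[b"ARROW:extension:metadata"])
print(factor, offset)
```

A reader that does not know the extension would still see the plain integer
storage column, which is the usual fallback behaviour for extension types.
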
> On Sat, Jan 6, 2024 at 3:20 AM Morrison-Reed Elliot (BEG/PJ-EDS-NA)
> <[email protected]> wrote:
>
> > Background
> >
> > I have been looking into using parquet files for storing and working with
> > automotive data. One interesting thing about automotive data is that most
> > communication happens on the CAN bus where we have extremely limited
> > bandwidth.
> > In order to encode "physical" values in a very space efficient way, we
> > use linear conversion formulas that look like "phys = (raw * factor) +
> > offset". This gives implicit range and resolution limits, but that is
> > often just fine when we are representing a physical property.
> >
> > Example 1:
> >
> > We have a throttle that can be anywhere from 0-100% and we want to fit that
> > value into 1 byte. So we would use a formula like:
> >
> >     phys = (raw * 0.39215) + 0
> >
> > Example 2:
> >
> > We want to record ambient temperature of the vehicle. Resolution of 1
> > degree is fine. Also, temperatures below -40 and above 215 degrees C are
> > not particularly useful as they are very rare and out of scope for a
> > useful temperature.
> >
> >     phys = (raw * 1.0) - 40
> >
> > So far, I have been converting the raw data into floating point data
> > before writing to arrow format to make it easier for the analysts to use
> > the data. This of course means that I am converting to a less efficient
> > format and I am also losing inherent information about the raw signal. I
> > would rather be able to store the raw data in an appropriately sized
> > unsigned integer and automatically convert to floating point when using
> > the data, similar to dictionary encoding.
> >
> > Discussion
> >
> > - How would people generally deal with this situation using the arrow
> >   format?
> > - Is this something that other people are interested in?
> > - If this were to be added to the spec, what would be the best way to do
> >   it?
> >
> > While I am coming from an automotive perspective, I think there are many
> > other areas of applicability (reading sensor data through an ADC,
> > industrial automation and monitoring, etc.)
> >
> > I could see this working as either a new primitive type (similar to
> > decimal), or as an extension where we simply put the factor and offset
> > as standard metadata fields.
> >
> > Best regards,
> > Elliot Morrison-Reed
> >
> >
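
For reference, the two examples in the original message can be sketched in
plain Python as follows. This is only an illustration of the formula
phys = (raw * factor) + offset and the range it implies for a 1-byte raw
value; the function names and the 8-bit default are assumptions, and a
positive factor is assumed throughout.

```python
def decode(raw, factor, offset):
    """Convert a raw integer to its physical value."""
    return raw * factor + offset


def encode(phys, factor, offset, bits=8):
    """Convert a physical value to the nearest raw integer.

    Raises ValueError when the value does not fit in an unsigned integer
    of the given width, which is the implicit range limit of the formula.
    """
    raw = round((phys - offset) / factor)
    max_raw = (1 << bits) - 1
    if not 0 <= raw <= max_raw:
        raise ValueError(f"{phys} is outside the representable range")
    return raw


def physical_range(factor, offset, bits=8):
    """Smallest and largest physical values (assumes factor > 0)."""
    max_raw = (1 << bits) - 1
    return decode(0, factor, offset), decode(max_raw, factor, offset)


# Example 2: ambient temperature, phys = raw * 1.0 - 40
print(physical_range(1.0, -40.0))   # (-40.0, 215.0)
print(encode(25.0, 1.0, -40.0))     # 65

# Example 1: throttle, phys = raw * 0.39215 + 0 (tops out just under 100%)
print(physical_range(0.39215, 0.0))
```

The resolution is simply the factor: one raw step changes the physical value
by 1 degree in Example 2 and by about 0.39 percentage points in Example 1.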
