Thanks for the hint.

After reading through the geoarrow spec, I think I agree that this is probably 
the best approach.

As far as I can tell, all that is required is a standardized set of metadata 
tags plus some well-implemented compute functions that can easily project the 
raw values to their physical interpretations.
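
(To make that concrete, here is a rough pyarrow sketch of what I have in mind. 
The "example.linear" extension name, the metadata keys, and the to_physical() 
helper are placeholders of my own, not anything that exists today:)

    import json
    import pyarrow as pa
    import pyarrow.compute as pc

    # Hypothetical extension type that carries factor/offset as metadata.
    class LinearType(pa.ExtensionType):
        def __init__(self, storage_type, factor, offset):
            self.factor = factor
            self.offset = offset
            super().__init__(storage_type, "example.linear")  # made-up name

        def __arrow_ext_serialize__(self):
            return json.dumps({"factor": self.factor,
                               "offset": self.offset}).encode()

        @classmethod
        def __arrow_ext_deserialize__(cls, storage_type, serialized):
            meta = json.loads(serialized.decode())
            return cls(storage_type, meta["factor"], meta["offset"])

    # The "compute function": project raw storage to physical values.
    def to_physical(arr):
        t = arr.type
        raw = pc.cast(arr.storage, pa.float64())
        return pc.add(pc.multiply(raw, t.factor), t.offset)

    # Temperature example from my original mail: phys = (raw * 1.0) - 40
    temp_type = LinearType(pa.uint8(), factor=1.0, offset=-40.0)
    raw = pa.array([0, 25, 255], pa.uint8())
    temps = pa.ExtensionArray.from_storage(temp_type, raw)
    print(to_physical(temps))  # -> [-40.0, -15.0, 215.0]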

Where I am struggling a little bit is understanding at what level those compute 
functions should be implemented. As far as I can tell, when I load a 
dictionary-encoded Arrow array into a pandas DataFrame or run a query with 
DataFusion, the user can just operate as if they were working directly with a 
string array. Is that implemented in the Arrow libraries, or does each 
"application" (pandas, DataFusion, etc.) have its own implementation?
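
(For reference, a minimal pyarrow/pandas sketch of the kind of transparency I 
mean, assuming I have understood it correctly:)

    import pyarrow as pa

    # Dictionary-encoded strings: stored as integer indices plus a value dictionary.
    arr = pa.array(["idle", "drive", "idle", "reverse"]).dictionary_encode()
    print(arr.type)     # dictionary<values=string, indices=int32, ordered=0>

    # In pandas this surfaces as a Categorical, and I can work with the string
    # values directly rather than with the underlying indices.
    s = arr.to_pandas()
    print(s == "idle")  # element-wise comparison against the decoded values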

Best regards,
Elliot Morrison-Reed

-----Original Message-----
From: Andrew Lamb <al...@influxdata.com>
Sent: Saturday, January 6, 2024 8:22 AM
To: dev@arrow.apache.org
Subject: Re: [DISCUSS] Linear Formula Types

Hi Elliot,

Given your description, I agree that extension types sound like they may be a 
good idea, similar to geoarrow[1] for geospatial data, where extra metadata[2] 
(e.g. a factor and offset) is needed to interpret the underlying types.

Andrew

[1] https://github.com/geoarrow/geoarrow
[2] https://arrow.apache.org/docs/format/CanonicalExtensions.html#geoarrow

On Sat, Jan 6, 2024 at 3:20 AM Morrison-Reed Elliot (BEG/PJ-EDS-NA) 
<elliot.morrison-r...@us.bosch.com.invalid> wrote:

> Background
>
> I have been looking into using Parquet files for storing and working
> with automotive data. One interesting thing about automotive data is
> that most communication happens on the CAN bus, where bandwidth is
> extremely limited. To encode "physical" values in a very
> space-efficient way, we use linear conversion formulas of the form
> "phys = (raw * factor) + offset". This gives implicit range and
> resolution limits, but that is often just fine when we are
> representing a physical property.
>
> Example 1:
>
> We have a throttle that can be anywhere from 0-100% and we want to fit
> that value into 1 byte. So we would use a formula like:
>
>     phys = (raw * 0.39215) + 0
>
> Example 2:
>
> We want to record the ambient temperature of the vehicle. A resolution
> of 1 degree is fine, and temperatures below -40 or above 215 degrees C
> are very rare and out of scope for a useful measurement, so the value
> again fits in a single byte:
>
>     phys = (raw * 1.0) - 40
>
> So far, I have been converting the raw data to floating point before
> writing it to the Arrow format, to make the data easier for analysts
> to use. This of course means that I am converting to a less
> space-efficient format and also losing inherent information about the
> raw signal. I would rather store the raw data as an appropriately
> sized unsigned integer and convert to floating point automatically
> when the data is used, similar to dictionary encoding.
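>
> (Roughly what I am after, sketched with plain field metadata in
> pyarrow; the "factor"/"offset" key names are just made up for
> illustration:)
>
>     import pyarrow as pa
>     import pyarrow.compute as pc
>     import pyarrow.parquet as pq
>
>     # Store the raw uint8 throttle signal, attaching factor/offset as
>     # field metadata instead of widening to floating point.
>     field = pa.field("throttle", pa.uint8(),
>                      metadata={"factor": "0.39215", "offset": "0"})
>     table = pa.table({"throttle": pa.array([0, 128, 255], pa.uint8())},
>                      schema=pa.schema([field]))
>     pq.write_table(table, "signals.parquet")
>
>     # On read, apply phys = (raw * factor) + offset by hand.
>     t = pq.read_table("signals.parquet")
>     f = t.schema.field("throttle")
>     factor = float(f.metadata[b"factor"])
>     offset = float(f.metadata[b"offset"])
>     raw = pc.cast(t["throttle"], pa.float64())
>     phys = pc.add(pc.multiply(raw, factor), offset)
>     print(phys)  # roughly [0.0, 50.2, 100.0]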
>
> Discussion
>
> - How would people generally deal with this situation using the Arrow
> format?
> - Is this something that other people are interested in?
> - If this were to be added to the spec, what would be the best way to
> do it?
>
> While I am coming from an automotive perspective, I think there are
> many other areas of applicability (reading sensor data through an ADC,
> industrial automation and monitoring, etc.).
>
> I could see this working as either a new primitive type (similar to
> decimal), or as an extension type where we simply store the factor and
> offset in standard metadata fields.
>
> Best regards,
> Elliot Morrison-Reed
>
>
