Hi,

Thanks a lot for the feedback. At the moment I was really just trying to find
out whether others also see these types as packed structs.

Wrt the extension type, I am not sure we can make it fast: the
interpretation of the bytes would need to be done dynamically (instead of
statically) because we can't compile the struct prior to receiving it (via
IPC or FFI). This interpretation would be part of hot loops (as we would
need to interpret the bytes on every element).
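
To illustrate what I mean by "dynamically", here is a minimal Python sketch
in which the field list is only known at runtime, so the consumer has to
build its decoder from a description instead of compiling it in. The
`FORMAT_CODES` table, the `make_decoder` helper and the little-endian
assumption are all hypothetical, not part of Arrow or any existing extension.

```python
import struct

# Hypothetical mapping from field type names to struct format codes.
FORMAT_CODES = {"int32": "i", "int64": "q", "float64": "d"}


def make_decoder(field_types):
    """Build a decoder from a field list that is only known at runtime."""
    # "<" means little-endian with no padding (an assumption of this sketch).
    layout = struct.Struct("<" + "".join(FORMAT_CODES[t] for t in field_types))

    def decode(buffer):
        # The layout is interpreted for every element of the buffer.
        return list(layout.iter_unpack(buffer))

    return decode


# The consumer learns the layout on receipt, e.g. (int32, int32).
decode = make_decoder(["int32", "int32"])
data = struct.pack("<ii", 1, 500) + struct.pack("<ii", 2, 0)
print(decode(data))  # [(1, 500), (2, 0)]
```

In a compiled implementation the worry is the same in spirit: the lookup
from description to layout cannot be resolved at compile time.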

For this to work efficiently, IMO we would need some kind of "C extension"
whereby people could declare a C struct as part of the extension, which
consumers would compile to their own language for consumption. My
understanding is that in essence this is what we have been doing for the
interval types when we write things like

"A triple of the number of elapsed months, days, and nanoseconds.
//  The values are stored contiguously in 16 byte blocks. Months and
//  days are encoded as 32 bit integers and nanoseconds is encoded as a
//  64 bit integer. All integers are signed."

That text in effect declares the struct, which implementations then
hard-code in their source code.
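
As a minimal sketch of what such hard-coding amounts to, the 16-byte layout
quoted above can be written down once as a fixed format. This uses Python's
`struct` module; the names `MONTH_DAY_NANO` and `decode_month_day_nano` are
made up for illustration, and little-endian byte order is an assumption.

```python
import struct

# Layout from the quoted description: int32 months, int32 days and
# int64 nanoseconds, stored contiguously in 16 bytes, all signed.
# Little-endian byte order is an assumption of this sketch.
MONTH_DAY_NANO = struct.Struct("<iiq")
assert MONTH_DAY_NANO.size == 16


def decode_month_day_nano(buffer):
    """Decode a values buffer into (months, days, nanoseconds) triples."""
    return list(MONTH_DAY_NANO.iter_unpack(buffer))


values = MONTH_DAY_NANO.pack(1, 2, 3) + MONTH_DAY_NANO.pack(-1, 0, 10**9)
print(decode_month_day_nano(values))  # [(1, 2, 3), (-1, 0, 1000000000)]
```

Here the layout is fixed in the consumer's source, which is the static
case I am contrasting with the extension type.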

It is interesting that these resemble the idea of protobuf and thrift but
at the intra-process level (FFI).

Micah, I was thinking about the page with the memory layout [1],
specifically the primitive section, where some mental effort is required to
interpret the interval types as primitives (but not the FixedSizeBinary);
my understanding is that the former has a known packed struct while the
latter does not.

Best,
Jorge

[1]
https://arrow.apache.org/docs/format/Columnar.html#physical-memory-layout





On Thu, Sep 2, 2021 at 4:45 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> I agree, it is what I would have proposed for the interval type if there
> wasn't an interval type in Arrow already.  I think FixedSizeList has for
> better or worse solved a lot of the problems that a struct type would be
> used for (e.g. coordinates)
>
> Cheers,
> Micah
>
> On Tue, Aug 31, 2021 at 8:27 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > I do still think that having a "packed C struct" type would be a
> > useful thing, but thus far no one has needed it enough to develop
> > something in the columnar format specification.
> >
> > On Tue, Aug 31, 2021 at 1:33 AM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> > >
> > > Hi Jorge,
> > > Are there places in the docs that you think this would simplify?
> > > > There is an old JIRA [1] about introducing a c-struct type that I
> > > > think aligns with this observation.
> > >
> > > -Micah
> > >
> > > [1] https://issues.apache.org/jira/browse/ARROW-1790
> > >
> > > On Mon, Aug 30, 2021 at 2:57 PM Jorge Cardoso Leitão
> > > <jorgecarlei...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > > Just came across this curiosity that IMO may help us to design physical
> > > > > types in the future.
> > > > >
> > > > > Not sure if this was mentioned before, but it seems to me that
> > > > > `DaysMilliseconds` and `MonthDayNano` belong to a broader class of physical
> > > > > types, "typed tuples", in that they are constructed by defining the tuple
> > > > > `(t_1,t_2,...,t_N)` where t_i (e.g. int32) is representable in memory for a
> > > > > given endianness, and each element of the array is written to the buffer
> > > > > back to back as `<t1 in endianness><t2 in endianness>...<tN in endianness>`.
> > > > >
> > > > > Primitive arrays such as e.g. `Int32Array` are the extreme case where the
> > > > > tuple has a single entry (t1,), which leads to `<int32 in endianness>`. The
> > > > > others are:
> > > > > * DaysMilliseconds = (int32, int32)
> > > > > * MonthDayNano = (int32, int32, int64)
> > > > >
> > > > > In principle, we could re-write the in-memory layout page in these terms,
> > > > > which would place all the types above in the same "bucket".
> > > >
> > > > Best,
> > > > Jorge
> >
>
