Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

Joel Lubinitsky Mon, 22 Jul 2024 12:26:36 -0700

> As a counterpoint we have two different units for date [1] that don't
> really convey meaningful new information.  Implementations have to deal
> with this somehow, and I think the only reason this exists is effectively
> to support different type systems.


Thanks for raising this, Micah.

The previous discussion [2] that you linked refers to a notion of
"grandfathering" certain types that existed before Canonical Extensions
were formalized. I think the Date variations may fall into this category
and could have been added as Canonical Extensions if that mechanism had
existed at the time.

However, this does raise an interesting parallel. The Date type with DAY
units vs MILLISECOND units seems like a close analogy to the current
discussion of Boolean with bit vs byte representations:
- DAY is a simpler, more compact representation of days elapsed. 1-bit
bools are like this, though perhaps "simpler" can be subjective.
- MILLISECOND is a larger representation requiring additional
rules/semantics for interpretation, but can improve interoperability with
some existing systems. 8-bit bools are like this.
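To make the analogy concrete, here is a rough sketch in plain Python (not
pyarrow). The helper names are made up for illustration. It assumes Arrow's
LSB-first bit packing for Boolean, the "0 == false, any non-zero value is
true" semantics proposed for Bool8, and the rule that MILLISECOND dates are
multiples of 86400000:

```python
from datetime import date

EPOCH = date(1970, 1, 1)
MS_PER_DAY = 86_400_000


def date_as_days(d: date) -> int:
    """DAY unit: days elapsed since the UNIX epoch (fits in an int32)."""
    return (d - EPOCH).days


def date_as_millis(d: date) -> int:
    """MILLISECOND unit: same date, stored as milliseconds in an int64."""
    return date_as_days(d) * MS_PER_DAY


def is_valid_millis_date(ms: int) -> bool:
    """The extra rule the wider form carries: whole days only."""
    return ms % MS_PER_DAY == 0


def pack_bits(values) -> bytes:
    """1-bit booleans: eight values per byte, least-significant bit first."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_bits(buf: bytes, length: int) -> list:
    """Read `length` booleans back out of a bit-packed buffer."""
    return [bool((buf[i // 8] >> (i % 8)) & 1) for i in range(length)]


def decode_bool8(buf: bytes) -> list:
    """8-bit booleans: one byte per value; any non-zero byte is true."""
    return [b != 0 for b in buf]
```

In both pairs the wider form needs an interpretation rule that the compact
form does not: "must be a multiple of one day" for MILLISECOND, and "any
non-zero byte is true" for 8-bit bools.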

Can we pull from our experience implementing and maintaining
DAY+MILLISECOND variants of Date as first-class types to decide whether
this would be useful for Booleans as well?

If Canonical Extensions had existed at the time, I think there's a chance
we may have ended up with int32 Date as a first class type and int64
MillisecondDate as a Canonical Extension type. Are there any lessons we've
learned from implementing both as first-class types as opposed to this
hypothetical first-class / extension split? If we implement Bool8 as an
extension type, we'll certainly learn some lessons as well, but they'll
likely be more "reversible". Promoting it to a first-class type in the
future remains an option too. In fact, even if we planned on this being a
first-class type, it might still make sense to let it "incubate" as a
canonical extension first, to avoid pushing an irreversible change on
maintainers without some initial real-world usage.

Thanks,
Joel


[1] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L251
[2] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t

On Fri, Jul 19, 2024 at 4:35 PM Antoine Pitrou <[email protected]> wrote:

> On Fri, 19 Jul 2024 10:03:48 -0300
> Dewey Dunnington <[email protected]> wrote:
>
> > The extension-ness of it is a valid point...all the other cases where
> > we have multiple Arrow types for the same element type (e.g., String,
> > LargeString, StringView) are first-class types. For a Bool8, the
> > tradeoffs are roughly the same (less support for StringView and
> > LargeString, more space required for LargeString, etc.).
>
> All these string types have different intrinsic qualities that warrant
> having them as first-class types (e.g. String is more compact than
> LargeString but cannot represent very large data, etc.).
>
> Conversely, Bool8 doesn't have any intrinsic qualities that plain
> Boolean doesn't have, AFAICT. Boolean can represent the same data, is
> more compact, and is already widely supported.
>
> Regards
>
> Antoine.
>
>
>
> >
> > For me the choice of whether or not to have this be a first-class type
> > or an extension type is just because there is no change required in
> > Schema.fbs/existing implementations can pass through instances of the
> > type without modification (as long as they support extension types). I
> > believe there was some consensus on a previous thread that I can't
> > find now that new types should be implemented as extension types if
> > possible for these (and perhaps other) reasons.
> >
> >
> > On Fri, Jul 19, 2024 at 5:39 AM Antoine Pitrou <[email protected]>
> wrote:
> > >
> > >
> > > Agreed with Felipe. This is meant for communicating with non-Arrow type
> > > systems, but shouldn't be regarded as an alternative first-class
> boolean
> > > type.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > > On 19/07/2024 at 06:30, Felipe Oliveira Carvalho wrote:
> > > > I think it would confuse implementors of the spec and people
> implementing
> > > > kernels way too much. “the bool Arrow type” should probably not start
> > > > meaning two different things.
> > > >
> > > > —
> > > > Felipe
> > > >
> > > > On Fri, 19 Jul 2024 at 01:26 Micah Kornfield <[email protected]>
> wrote:
> > > >
> > > >> As Boolean is already in the arrow type system I think it might be
> worth
> > > >> asking the question as to whether this should be an extension type
> or a
> > > >> first class type.
> > > >>
> > > > >> Given what I think of the last discussion on the trade-offs [1], I
> > > > >> think there is room for debate here. Since Boolean is not currently
> > > > >> parameterized, adding this variant as a first-class type would
> > > > >> require a new top-level type.
> > > >>
> > > >> Thanks,
> > > >> Micah
> > > >>
> > > >> [1]
> https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t
> > > >>
> > > >> On Wed, Jul 17, 2024 at 9:44 PM Alenka Frim <[email protected]>
> wrote:
> > > >>
> > > > >>> Thank you Joel for working on this! I have also come across
> > > > >>> the need for byte-packed boolean support when implementing the
> > > > >>> Python dataframe interchange protocol and also DLPack, which
> > > > >>> is implemented in Arrow C++. The extension type is a great
> > > > >>> solution.
> > > >>>
> > > >>> I will comment on the PR if I have any questions.
> > > >>>
> > > >>> Alenka
> > > >>>
> > > > >>> On Wed, Jul 17, 2024 at 23:32, Ian Cook <[email protected]> wrote:
> > > >>>
> > > >>>> Thanks Joel and Matt. This looks good to me.
> > > >>>>
> > > >>>> I think it's worth saying here that Arrow-producing components
> should
> > > >>> still
> > > >>>> by default emit Booleans in the standard bit-packed Arrow layout.
> This
> > > >>>> proposed bool8 canonical extension type is intended to be used in
> > > >>>> applications where the producer knows that the consumer can
> correctly
> > > >>>> interpret the bool8 extension type and where using it is more
> efficient
> > > >>>> than converting the data to the standard bit-packed layout.
> > > >>>>
> > > >>>> Ian
> > > >>>>
> > > >>>> On Wed, Jul 17, 2024 at 5:19 PM Matt Topol <
> [email protected]>
> > > >>> wrote:
> > > >>>>
> > > >>>>> Just chiming in that the libcudf documentation[1] states that
> this
> > > >>>> proposal
> > > >>>>> should work just fine. Bool8 type is described as "0 == false,
> else
> > > >>>> true".
> > > >>>>>
> > > >>>>> --Matt
> > > >>>>>
> > > >>>>> [1]:
> > > >>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> https://docs.rapids.ai/api/libcudf/stable/group__utility__types#gadf077607da617d1dadcc5417e2783539
>
> > > >>>>>
> > > >>>>> On Wed, Jul 17, 2024, 3:18 PM Joel Lubinitsky <
> [email protected]>
> > > >>>> wrote:
> > > >>>>>
> > > >>>>>> Thank you for your comments.
> > > >>>>>>
> > > >>>>>> I spent some time trying to confirm definitively that this
> proposal
> > > >>>> would
> > > >>>>>> enable zero copy sharing both ways between pyarrow and numpy. I
> put
> > > >>>>>> together the following gist [1] with my experiment.
> > > >>>>>>
> > > >>>>>> To summarize the results:
> > > >>>>>> - I was able to share the underlying value buffer both ways
> and
> > > >> have
> > > >>> it
> > > >>>>> be
> > > >>>>>> interpreted correctly in each case.
> > > >>>>>> - Numpy will write 0 or 1 to the value buffer to indicate False
> or
> > > >>>> True.
> > > >>>>>> Importantly, numpy will also understand values outside this
> range
> > > >> to
> > > >>>> mean
> > > >>>>>> True without requiring a copy. This tracks closely with the
> > > >> proposed
> > > >>>>>> semantics.
> > > >>>>>>
> > > >>>>>> [1]:
> > > >>> https://gist.github.com/joellubi/2ddf626633b57839cfd5f32cd94a7f3b
>
> > > >>>>>>
> > > >>>>>> On Wed, Jul 17, 2024 at 10:16 AM Ian Cook <[email protected]>
>
> > > >>> wrote:
> > > >>>>>>
> > > >>>>>>>>> Before the vote, I would like to see verification that this
> > > >>> truly
> > > >>>>>>> enables
> > > >>>>>>>>> zero-copy to/from NumPy bool arrays in Python.
> > > >>>>>>>
> > > >>>>>>>> I think this is an implementation issue more than a
> > > >> specification
> > > >>>>>>> issue...I am not personally worried about any provisions on the
> > > >>>>>>> specification that might make this impossible.
> > > >>>>>>>
> > > >>>>>>> To clarify, what I am looking for here is definite
> confirmation
> > > >>> that
> > > >>>>>>> the proposed representation (in which a signed int8 zero
> value
> > > >>>>> indicates
> > > >>>>>>> False and any non-zero signed int8 value indicates True)
> > > >>> corresponds
> > > >>>> to
> > > >>>>>> the
> > > >>>>>>> representation used by NumPy such that bidirectional zero-copy
> is
> > > >>>> made
> > > >>>>>>> possible. This seems to me like a specification issue.
> > > >>>>>>>
> > > >>>>>>> Ian
> > > >>>>>>>
> > > >>>>>>> On Wed, Jul 17, 2024 at 9:39 AM Dewey Dunnington
> > > >>>>>>> <[email protected]> wrote:
> > > >>>>>>>
> > > >>>>>>>> Thank you for this! I have definitely run across the
> > > >>>>> one-byte-per-item
> > > >>>>>>>> bool in numpy, DuckDB, and cudf. I haven't heard any
> discussion
> > > >>>> about
> > > >>>>>>>> DuckDB here but I am fairly sure that they represent their
> > > >>> boolean
> > > >>>>>>>> type as an int8 as well [1].
> > > >>>>>>>>
> > > >>>>>>>>> Before the vote, I would like to see verification that this
> > > >>> truly
> > > >>>>>>> enables
> > > >>>>>>>>> zero-copy to/from NumPy bool arrays in Python.
> > > >>>>>>>>
> > > >>>>>>>> I think this is an implementation issue more than a
> > > >> specification
> > > >>>>>>>> issue...I am not personally worried about any provisions on
> the
> > > >>>>>>>> specification that might make this impossible.
> > > >>>>>>>>
> > > >>>>>>>> -dewey
> > > >>>>>>>>
> > > >>>>>>>> [1]
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> https://github.com/duckdb/duckdb/blob/85a82d86aa11a2695fc045deaf4f88fc63dd4fec/src/common/arrow/appender/bool_data.cpp#L28-L37
>
> > > >>>>>>>>
> > > >>>>>>>> On Tue, Jul 16, 2024 at 11:25 AM Antoine Pitrou <
> > > >>>> [email protected]>
> > > >>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> Hi Joel,
> > > >>>>>>>>>
> > > > >>>>>>>>> This looks good to me in principle. Can you split the spec
> > > > >>>>>>>>> and the implementation(s) into separate PRs?
> > > >>>>>>>>>
> > > >>>>>>>>> Regards
> > > >>>>>>>>>
> > > >>>>>>>>> Antoine.
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > > >>>>>>>>> On 16/07/2024 at 13:18, Joel Lubinitsky wrote:
> > > >>>>>>>>>> Hi Arrow devs,
> > > >>>>>>>>>>
> > > >>>>>>>>>> I'm working on adding an extension type for 8-bit
> booleans,
> > > >>> and
> > > >>>>>>> wanted
> > > >>>>>>>> to
> > > >>>>>>>>>> start a discussion about it here because it could be
> > > >> valuable
> > > >>>> to
> > > >>>>>>>> others if
> > > >>>>>>>>>> adopted as a canonical extension type.
> > > >>>>>>>>>>
> > > >>>>>>>>>> The native implementation of the Boolean type uses 1 bit
> to
> > > >>>>> encode
> > > >>>>>>> each
> > > >>>>>>>>>> value, enabling a very compact representation. This is
> > > >>>> favorable
> > > >>>>>> for
> > > >>>>>>>> many
> > > >>>>>>>>>> workloads, but lots of systems that want to
> produce/consume
> > > >>>>> Boolean
> > > >>>>>>>> arrays
> > > >>>>>>>>>> use an 8-bit representation internally and are forced to
> > > >>>>>> copy/convert
> > > >>>>>>>> at
> > > >>>>>>>>>> their periphery. For these scenarios where zero-copy
> > > >>>>> compatibility
> > > >>>>>> is
> > > >>>>>>>>>> important, the 8-bit representation of boolean values may
> > > >> be
> > > >>>>>>> preferred.
> > > >>>>>>>>>> This can benefit interactions with existing libraries that
> > > >>>> avoid
> > > >>>>>>>> packing
> > > >>>>>>>>>> column data like 1-bit booleans for parallelization
> > > >> purposes,
> > > >>>>>>>> including GPU
> > > >>>>>>>>>> libraries such as libcudf. The original issue [1]
> > > >> identifies
> > > >>>>> numpy
> > > >>>>>>>>>> conversion as a specific use-case as well.
> > > >>>>>>>>>>
> > > >>>>>>>>>> The details of the extension type can be found in the
> draft
> > > >>> PR
> > > >>>>> [2]
> > > >>>>>>>> which
> > > >>>>>>>>>> contains a Go implementation (WIP) and an update to the
> > > >>>>>> documentation
> > > >>>>>>>> for
> > > >>>>>>>>>> canonical extension types. I plan to add a C++
> > > >> implementation
> > > >>>> as
> > > >>>>>> well
> > > >>>>>>>> but
> > > >>>>>>>>>> wanted to open this discussion first.
> > > >>>>>>>>>>
> > > >>>>>>>>>> A quick overview of the layout / semantics proposed in the
> > > >>> PR:
> > > >>>>>>>>>> Storage Type: Int8
> > > >>>>>>>>>> Value Semantics: 0 == false, any non-zero value is true
> > > >>>>>>>>>>
> > > >>>>>>>>>> I'd appreciate any feedback here or on the PR. If this all
> > > >>>> seems
> > > >>>>>>>> reasonable
> > > >>>>>>>>>> then I'll move forward with the next implementation and
> > > >> open
> > > >>> up
> > > >>>>>>> another
> > > >>>>>>>>>> proposal for a formal vote. Thanks!
> > > >>>>>>>>>>
> > > >>>>>>>>>> [1]: https://github.com/apache/arrow/issues/17682
> > > >>>>>>>>>> [2]: https://github.com/apache/arrow/pull/43234
> > > >>>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > > >
> >
>
>
>
>
