On Fri, 19 Jul 2024 10:03:48 -0300
Dewey Dunnington <de...@voltrondata.com.INVALID> wrote:

> The extension-ness of it is a valid point...all the other cases where
> we have multiple Arrow types for the same element type (e.g., String,
> LargeString, StringView) are first-class types. For a Bool8, the
> tradeoffs are roughly the same (less support for StringView and
> LargeString, more space required for LargeString, etc.).

All these string types have different intrinsic qualities that warrant
having them as first-class types (e.g. String is more compact than
LargeString but cannot represent very large data, etc.).

Conversely, Bool8 doesn't have any intrinsic qualities that plain
Boolean doesn't have, AFAICT. Boolean can represent the same data, is
more compact, and is already widely supported.
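
To make the size difference concrete, here is a minimal sketch in plain
Python. It is illustrative only: the function names are invented for this
example and are not part of any Arrow library. It assumes the
least-significant-bit ordering that Arrow uses for bit-packed Boolean data
and the proposed Bool8 semantics (0 == false, any non-zero value is true).

```python
import math


def bit_packed_size(n_values: int) -> int:
    """Bytes needed for the data buffer of n bit-packed Boolean values."""
    return math.ceil(n_values / 8)


def pack_bool8(values: bytes) -> bytes:
    """Convert byte-per-value booleans into a bit-packed buffer.

    Each input byte follows the proposed Bool8 semantics: 0 is false and
    any non-zero byte is true. Python's bytes are unsigned, but the
    zero / non-zero test is the same for signed int8 storage.
    Bit i of the output lives in byte i // 8 at position i % 8 (LSB first).
    """
    out = bytearray(bit_packed_size(len(values)))
    for i, v in enumerate(values):
        if v != 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


# Nine byte-per-value booleans; 2 and 255 both count as true.
bool8_values = bytes([0, 1, 2, 0, 255, 0, 0, 1, 1])
packed = pack_bool8(bool8_values)
print(len(bool8_values), "bytes as Bool8 vs", len(packed), "bytes bit-packed")
```

The packing loop is also the copy/convert step that a byte-per-value
producer has to perform before handing data over as plain Boolean.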

Regards

Antoine.



> 
> For me the choice of whether or not to have this be a first-class type
> or an extension type is just because there is no change required in
> Schema.fbs/existing implementations can pass through instances of the
> type without modification (as long as they support extension types). I
> believe there was some consensus on a previous thread that I can't
> find now that new types should be implemented as extension types if
> possible for these (and perhaps other) reasons.
> 
> 
> On Fri, Jul 19, 2024 at 5:39 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> >
> > Agreed with Felipe. This is meant for communicating with non-Arrow type
> > systems, but shouldn't be regarded as an alternative first-class boolean
> > type.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > > On 19/07/2024 at 06:30, Felipe Oliveira Carvalho wrote:
> > > I think it would confuse implementors of the spec and people implementing
> > > kernels way too much. “the bool Arrow type” should probably not start
> > > meaning two different things.
> > >
> > > —
> > > Felipe
> > >
> > > > On Fri, 19 Jul 2024 at 01:26 Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >  
> > >> As Boolean is already in the arrow type system I think it might be worth
> > >> asking the question as to whether this should be an extension type or a
> > >> first class type.
> > >>
> > > >> Given what I think of the last discussion on the trade-offs [1], I think
> > > >> there is room for debate here, since Boolean is not currently
> > > >> parameterized, adding it as an existing type would require a new top
> > > >> level type.
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> [1] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t
> > >>
> > > >> On Wed, Jul 17, 2024 at 9:44 PM Alenka Frim <frim.ale...@gmail.com> wrote:
> > >>  
> > > >>> Thank you Joel for working on this! I have also come across
> > > >>> the need for byte-packed boolean support when implementing the
> > > >>> Python dataframe interchange protocol and also DLPack, which
> > > >>> is implemented in Arrow C++. The extension type is a great solution.
> > >>>
> > >>> I will comment on the PR if I have any questions.
> > >>>
> > >>> Alenka
> > >>>
> > > >>> On Wed, 17 Jul 2024 at 23:32, Ian Cook <ianmc...@apache.org> wrote:
> > >>>  
> > >>>> Thanks Joel and Matt. This looks good to me.
> > >>>>
> > > >>>> I think it's worth saying here that Arrow-producing components should
> > > >>>> still by default emit Booleans in the standard bit-packed Arrow layout. This
> > >>>> proposed bool8 canonical extension type is intended to be used in
> > >>>> applications where the producer knows that the consumer can correctly
> > >>>> interpret the bool8 extension type and where using it is more efficient
> > >>>> than converting the data to the standard bit-packed layout.
> > >>>>
> > >>>> Ian
> > >>>>
> > > >>>> On Wed, Jul 17, 2024 at 5:19 PM Matt Topol <zotthewiz...@gmail.com> wrote:
> > >>>>  
> > > >>>>> Just chiming in that the libcudf documentation[1] states that this
> > > >>>>> proposal should work just fine. Bool8 type is described as "0 == false,
> > > >>>>> else true".
> > >>>>>
> > >>>>> --Matt
> > >>>>>
> > >>>>> [1]:
> > >>>>>
> > >>>>>  
> > >>>>  
> > >>>  
> > >> https://docs.rapids.ai/api/libcudf/stable/group__utility__types#gadf077607da617d1dadcc5417e2783539
> > >>   
> > >>>>>
> > > >>>>> On Wed, Jul 17, 2024, 3:18 PM Joel Lubinitsky <joell...@gmail.com> wrote:
> > >>>>>  
> > >>>>>> Thank you for your comments.
> > >>>>>>
> > > >>>>>> I spent some time trying to confirm definitively that this proposal
> > > >>>>>> would enable zero copy sharing both ways between pyarrow and numpy. I
> > > >>>>>> put together the following gist [1] with my experiment.
> > >>>>>>
> > >>>>>> To summarize the results:
> > > >>>>>> - I was able to share the underlying value buffer both ways and have it
> > > >>>>>>   be interpreted correctly in each case.
> > > >>>>>> - Numpy will write 0 or 1 to the value buffer to indicate False or True.
> > > >>>>>>   Importantly, numpy will also understand values outside this range to
> > > >>>>>>   mean True without requiring a copy. This tracks closely with the
> > > >>>>>>   proposed semantics.
> > >>>>>>
> > >>>>>> [1]:  
> > >>> https://gist.github.com/joellubi/2ddf626633b57839cfd5f32cd94a7f3b  
> > >>>>>>
> > > >>>>>> On Wed, Jul 17, 2024 at 10:16 AM Ian Cook <ianmc...@apache.org> wrote:
> > >>>>>>  
> > > >>>>>>>>> Before the vote, I would like to see verification that this truly
> > > >>>>>>>>> enables zero-copy to/from NumPy bool arrays in Python.
> > > >>>>>>>
> > > >>>>>>>> I think this is an implementation issue more than a specification
> > > >>>>>>>> issue...I am not personally worried about any provisions on the
> > > >>>>>>>> specification that might make this impossible.
> > >>>>>>>
> > > >>>>>>> To clarify, what I am looking for here is definite confirmation that
> > > >>>>>>> the proposed representation (in which a signed int8 zero value indicates
> > > >>>>>>> False and any non-zero signed int8 value indicates True) corresponds to
> > > >>>>>>> the representation used by NumPy such that bidirectional zero-copy is
> > > >>>>>>> made possible. This seems to me like a specification issue.
> > >>>>>>>
> > >>>>>>> Ian
> > >>>>>>>
> > >>>>>>> On Wed, Jul 17, 2024 at 9:39 AM Dewey Dunnington
> > >>>>>>> <de...@voltrondata.com.invalid> wrote:
> > >>>>>>>  
> > > >>>>>>>> Thank you for this! I have definitely run across the one-byte-per-item
> > > >>>>>>>> bool in numpy, DuckDB, and cudf. I haven't heard any discussion about
> > > >>>>>>>> DuckDB here but I am fairly sure that they represent their boolean
> > > >>>>>>>> type as an int8 as well [1].
> > >>>>>>>>  
> > > >>>>>>>>> Before the vote, I would like to see verification that this truly
> > > >>>>>>>>> enables zero-copy to/from NumPy bool arrays in Python.
> > > >>>>>>>>
> > > >>>>>>>> I think this is an implementation issue more than a specification
> > >>>>>>>> issue...I am not personally worried about any provisions on the
> > >>>>>>>> specification that might make this impossible.
> > >>>>>>>>
> > >>>>>>>> -dewey
> > >>>>>>>>
> > >>>>>>>> [1]
> > >>>>>>>>  
> > >>>>>>>  
> > >>>>>>  
> > >>>>>  
> > >>>>  
> > >>>  
> > >> https://github.com/duckdb/duckdb/blob/85a82d86aa11a2695fc045deaf4f88fc63dd4fec/src/common/arrow/appender/bool_data.cpp#L28-L37
> > >>   
> > >>>>>>>>
> > > >>>>>>>> On Tue, Jul 16, 2024 at 11:25 AM Antoine Pitrou <anto...@python.org> wrote:
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Hi Joel,
> > >>>>>>>>>
> > > >>>>>>>>> This looks good to me in principle. Can you split the spec and the
> > > >>>>>>>>> implementation(s) into separate PRs?
> > >>>>>>>>>
> > >>>>>>>>> Regards
> > >>>>>>>>>
> > >>>>>>>>> Antoine.
> > >>>>>>>>>
> > >>>>>>>>>
> > > >>>>>>>>> On 16/07/2024 at 13:18, Joel Lubinitsky wrote:
> > >>>>>>>>>> Hi Arrow devs,
> > >>>>>>>>>>
> > > >>>>>>>>>> I'm working on adding an extension type for 8-bit booleans, and wanted to
> > > >>>>>>>>>> start a discussion about it here because it could be valuable to others if
> > > >>>>>>>>>> adopted as a canonical extension type.
> > >>>>>>>>>>
> > > >>>>>>>>>> The native implementation of the Boolean type uses 1 bit to encode each
> > > >>>>>>>>>> value, enabling a very compact representation. This is favorable for many
> > > >>>>>>>>>> workloads, but lots of systems that want to produce/consume Boolean arrays
> > > >>>>>>>>>> use an 8-bit representation internally and are forced to copy/convert at
> > > >>>>>>>>>> their periphery. For these scenarios where zero-copy compatibility is
> > > >>>>>>>>>> important, the 8-bit representation of boolean values may be preferred.
> > > >>>>>>>>>> This can benefit interactions with existing libraries that avoid packing
> > > >>>>>>>>>> column data like 1-bit booleans for parallelization purposes, including GPU
> > > >>>>>>>>>> libraries such as libcudf. The original issue [1] identifies numpy
> > > >>>>>>>>>> conversion as a specific use-case as well.
> > >>>>>>>>>>
> > > >>>>>>>>>> The details of the extension type can be found in the draft PR [2] which
> > > >>>>>>>>>> contains a Go implementation (WIP) and an update to the documentation for
> > > >>>>>>>>>> canonical extension types. I plan to add a C++ implementation as well but
> > > >>>>>>>>>> wanted to open this discussion first.
> > >>>>>>>>>>
> > > >>>>>>>>>> A quick overview of the layout / semantics proposed in the PR:
> > >>>>>>>>>> Storage Type: Int8
> > >>>>>>>>>> Value Semantics: 0 == false, any non-zero value is true
> > >>>>>>>>>>
> > > >>>>>>>>>> I'd appreciate any feedback here or on the PR. If this all seems reasonable
> > > >>>>>>>>>> then I'll move forward with the next implementation and open up another
> > > >>>>>>>>>> proposal for a formal vote. Thanks!
> > >>>>>>>>>>
> > >>>>>>>>>> [1]: https://github.com/apache/arrow/issues/17682
> > >>>>>>>>>> [2]: https://github.com/apache/arrow/pull/43234
> > >>>>>>>>>>  
> > >>>>>>>>  
> > >>>>>>>  
> > >>>>>>  
> > >>>>>  
> > >>>>  
> > >>>  
> > >>  
> > >  
> 


