RE: Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-29 Thread Vyas Ramasubramani
Matt already mentioned this earlier (thanks Matt!), but I wanted to add another voice from RAPIDS saying that the new representation should work fine for libcudf and would certainly be helpful. On 2024/07/25 13:48:32 Joel Lubinitsky wrote: > Thank you everyone for contributing to this discussion

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-25 Thread Joel Lubinitsky
Thank you everyone for contributing to this discussion. I'd like to summarize where I think we've landed at this point: - After considering pros/cons of first-class vs canonical extension type and historical precedent, adopting Bool8 as a canonical extension type seems reasonable for this proposal

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-22 Thread Wes McKinney
From a historical perspective, if we had had extension types / canonical extension types, it would have made more sense to have the millisecond dates as an extension type. The goal of having the extra type was to avoid an unnecessary serialization in systems where there is a benefit to moving dat

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-22 Thread Antoine Pitrou
On 22/07/2024 at 21:25, Joel Lubinitsky wrote: If Canonical Extensions had existed at the time, I think there's a chance we may have ended up with int32 Date as a first class type and int64 MillisecondDate as a Canonical Extension type. Agreed. Are there any lessons we've learned from im

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-22 Thread Joel Lubinitsky
> As a counterpoint we have two different units for date [1] that don't > really convey meaningful new information. Implementations have to deal > with this somehow, and I think the only reason this exists is effectively > to support different type systems. Thanks for raising this Micah, The pre

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-19 Thread Antoine Pitrou
On Fri, 19 Jul 2024 10:03:48 -0300 Dewey Dunnington wrote: > The extension-ness of it is a valid point...all the other cases where > we have multiple Arrow types for the same element type (e.g., String, > LargeString, StringView) are first-class types. For a Bool8, the > tradeoffs are roughly the

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-19 Thread Micah Kornfield
I'm fine in principle with this being an extension type; I just want to make sure we had this conversation. Some replies inline. > I think it would confuse implementors of the spec and people implementing > kernels way too much. “the bool Arrow type” should probably not start > meaning two different

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-19 Thread Dewey Dunnington
The extension-ness of it is a valid point...all the other cases where we have multiple Arrow types for the same element type (e.g., String, LargeString, StringView) are first-class types. For a Bool8, the tradeoffs are roughly the same (less support for StringView and LargeString, more space requir

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-19 Thread Antoine Pitrou
Agreed with Felipe. This is meant for communicating with non-Arrow type systems, but shouldn't be regarded as an alternative first-class boolean type. Regards Antoine. On 19/07/2024 at 06:30, Felipe Oliveira Carvalho wrote: I think it would confuse implementors of the spec and people i

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-18 Thread Felipe Oliveira Carvalho
I think it would confuse implementors of the spec and people implementing kernels way too much. “the bool Arrow type” should probably not start meaning two different things. — Felipe On Fri, 19 Jul 2024 at 01:26 Micah Kornfield wrote: > As Boolean is already in the arrow type system I think it

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-18 Thread Micah Kornfield
As Boolean is already in the arrow type system I think it might be worth asking the question as to whether this should be an extension type or a first class type. Given what I think of the last discussion on the trade-offs [1], I think there is room for debate here, since Boolean is not currently

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-17 Thread Alenka Frim
Thank you Joel for working on this! I have also come across the need for byte-packed boolean support when implementing the Python dataframe interchange protocol and also DLPack, which is implemented in Arrow C++. The extension type is a great solution. I will comment on the PR if I have any questi

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-17 Thread Ian Cook
Thanks Joel and Matt. This looks good to me. I think it's worth saying here that Arrow-producing components should still by default emit Booleans in the standard bit-packed Arrow layout. This proposed bool8 canonical extension type is intended to be used in applications where the producer knows th

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-17 Thread Matt Topol
Just chiming in that the libcudf documentation[1] states that this proposal should work just fine. Bool8 type is described as "0 == false, else true". --Matt [1]: https://docs.rapids.ai/api/libcudf/stable/group__utility__types#gadf077607da617d1dadcc5417e2783539 On Wed, Jul 17, 2024, 3:18 PM Joel
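
To illustrate the "0 == false, else true" rule quoted above, here is a minimal sketch in plain Python. The helper name bool8_to_bools is hypothetical and is not part of Arrow, libcudf, or the proposal; it only shows how a consumer might interpret a byte-per-value buffer.

```python
from typing import List


def bool8_to_bools(buf: bytes) -> List[bool]:
    """Interpret each byte as one boolean: zero is false, anything else is true.

    Hypothetical helper for illustration only.
    """
    return [b != 0 for b in buf]


# One byte per value; values other than 0 and 1 still read as true.
values = bool8_to_bools(bytes([0, 1, 2, 255]))
print(values)
```

Under this reading a consumer does not need producers to normalize true to exactly 1; whether producers should do so anyway is a separate question this sketch does not address.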

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-17 Thread Joel Lubinitsky
Thank you for your comments. I spent some time trying to confirm definitively that this proposal would enable zero copy sharing both ways between pyarrow and numpy. I put together the following gist [1] with my experiment. To summarize the results: - I was able to share the underlying value buffe

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-17 Thread Ian Cook
>> Before the vote, I would like to see verification that this truly enables >> zero-copy to/from NumPy bool arrays in Python. > I think this is an implementation issue more than a specification issue...I am not personally worried about any provisions on the specification that might make this impo

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-17 Thread Dewey Dunnington
Thank you for this! I have definitely run across the one-byte-per-item bool in numpy, DuckDB, and cudf. I haven't heard any discussion about DuckDB here but I am fairly sure that they represent their boolean type as an int8 as well [1]. > Before the vote, I would like to see verification that this

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-16 Thread Antoine Pitrou
Hi Joel, This looks good to me in principle. Can you split the spec and the implementation(s) into separate PRs? Regards Antoine. On 16/07/2024 at 13:18, Joel Lubinitsky wrote: Hi Arrow devs, I'm working on adding an extension type for 8-bit booleans, and wanted to start a discuss

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-16 Thread Ian Cook
Thanks for taking the initiative on this! As demonstrated by [1], the wish for an 8-bit Boolean extension type is long-standing. I think this is a worthwhile addition to Arrow's canonical extension types. Before the vote, I would like to see verification that this truly enables zero-copy to/from

[DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-16 Thread Joel Lubinitsky
Hi Arrow devs, I'm working on adding an extension type for 8-bit booleans, and wanted to start a discussion about it here because it could be valuable to others if adopted as a canonical extension type. The native implementation of the Boolean type uses 1 bit to encode each value, enabling a very
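
As a rough illustration of the two layouts under discussion, the sketch below packs the same values once at 1 bit per value (least-significant bit first, as Arrow's bit-packed Boolean layout does) and once at 1 byte per value. The function names pack_bits and to_bool8 are hypothetical, chosen only for this example, and the code does not use any Arrow library.

```python
def pack_bits(values):
    """Pack booleans at 1 bit per value, least-significant bit first."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def to_bool8(values):
    """Store booleans at 1 byte per value (0 for false, 1 for true)."""
    return bytes(1 if v else 0 for v in values)


values = [True, False, True, True, False, False, False, True, True]
packed = pack_bits(values)    # 9 values fit in 2 bytes
unpacked = to_bool8(values)   # 9 values take 9 bytes
print(len(packed), len(unpacked))
```

The byte-per-value form takes roughly eight times the space, but it matches the in-memory layout that byte-oriented systems already use, so a buffer in that form can in principle be handed over without repacking.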