Re: [Vote][Format] C Data Interface Format string for REE
+1 (binding) On Wed, Aug 16, 2023 at 9:05 AM Jacob Quinn wrote: > > +1 (binding) > > Cheers, > > -Jacob > > On Wed, Aug 16, 2023 at 8:16 AM Matt Topol > wrote: > > > Hey All, > > > > As proposed by Felipe [1] I'm starting a vote on the proposed update to the > > Format Spec of adding "+r" as the format string for passing Run-End Encoded > > arrays through the Arrow C Data Interface. > > > > A PR containing an update to the C++ Arrow implementation to add support > > for this format string along with documentation updates can be found here > > [2]. > > > > The vote will be open for at least 72 hours. > > > > [ ] +1 - I'm in favor of this new C Data Format string > > [ ] +0 > > [ ] -1 - I'm against adding this new format string because > > > > Thanks everyone! > > > > --Matt > > > > [1]: https://lists.apache.org/thread/smco2mcmw2ob2msoyo84wd4oz8z5f781 > > [2]: https://github.com/apache/arrow/pull/37174 > >
Re: [DISCUSS][Arrow] Extension metadata encoding design
I realize it's a not-insignificant change, and I'm not (yet) proposing such a change without more discussion and thought into the consequences. But, I don't think this would actually break any protocol, so I don't want to prematurely preclude this as a possible future direction.

My understanding is that Arrow implementations are required to preserve and pass along all unknown metadata. So if extensions started adding/reading extra metadata that would all just be handled transparently by any non-extension-aware code as it is today with no changes.

In fact, as far as I can tell, there's not really anything precluding us from doing this today in our project internally -- the only limitation is that we can't make use of any existing library ExtensionType implementations and just have to implement our serializer / deserializer logic on top of the lower-level/raw APIs where we have access to the metadata.

If we were to make such a change in the official implementation though, from a backwards-compatibility perspective, any existing extensions would continue using the "ARROW:extension:metadata" key, so legacy extension code would continue to be protocol-compatible. And from a code-migration perspective that could even continue to be passed through as a pre-extracted string with a combined interface like:

```
arrow::Result<std::shared_ptr<DataType>> Deserialize(
    std::shared_ptr<DataType> storage_type,
    const std::string& serialized_data,
    std::shared_ptr<const KeyValueMetadata> metadata) const;
```

So that migrating any existing extensions would just be a matter of adding an unused parameter to their interface when updating to whatever version of arrow enabled this feature.

I think the only material consequence would be that new extension types that use this feature would end up with a minimum-version of the arrow library they are compatible with.
But that's not really any different than it is today -- no implementation is required to support any specific extension and extensions that have an internal dependency on new features obviously can't be used with old versions of the library.

Best,
Jeremy

On Wed, Aug 16, 2023 at 5:53 PM Antoine Pitrou wrote:
> Hmm, you're right that letting the extension type peek at the entire metadata values would have been another solution.
>
> That said, for protocol compatibility reasons, we cannot easily change this anymore.
>
> Regards
>
> Antoine.
>
> On 16/08/2023 at 17:48, Jeremy Leibs wrote:
> > Thanks for the context, Antoine.
> >
> > However, even in those examples, I don't really see how coercing the metadata to a single string makes much of a difference.
> > I believe the main difference of what I'm proposing would be that the ExtensionType::Deserialize interface:
> > https://github.com/apache/arrow/blob/main/r/src/extension.h#L49-L51
> >
> > Would instead look like:
> > ```
> > arrow::Result<std::shared_ptr<DataType>> Deserialize(
> >     std::shared_ptr<DataType> storage_type,
> >     std::shared_ptr<const KeyValueMetadata> metadata) const;
> > ```
> >
> > In both of those cases though it seems like a valid std::shared_ptr<const KeyValueMetadata> is available to be passed to the extension.
> >
> > I suspect the more challenging case might be related to DataType equality checks? It would not be possible for generic code to know whether it can validly do things like concatenate two extension arrays without knowledge of which metadata keys are relevant to the extension. That said, with the current ad hoc serialization of metadata to a string, different encoder-implementations might still produce non-comparable strings, resulting in falsely reported datatype mismatches, but at least avoiding the case of false positives.
> >
> > On Wed, Aug 16, 2023 at 5:19 PM Antoine Pitrou wrote:
> > > Hi Jeremy,
> > >
> > > A single key makes it easier for generic code to recreate extension types it does not know about.
> > >
> > > Here is an example in the C++ IPC layer:
> > > https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/ipc/metadata_internal.cc#L823-L845
> > >
> > > Here is similar logic in the C++ bridge for the C Data Interface:
> > > https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/c/bridge.cc#L1021-L1029
> > >
> > > It is probably expected that many extension types will be parameter-less (such as UUID, JSON, BSON...).
> > >
> > > It does imply that extension types with sophisticated parameterization must implement a custom (de)serialization mechanism themselves. I'm not sure this tradeoff was discussed at the time, perhaps other people (Wes? Jacques?) may chime in.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > > On 16/08/2023 at 16:32, Jeremy Leibs wrote:
> > > > Hello,
> > > >
> > > > I've recently started working with extension types as part of our project and I was surprised to discover that extension types are required to pack all of their own metadata into a single string value
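For illustration, here is a minimal sketch of the migration path described in the reply above, where a combined `Deserialize` receives both the pre-extracted string and the full metadata. It is not Arrow code: `DataType`, `KeyValueMetadata`, `ExtensionType`, `LegacyType` and `NewStyleType` are hypothetical stand-ins so that the example compiles on its own, and the returned type name is just a string built for demonstration.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for the Arrow classes; NOT the real arrow:: types.
struct DataType {
  std::string name;
};
using KeyValueMetadata = std::map<std::string, std::string>;

// Combined interface as proposed in the thread: the legacy pre-extracted
// string plus the whole key-value metadata.
struct ExtensionType {
  virtual ~ExtensionType() = default;
  virtual std::shared_ptr<DataType> Deserialize(
      std::shared_ptr<DataType> storage_type,
      const std::string& serialized_data,
      std::shared_ptr<const KeyValueMetadata> metadata) const = 0;
};

// A legacy extension migrates by accepting, and ignoring, the new parameter.
struct LegacyType : ExtensionType {
  std::shared_ptr<DataType> Deserialize(
      std::shared_ptr<DataType> storage_type,
      const std::string& serialized_data,
      std::shared_ptr<const KeyValueMetadata> /*metadata*/) const override {
    return std::make_shared<DataType>(
        DataType{"legacy<" + storage_type->name + ">(" + serialized_data + ")"});
  }
};

// A new-style extension ignores the packed string and reads its own
// namespaced key (the key name is taken from the example in the thread).
struct NewStyleType : ExtensionType {
  std::shared_ptr<DataType> Deserialize(
      std::shared_ptr<DataType> storage_type,
      const std::string& /*serialized_data*/,
      std::shared_ptr<const KeyValueMetadata> metadata) const override {
    std::string param = "unknown";
    if (metadata) {
      auto it = metadata->find("myorg:myextension:meta1");
      if (it != metadata->end()) param = it->second;
    }
    return std::make_shared<DataType>(
        DataType{"new<" + storage_type->name + ">(" + param + ")"});
  }
};
```

Under these assumptions, existing extensions keep working off the `"ARROW:extension:metadata"` string, while only new extensions depend on the extra parameter.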
Re: [VOTE] Apache Arrow ADBC (API) 1.1.0
+1 (binding)

On 14/08/2023 at 19:39, David Li wrote:
> Hello,
>
> We have been discussing revisions [1] to the ADBC APIs, which we formerly decided to treat as a specification [2]. These revisions clean up various missing features (e.g. cancellation, error metadata) and better position ADBC to help different data systems interoperate (e.g. by exposing more metadata, like table/column statistics).
>
> For details, see the PR at [3]. (The main file to read through is adbc.h.)
>
> I would like to propose that the Arrow project adopt this RFC, along with the linked PR, as version 1.1.0 of the ADBC API standard.
>
> Please vote to adopt the specification as described above. This is not a vote to release any packages; the first package release to support version 1.1.0 of the APIs will be 0.7.0 of the packages. (So I will not merge the linked PR until after we release ADBC 0.6.0.)
>
> This vote will be open for at least 72 hours.
>
> [ ] +1 Adopt the ADBC 1.1.0 specification
> [ ] 0
> [ ] -1 Do not adopt the specification because...
>
> Thanks to Sutou Kouhei, Matt Topol, Dewey Dunnington, Antoine Pitrou, Will Ayd, and Will Jones for feedback on the design and various work-in-progress PRs.
>
> [1]: https://github.com/apache/arrow-adbc/milestone/3
> [2]: https://lists.apache.org/thread/s8m4l9hccfh5kqvvd2x3gxn3ry0w1ryo
> [3]: https://github.com/apache/arrow-adbc/pull/971
>
> Thank you,
> David
Re: [Vote][Format] C Data Interface Format string for REE
+1 (binding) Cheers, -Jacob On Wed, Aug 16, 2023 at 8:16 AM Matt Topol wrote: > Hey All, > > As proposed by Felipe [1] I'm starting a vote on the proposed update to the > Format Spec of adding "+r" as the format string for passing Run-End Encoded > arrays through the Arrow C Data Interface. > > A PR containing an update to the C++ Arrow implementation to add support > for this format string along with documentation updates can be found here > [2]. > > The vote will be open for at least 72 hours. > > [ ] +1 - I'm in favor of this new C Data Format string > [ ] +0 > [ ] -1 - I'm against adding this new format string because > > Thanks everyone! > > --Matt > > [1]: https://lists.apache.org/thread/smco2mcmw2ob2msoyo84wd4oz8z5f781 > [2]: https://github.com/apache/arrow/pull/37174 >
Re: [Vote][Format] C Data Interface Format string for REE
+1 (binding)

On 16/08/2023 16:58, Matt Topol wrote:
> > It would be nice to get approval from authors of other implementations such as Rust, C#, Javascript...
>
> I'm hoping that some of them see this and participate in the vote. *crosses fingers*
>
> On Wed, Aug 16, 2023 at 11:10 AM Antoine Pitrou wrote:
> > +1 from me (binding).
> >
> > It would be nice to get approval from authors of other implementations such as Rust, C#, Javascript...
> >
> > Thanks for doing this!
> >
> > On 16/08/2023 at 16:16, Matt Topol wrote:
> > > Hey All,
> > >
> > > As proposed by Felipe [1] I'm starting a vote on the proposed update to the Format Spec of adding "+r" as the format string for passing Run-End Encoded arrays through the Arrow C Data Interface.
> > >
> > > A PR containing an update to the C++ Arrow implementation to add support for this format string along with documentation updates can be found here [2].
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 - I'm in favor of this new C Data Format string
> > > [ ] +0
> > > [ ] -1 - I'm against adding this new format string because
> > >
> > > Thanks everyone!
> > >
> > > --Matt
> > >
> > > [1]: https://lists.apache.org/thread/smco2mcmw2ob2msoyo84wd4oz8z5f781
> > > [2]: https://github.com/apache/arrow/pull/37174
Re: [Vote][Format] C Data Interface Format string for REE
> It would be nice to get approval from authors of other implementations such as Rust, C#, Javascript...

I'm hoping that some of them see this and participate in the vote. *crosses fingers*

On Wed, Aug 16, 2023 at 11:10 AM Antoine Pitrou wrote:
> +1 from me (binding).
>
> It would be nice to get approval from authors of other implementations such as Rust, C#, Javascript...
>
> Thanks for doing this!
>
> On 16/08/2023 at 16:16, Matt Topol wrote:
> > Hey All,
> >
> > As proposed by Felipe [1] I'm starting a vote on the proposed update to the Format Spec of adding "+r" as the format string for passing Run-End Encoded arrays through the Arrow C Data Interface.
> >
> > A PR containing an update to the C++ Arrow implementation to add support for this format string along with documentation updates can be found here [2].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 - I'm in favor of this new C Data Format string
> > [ ] +0
> > [ ] -1 - I'm against adding this new format string because
> >
> > Thanks everyone!
> >
> > --Matt
> >
> > [1]: https://lists.apache.org/thread/smco2mcmw2ob2msoyo84wd4oz8z5f781
> > [2]: https://github.com/apache/arrow/pull/37174
Re: [DISCUSS][Arrow] Extension metadata encoding design
Hmm, you're right that letting the extension type peek at the entire metadata values would have been another solution.

That said, for protocol compatibility reasons, we cannot easily change this anymore.

Regards

Antoine.

On 16/08/2023 at 17:48, Jeremy Leibs wrote:
> Thanks for the context, Antoine.
>
> However, even in those examples, I don't really see how coercing the metadata to a single string makes much of a difference.
> I believe the main difference of what I'm proposing would be that the ExtensionType::Deserialize interface:
> https://github.com/apache/arrow/blob/main/r/src/extension.h#L49-L51
>
> Would instead look like:
> ```
> arrow::Result<std::shared_ptr<DataType>> Deserialize(
>     std::shared_ptr<DataType> storage_type,
>     std::shared_ptr<const KeyValueMetadata> metadata) const;
> ```
>
> In both of those cases though it seems like a valid std::shared_ptr<const KeyValueMetadata> is available to be passed to the extension.
>
> I suspect the more challenging case might be related to DataType equality checks? It would not be possible for generic code to know whether it can validly do things like concatenate two extension arrays without knowledge of which metadata keys are relevant to the extension. That said, with the current ad hoc serialization of metadata to a string, different encoder-implementations might still produce non-comparable strings, resulting in falsely reported datatype mismatches, but at least avoiding the case of false positives.
>
> On Wed, Aug 16, 2023 at 5:19 PM Antoine Pitrou wrote:
> > Hi Jeremy,
> >
> > A single key makes it easier for generic code to recreate extension types it does not know about.
> >
> > Here is an example in the C++ IPC layer:
> > https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/ipc/metadata_internal.cc#L823-L845
> >
> > Here is similar logic in the C++ bridge for the C Data Interface:
> > https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/c/bridge.cc#L1021-L1029
> >
> > It is probably expected that many extension types will be parameter-less (such as UUID, JSON, BSON...).
> >
> > It does imply that extension types with sophisticated parameterization must implement a custom (de)serialization mechanism themselves. I'm not sure this tradeoff was discussed at the time, perhaps other people (Wes? Jacques?) may chime in.
> >
> > Regards
> >
> > Antoine.
> >
> > On 16/08/2023 at 16:32, Jeremy Leibs wrote:
> > > Hello,
> > >
> > > I've recently started working with extension types as part of our project and I was surprised to discover that extension types are required to pack all of their own metadata into a single string value of the "ARROW:extension:metadata" key.
> > >
> > > In turn this then means we have to endure arbitrary unstructured / hard-to-validate strings with custom encodings (e.g. JSON inside flatbuffer) when dealing with extensions.
> > >
> > > Can anyone provide some context on the rationale for this design decision?
> > >
> > > Given that we already have (1) a perfectly good metadata keyvalue store already in place, and (2) established recommendations for namespaced scoping of keys, why would we not just use that to store the metadata for the extension. For example:
> > >
> > > "ARROW:extension:name": "myorg.myextension",
> > > "myorg:myextension:meta1": "value1",
> > > "myorg:myextension:meta2": "value2",
> > >
> > > Thanks for any insights,
> > > Jeremy
Re: [DISCUSS][Arrow] Extension metadata encoding design
Thanks for the context, Antoine.

However, even in those examples, I don't really see how coercing the metadata to a single string makes much of a difference.
I believe the main difference of what I'm proposing would be that the ExtensionType::Deserialize interface:
https://github.com/apache/arrow/blob/main/r/src/extension.h#L49-L51

Would instead look like:
```
arrow::Result<std::shared_ptr<DataType>> Deserialize(
    std::shared_ptr<DataType> storage_type,
    std::shared_ptr<const KeyValueMetadata> metadata) const;
```

In both of those cases though it seems like a valid std::shared_ptr<const KeyValueMetadata> is available to be passed to the extension.

I suspect the more challenging case might be related to DataType equality checks? It would not be possible for generic code to know whether it can validly do things like concatenate two extension arrays without knowledge of which metadata keys are relevant to the extension. That said, with the current ad hoc serialization of metadata to a string, different encoder-implementations might still produce non-comparable strings, resulting in falsely reported datatype mismatches, but at least avoiding the case of false positives.

On Wed, Aug 16, 2023 at 5:19 PM Antoine Pitrou wrote:
> Hi Jeremy,
>
> A single key makes it easier for generic code to recreate extension types it does not know about.
>
> Here is an example in the C++ IPC layer:
> https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/ipc/metadata_internal.cc#L823-L845
>
> Here is similar logic in the C++ bridge for the C Data Interface:
> https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/c/bridge.cc#L1021-L1029
>
> It is probably expected that many extension types will be parameter-less (such as UUID, JSON, BSON...).
>
> It does imply that extension types with sophisticated parameterization must implement a custom (de)serialization mechanism themselves. I'm not sure this tradeoff was discussed at the time, perhaps other people (Wes? Jacques?) may chime in.
>
> Regards
>
> Antoine.
>
> On 16/08/2023 at 16:32, Jeremy Leibs wrote:
> > Hello,
> >
> > I've recently started working with extension types as part of our project and I was surprised to discover that extension types are required to pack all of their own metadata into a single string value of the "ARROW:extension:metadata" key.
> >
> > In turn this then means we have to endure arbitrary unstructured / hard-to-validate strings with custom encodings (e.g. JSON inside flatbuffer) when dealing with extensions.
> >
> > Can anyone provide some context on the rationale for this design decision?
> >
> > Given that we already have (1) a perfectly good metadata keyvalue store already in place, and (2) established recommendations for namespaced scoping of keys, why would we not just use that to store the metadata for the extension. For example:
> >
> > "ARROW:extension:name": "myorg.myextension",
> > "myorg:myextension:meta1": "value1",
> > "myorg:myextension:meta2": "value2",
> >
> > Thanks for any insights,
> > Jeremy
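The equality concern raised in the message above can be made concrete with a small sketch. It assumes metadata is modelled as a plain `std::map`, and `RelevantKeysEqual` with its `relevant_keys` argument is a hypothetical hook that an extension would have to supply; nothing here is an existing Arrow API.

```cpp
#include <map>
#include <string>
#include <vector>

using KeyValueMetadata = std::map<std::string, std::string>;

// Status quo: generic code can only compare the opaque serialized strings.
// Two encoders that order JSON fields differently yield a false mismatch.
bool SerializedEqual(const std::string& a, const std::string& b) {
  return a == b;
}

// Key-value alternative: compare only the keys the extension declares as
// relevant. Generic code cannot know this list without the extension's help.
bool RelevantKeysEqual(const KeyValueMetadata& a, const KeyValueMetadata& b,
                       const std::vector<std::string>& relevant_keys) {
  for (const auto& key : relevant_keys) {
    auto ia = a.find(key);
    auto ib = b.find(key);
    const bool has_a = ia != a.end();
    const bool has_b = ib != b.end();
    if (has_a != has_b) return false;                      // present on one side only
    if (has_a && ia->second != ib->second) return false;   // differing values
  }
  return true;
}
```

With the string form, mismatches can be reported for equivalent types but never the reverse; with the key-value form, correctness depends entirely on knowing which keys matter.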
Re: [DISCUSS][Arrow] Extension metadata encoding design
Hi Jeremy,

A single key makes it easier for generic code to recreate extension types it does not know about.

Here is an example in the C++ IPC layer:
https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/ipc/metadata_internal.cc#L823-L845

Here is similar logic in the C++ bridge for the C Data Interface:
https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/c/bridge.cc#L1021-L1029

It is probably expected that many extension types will be parameter-less (such as UUID, JSON, BSON...).

It does imply that extension types with sophisticated parameterization must implement a custom (de)serialization mechanism themselves. I'm not sure this tradeoff was discussed at the time, perhaps other people (Wes? Jacques?) may chime in.

Regards

Antoine.

On 16/08/2023 at 16:32, Jeremy Leibs wrote:
> Hello,
>
> I've recently started working with extension types as part of our project and I was surprised to discover that extension types are required to pack all of their own metadata into a single string value of the "ARROW:extension:metadata" key.
>
> In turn this then means we have to endure arbitrary unstructured / hard-to-validate strings with custom encodings (e.g. JSON inside flatbuffer) when dealing with extensions.
>
> Can anyone provide some context on the rationale for this design decision?
>
> Given that we already have (1) a perfectly good metadata keyvalue store already in place, and (2) established recommendations for namespaced scoping of keys, why would we not just use that to store the metadata for the extension. For example:
>
> "ARROW:extension:name": "myorg.myextension",
> "myorg:myextension:meta1": "value1",
> "myorg:myextension:meta2": "value2",
>
> Thanks for any insights,
> Jeremy
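As a rough sketch of why a single key helps generic code, the function below resolves a field's type from just the two reserved keys. It is not the linked IPC or bridge code: types are represented as plain strings, and `Registry`, `Deserializer` and `ResolveType` are hypothetical names introduced only for this example.

```cpp
#include <functional>
#include <map>
#include <string>

using KeyValueMetadata = std::map<std::string, std::string>;

// Hypothetical registry mapping an extension name to a deserializer that
// takes the storage type name and the opaque serialized string.
using Deserializer =
    std::function<std::string(const std::string&, const std::string&)>;
using Registry = std::map<std::string, Deserializer>;

// Generic code needs only two fixed keys; it never has to know which other
// metadata keys belong to the extension.
std::string ResolveType(const std::string& storage_type,
                        const KeyValueMetadata& metadata,
                        const Registry& registry) {
  auto name = metadata.find("ARROW:extension:name");
  if (name == metadata.end()) return storage_type;  // not an extension field
  auto entry = registry.find(name->second);
  if (entry == registry.end()) return storage_type;  // unknown: keep storage type
  auto data = metadata.find("ARROW:extension:metadata");
  return entry->second(storage_type,
                       data == metadata.end() ? std::string() : data->second);
}
```

When the extension is not registered, the sketch falls back to the storage type and leaves the metadata untouched, so it can still be passed along.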
Re: [Vote][Format] C Data Interface Format string for REE
+1 from me (binding).

It would be nice to get approval from authors of other implementations such as Rust, C#, Javascript...

Thanks for doing this!

On 16/08/2023 at 16:16, Matt Topol wrote:
> Hey All,
>
> As proposed by Felipe [1] I'm starting a vote on the proposed update to the Format Spec of adding "+r" as the format string for passing Run-End Encoded arrays through the Arrow C Data Interface.
>
> A PR containing an update to the C++ Arrow implementation to add support for this format string along with documentation updates can be found here [2].
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 - I'm in favor of this new C Data Format string
> [ ] +0
> [ ] -1 - I'm against adding this new format string because
>
> Thanks everyone!
>
> --Matt
>
> [1]: https://lists.apache.org/thread/smco2mcmw2ob2msoyo84wd4oz8z5f781
> [2]: https://github.com/apache/arrow/pull/37174
[DISCUSS][Arrow] Extension metadata encoding design
Hello,

I've recently started working with extension types as part of our project and I was surprised to discover that extension types are required to pack all of their own metadata into a single string value of the "ARROW:extension:metadata" key.

In turn this then means we have to endure arbitrary unstructured / hard-to-validate strings with custom encodings (e.g. JSON inside flatbuffer) when dealing with extensions.

Can anyone provide some context on the rationale for this design decision?

Given that we already have (1) a perfectly good metadata keyvalue store already in place, and (2) established recommendations for namespaced scoping of keys, why would we not just use that to store the metadata for the extension. For example:

"ARROW:extension:name": "myorg.myextension",
"myorg:myextension:meta1": "value1",
"myorg:myextension:meta2": "value2",

Thanks for any insights,
Jeremy
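To illustrate the namespaced-key layout suggested above, here is a minimal sketch of how an extension might pull its own parameters out of the shared key-value store. The helper `ExtractNamespaced` and the use of `std::map` as the metadata container are assumptions made for this example, not part of any Arrow library.

```cpp
#include <map>
#include <string>

using KeyValueMetadata = std::map<std::string, std::string>;

// Collects the entries whose key starts with `prefix` and strips the prefix,
// leaving only the extension's own parameters.
KeyValueMetadata ExtractNamespaced(const KeyValueMetadata& metadata,
                                   const std::string& prefix) {
  KeyValueMetadata out;
  for (const auto& entry : metadata) {
    const std::string& key = entry.first;
    if (key.compare(0, prefix.size(), prefix) == 0) {
      out[key.substr(prefix.size())] = entry.second;
    }
  }
  return out;
}
```

Applied to the three keys in the example with the prefix `"myorg:myextension:"`, this would yield `meta1` and `meta2` and skip `"ARROW:extension:name"`.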
Re: [Vote][Format] C Data Interface Format string for REE
+1 On Wed, Aug 16, 2023, at 10:21, Dewey Dunnington wrote: > +1! Looking forward to implementing this in nanoarrow! > > On Wed, Aug 16, 2023 at 11:18 AM Ian Cook wrote: >> >> +1 (non-binding) >> >> On Wed, Aug 16, 2023 at 10:16 AM Matt Topol >> wrote: >> > >> > Hey All, >> > >> > As proposed by Felipe [1] I'm starting a vote on the proposed update to the >> > Format Spec of adding "+r" as the format string for passing Run-End Encoded >> > arrays through the Arrow C Data Interface. >> > >> > A PR containing an update to the C++ Arrow implementation to add support >> > for this format string along with documentation updates can be found here >> > [2]. >> > >> > The vote will be open for at least 72 hours. >> > >> > [ ] +1 - I'm in favor of this new C Data Format string >> > [ ] +0 >> > [ ] -1 - I'm against adding this new format string because >> > >> > Thanks everyone! >> > >> > --Matt >> > >> > [1]: https://lists.apache.org/thread/smco2mcmw2ob2msoyo84wd4oz8z5f781 >> > [2]: https://github.com/apache/arrow/pull/37174
Re: [Vote][Format] C Data Interface Format string for REE
+1! Looking forward to implementing this in nanoarrow! On Wed, Aug 16, 2023 at 11:18 AM Ian Cook wrote: > > +1 (non-binding) > > On Wed, Aug 16, 2023 at 10:16 AM Matt Topol > wrote: > > > > Hey All, > > > > As proposed by Felipe [1] I'm starting a vote on the proposed update to the > > Format Spec of adding "+r" as the format string for passing Run-End Encoded > > arrays through the Arrow C Data Interface. > > > > A PR containing an update to the C++ Arrow implementation to add support > > for this format string along with documentation updates can be found here > > [2]. > > > > The vote will be open for at least 72 hours. > > > > [ ] +1 - I'm in favor of this new C Data Format string > > [ ] +0 > > [ ] -1 - I'm against adding this new format string because > > > > Thanks everyone! > > > > --Matt > > > > [1]: https://lists.apache.org/thread/smco2mcmw2ob2msoyo84wd4oz8z5f781 > > [2]: https://github.com/apache/arrow/pull/37174
Re: [Vote][Format] C Data Interface Format string for REE
+1 (non-binding) On Wed, Aug 16, 2023 at 10:16 AM Matt Topol wrote: > > Hey All, > > As proposed by Felipe [1] I'm starting a vote on the proposed update to the > Format Spec of adding "+r" as the format string for passing Run-End Encoded > arrays through the Arrow C Data Interface. > > A PR containing an update to the C++ Arrow implementation to add support > for this format string along with documentation updates can be found here > [2]. > > The vote will be open for at least 72 hours. > > [ ] +1 - I'm in favor of this new C Data Format string > [ ] +0 > [ ] -1 - I'm against adding this new format string because > > Thanks everyone! > > --Matt > > [1]: https://lists.apache.org/thread/smco2mcmw2ob2msoyo84wd4oz8z5f781 > [2]: https://github.com/apache/arrow/pull/37174
[Vote][Format] C Data Interface Format string for REE
Hey All,

As proposed by Felipe [1] I'm starting a vote on the proposed update to the Format Spec of adding "+r" as the format string for passing Run-End Encoded arrays through the Arrow C Data Interface.

A PR containing an update to the C++ Arrow implementation to add support for this format string along with documentation updates can be found here [2].

The vote will be open for at least 72 hours.

[ ] +1 - I'm in favor of this new C Data Format string
[ ] +0
[ ] -1 - I'm against adding this new format string because

Thanks everyone!

--Matt

[1]: https://lists.apache.org/thread/smco2mcmw2ob2msoyo84wd4oz8z5f781
[2]: https://github.com/apache/arrow/pull/37174
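For readers unfamiliar with the C Data Interface, the sketch below shows how a consumer might recognize the proposed `"+r"` format string. The `ArrowSchema` struct follows the layout published in the C Data Interface specification; the helper `IsRunEndEncoded` and the assumption that a run-end encoded schema carries exactly two children (run ends and values, per the columnar format's REE layout) are additions made for illustration and are not taken from the linked PR.

```cpp
#include <cstdint>
#include <cstring>

// Schema struct as laid out by the Arrow C Data Interface specification.
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  std::int64_t flags;
  std::int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

// Hypothetical helper: true if the schema uses the "+r" format string and
// has the two children expected for run ends and values.
bool IsRunEndEncoded(const ArrowSchema& schema) {
  return schema.format != nullptr && std::strcmp(schema.format, "+r") == 0 &&
         schema.n_children == 2 && schema.children != nullptr;
}
```

A producer would set `format` to `"+r"` on the parent schema and describe the run-ends and values types through the two child schemas.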
Arrow community meeting August 16 at 16:00 UTC
Our next biweekly Arrow community meeting is today at 16:00 UTC / 12:00 EDT. Zoom meeting URL: https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09 Meeting ID: 876 4903 3008 Passcode: 958092 Meeting notes will be captured in this Google Doc: https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/ If you plan to attend this meeting, you are welcome to edit the document to add the topics that you would like to discuss. Thanks, Ian