Re: [PR] feat: Support for Binaryview and StringView types [arrow-nanoarrow]

via GitHub Sun, 21 Jan 2024 11:19:50 -0800


paleolimbot commented on PR #367:
URL: https://github.com/apache/arrow-nanoarrow/pull/367#issuecomment-1902733790


   > Do you know if those types will be actually found in serialized formats, 
namely Parquet and Feather ?
   
   I don't believe it's possible to get a stringview or binaryview from 
Parquet; however, in theory you could get one from Feather or Arrow IPC today 
with Arrow 15. I think that until pyarrow implements a basic level of support 
it's unlikely to actually end up being used. Some day the C++ Parquet scanner 
might be able to return those types but there are very few people working on 
that part of the code base and I think it's unlikely to be implemented any time 
soon.
   
   > Their value proposition compared to regular string or binary is unclear to 
me.
   
   I think the gist of it is that regular string and binary are slow to sort, 
which is why Meta's Velox, DuckDB, and (very recently) Polars have adopted the 
view layout as their primary representation. Arrow added it primarily for 
interchange with those systems (e.g., a user-defined function based on the C 
Data interface), although I agree that it came at the expense of unnecessary 
complexity for 99.9% of Arrow users.
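
   For context on the sorting point: in the Arrow columnar format each view is 
a 16-byte struct holding the length plus either the whole value (12 bytes or 
fewer) or a 4-byte prefix, a buffer index, and an offset. A comparison can 
often be settled from the prefix alone, without reading the data buffer. A 
minimal sketch of that layout using Python's `struct` follows; `make_view` and 
`view_prefix` are illustrative names, not nanoarrow or pyarrow APIs.

```python
import struct

INLINE_MAX = 12  # values of up to 12 bytes live inside the 16-byte view itself


def make_view(value: bytes, buffer_index: int = 0, offset: int = 0) -> bytes:
    """Pack one 16-byte little-endian view for `value` (illustrative helper)."""
    n = len(value)
    if n <= INLINE_MAX:
        # int32 length followed by the value, zero-padded to 12 bytes
        return struct.pack("<i12s", n, value)
    # int32 length, 4-byte prefix, int32 data buffer index, int32 offset
    return struct.pack("<i4sii", n, value[:4], buffer_index, offset)


def view_prefix(view: bytes) -> bytes:
    """Return the first (up to) four bytes of the value from the view alone."""
    (n,) = struct.unpack_from("<i", view, 0)
    return view[4:4 + min(n, 4)]
```

   Two values whose prefixes differ can be ordered by comparing 
`view_prefix()` results; only ties need the full bytes.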
   
   > I should point that the proliferation of basic data types is going to be a 
serious obstacle to adoption by new implementations.
   
   I agree. In theory this is what nanoarrow is designed to help with (when the 
types are supported...so not yet). The focus of the version about to be 
released is testing and stability...0.5.0 is more likely to include features 
that could be used in a fallback sort of way (i.e., maybe 
`ArrowArrayViewGetLogicalXXX()` that would handle the dictionary, 
run-end-encoded, view, or normal cases at the expense of an extra `switch()`).
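
   To make the "fallback" idea concrete, here is a rough sketch in Python 
rather than C. It is not the nanoarrow API: `get_logical` and the dict-based 
array stand-ins are hypothetical, and the view case is left out for brevity. 
The point is only that one accessor can branch on the encoding and hand back 
the logical value.

```python
from bisect import bisect_right


def get_logical(array: dict, i: int):
    """Return the logical value at index i, whatever the encoding (sketch)."""
    kind = array["kind"]
    if kind == "plain":
        return array["values"][i]
    if kind == "dictionary":
        # indices point into a separate dictionary of distinct values
        return array["dictionary"][array["indices"][i]]
    if kind == "run_end_encoded":
        # run_ends[k] is the exclusive logical end of run k, so the run
        # containing i is the first one whose end is greater than i
        run = bisect_right(array["run_ends"], i)
        return array["values"][run]
    raise NotImplementedError(f"unsupported encoding: {kind}")
```

   The extra branch per element is the cost mentioned above; callers that 
only care about logical values would not need to know which encoding the 
producer chose.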
   
   > Perhaps there should be some "data type negociation" mechanism
   
   In Arrow C++, this is most likely to be supported via an option in the 
readers. In Python, the `__arrow_c_schema__()` and 
`__arrow_c_stream__(requested_schema=None)` methods can in theory handle this 
(i.e., you query `__arrow_c_schema__()` and if it contains types you don't 
understand, you pass a different schema to `__arrow_c_stream__()`'s 
`requested_schema`). That said, pyarrow doesn't implement `requested_schema` 
yet, but it also doesn't really implement the view types either.
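
   The negotiation flow described above could look roughly like the sketch 
below. It is a simplification under loud assumptions: real producers return 
PyCapsules wrapping `ArrowSchema` / `ArrowArrayStream` structs, whereas here 
a schema is just a list of C Data Interface format strings and `FakeProducer` 
and `read_without_views` are hypothetical names. As far as I understand the 
protocol, `requested_schema` is best-effort, so a consumer should still check 
what it actually receives.

```python
# C Data Interface format strings: "vu" = utf8 view, "u" = utf8,
# "vz" = binary view, "z" = binary.
VIEW_FALLBACKS = {"vu": "u", "vz": "z"}


class FakeProducer:
    """Hypothetical stand-in; real producers return PyCapsules, not lists."""

    def __init__(self, formats, rows):
        self._formats = list(formats)
        self._rows = rows

    def __arrow_c_schema__(self):
        return list(self._formats)

    def __arrow_c_stream__(self, requested_schema=None):
        # This toy producer always honours the request; a real one may not.
        schema = self._formats if requested_schema is None else requested_schema
        return list(schema), self._rows


def read_without_views(producer):
    """Ask for plain string/binary wherever the producer reports view types."""
    native = producer.__arrow_c_schema__()
    wanted = [VIEW_FALLBACKS.get(fmt, fmt) for fmt in native]
    requested = wanted if wanted != native else None
    return producer.__arrow_c_stream__(requested_schema=requested)
```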
   
   > And I hope that I won't have to add support for them in 
OGRLayer::WriteArrowBatch()
   
   For now I think you will be hard-pressed to find a producer that actually 
produces REE or View types (ListView is also on the horizon if it's not already 
implemented in Arrow C++). It's on nanoarrow's roadmap to support all of them 
(but Python bindings are on the roadmap first, which is no small 
feat!)...perhaps it can help!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
