paleolimbot commented on PR #367: URL: https://github.com/apache/arrow-nanoarrow/pull/367#issuecomment-1902733790
> Do you know if those types will be actually found in serialized formats, namely Parquet and Feather ?

I don't believe it's possible to get a stringview or binaryview from Parquet; however, in theory you could get one from Feather or Arrow IPC today with Arrow 15. I think that until pyarrow implements a basic level of support it's unlikely to actually end up being used. Some day the C++ Parquet scanner might be able to return those types, but there are very few people working on that part of the code base and I think it's unlikely to be implemented any time soon.

> Their value proposition compared to regular string or binary is unclear to me.

I think the gist of it is that regular string and binary are slow to sort, which is why Meta's Velox, DuckDB, and (very recently) Polars have adopted the view layout as their primary string representation. Arrow added it primarily for interchange with those systems (e.g., a user-defined function based on the C Data interface), although I agree that it came at the expense of unnecessary complexity for 99.9% of Arrow users.

> I should point that the proliferation of basic data types is going to be a serious obstacle to adoption by new implementations.

I agree. In theory this is what nanoarrow is designed to help with (when the types are supported...so not yet). The focus of the version about to be released is testing and stability...0.5.0 is more likely to include features that could be used in a fallback sort of way (i.e., maybe `ArrowArrayViewGetLogicalXXX()` that would handle the dictionary, run-end-encoded, view, or normal cases at the expense of an extra `switch()`).

> Perhaps there should be some "data type negociation" mechanism

In Arrow C++, this is most likely to be supported via an option in the readers.
For Parquet

In Python, the `__arrow_c_schema__()` and `__arrow_c_stream__(requested_schema=None)` methods can in theory handle this (i.e., you query `__arrow_c_schema__()` and, if it contains types you don't understand, you pass a different schema to `__arrow_c_stream__()`'s `requested_schema`). That said, pyarrow doesn't implement `requested_schema` yet, but it also doesn't really implement the view types either.

> And I hope that I won't have to add support for them in OGRLayer::WriteArrowBatch()

For now I think you will be hard-pressed to find a producer that actually produces REE or View types (ListView is also on the horizon if it's not already implemented in Arrow C++). It's on nanoarrow's roadmap to support all of them (but Python bindings are on the roadmap first, which is no small feat!)...perhaps it can help!
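
As a rough illustration of the `requested_schema` idea mentioned above, here is a minimal toy sketch. It is not how pyarrow or nanoarrow implement anything: real producers exchange PyCapsules wrapping `ArrowSchema` and `ArrowArrayStream` structs, whereas here plain dicts and iterators stand in. `ToyProducer`, `open_stream`, `SUPPORTED`, and `FALLBACKS` are hypothetical names invented for the example; the only real-world details are the two method names and the C Data interface format strings (`"l"` for int64, `"u"` for utf8, `"vu"` for utf8 view, `"z"` for binary, `"vz"` for binary view).

```python
# Toy sketch only: dicts of C Data interface format strings stand in for
# the PyCapsule-wrapped ArrowSchema a real producer would return.

SUPPORTED = {"l", "g", "u", "z"}      # formats this hypothetical consumer handles
FALLBACKS = {"vu": "u", "vz": "z"}    # utf8 view -> utf8, binary view -> binary


class ToyProducer:
    """Hypothetical producer whose native layout has a string view column."""

    def __init__(self):
        self._fields = {"id": "l", "name": "vu"}
        self._rows = [(1, "a"), (2, "b")]

    def __arrow_c_schema__(self):
        return dict(self._fields)

    def __arrow_c_stream__(self, requested_schema=None):
        schema = dict(self._fields if requested_schema is None else requested_schema)
        # A real producer would cast its columns to the requested types here.
        return schema, iter(self._rows)


def open_stream(producer):
    """Ask for the native schema; request fallbacks for types we don't understand."""
    native = producer.__arrow_c_schema__()
    requested = {}
    for name, fmt in native.items():
        if fmt in SUPPORTED:
            requested[name] = fmt
        elif fmt in FALLBACKS:
            requested[name] = FALLBACKS[fmt]
        else:
            raise TypeError(f"no fallback for format {fmt!r} in column {name!r}")

    if requested == native:
        schema, rows = producer.__arrow_c_stream__()
    else:
        schema, rows = producer.__arrow_c_stream__(requested_schema=requested)

    # Check what actually came back rather than trusting the request was honored.
    unsupported = {f for f in schema.values() if f not in SUPPORTED}
    if unsupported:
        raise TypeError(f"producer returned unsupported formats: {sorted(unsupported)}")
    return schema, rows


schema, rows = open_stream(ToyProducer())
print(schema)  # {'id': 'l', 'name': 'u'}
```

The consumer only sends a `requested_schema` when the native schema contains something it cannot read, so producers that never emit view types are unaffected.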
