Hi all,

As promised, I converted the design document [1] into an initial PR [2]. Rather than draft the whole header, I started with README + implementations + testing for error handling and schema allocation (depending on feedback, next week I will draft another reviewable chunk).

Also feel free to suggest another place to put this if one exists (the choice to put it in its own repo was based on informal feedback that perhaps that might be the best way to go).

[1] https://docs.google.com/document/d/11n7ICVZO8exZ-z3GRlI26VLzKPXlYlEz5xjLl1y0ujU/edit?usp=sharing
[2] https://github.com/paleolimbot/arrow-c/pull/1/files

On Fri, Jun 3, 2022 at 12:41 PM Dewey Dunnington <de...@voltrondata.com> wrote:

> Hi all,
>
> Based on the points raised above and a few adventures implementing some of
> this in related projects, I put together a brief design document proposing
> a scope and structure to perhaps solidify a few of these discussions:
> https://docs.google.com/document/d/11n7ICVZO8exZ-z3GRlI26VLzKPXlYlEz5xjLl1y0ujU/edit?usp=sharing
>
> Any and all should feel free to add, rewrite, or propose a new
> structure...I wrote many of the pieces for argument's sake or because
> that's how I'd implemented them before.
>
> Next week I will phrase it as a skeleton header (like the one in the
> excellent ADBC design discussions) depending on feedback to keep the
> discussion going!
>
> Cheers,
>
> -dewey
>
> On Fri, Jun 3, 2022 at 9:57 AM Hannes Mühleisen <han...@duckdblabs.com> wrote:
>
>> Hello List,
>>
>> we at DuckDB are happy users of the Arrow C Data Interface and use it to
>> feed SQL queries and also use it to provide query results in Arrow format
>> again. It is particularly appealing to us that the interface is merely a
>> (C) header file that we just ship with our source code [1]. Internally,
>> our implementation then constructs DuckDB internal vectors from the Arrow
>> format [2] or vice-versa [3].
>>
>> As you can see from [2, 3] there is some complexity in getting the
>> conversion right, especially for more complex data types like nested types
>> (list, strings). A lightweight, dependency-free library to help
>> constructing those would certainly be appreciated.
>> What would also help a lot is validation code, Arrow structures are very
>> delicate and one wrong pointer can lead to disaster (which is then blamed
>> on us), so a way to verify the structures in said lightweight library
>> would be very helpful.
>>
>> Best from Amsterdam, and Quack
>>
>> Hannes
>>
>> [1] https://github.com/duckdb/duckdb/blob/master/src/include/duckdb/common/arrow.hpp
>> [2] https://github.com/duckdb/duckdb/blob/master/src/function/table/arrow.cpp
>> [3] https://github.com/duckdb/duckdb/blob/master/src/common/types/data_chunk.cpp
>>
>> On Fri, Jun 03, 2022 at 15:34:42, Jonathan Keane <jke...@gmail.com> wrote:
>>
>> > cc Hannes Mühleisen from DuckDB Labs
>> >
>> > -Jon
>> >
>> > On Tue, May 31, 2022 at 5:03 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> > I'm also supportive of having a small vendorable C/C++ "Arrow
>> > middleware" that provides:
>> >
>> > * Schemas and types
>> > * Columnar data structures and minimal APIs to build them and iterate
>> >   over them
>> > * C data interface
>> > * Minimal validation (at the level of Validate but not ValidateFull)
>> >
>> > I don't think it's going to be practical to try to refactor parts of
>> > the existing Arrow C++ core to be vendorable since there are many
>> > features / requirements (e.g. an extensible buffer and device API)
>> > that these C++ classes include that aren't needed in this
>> > limited-feature middleware library.
>> >
>> > This also relates to the "Improving Arrow's database support" project
>> > that David Li raised some time ago [1]. If we want to encourage
>> > database driver libraries to add new APIs that emit the Arrow C
>> > interface, we need to make it easier to generate the C interface
>> > without requiring a new library dependency.
>> >
>> > [1]: https://lists.apache.org/thread/gnz1kz2rj3rb8rh8qz7l0mv8lvzq254w
>> >
>> > On Mon, May 30, 2022 at 11:31 AM Jonathan Keane <jke...@gmail.com> wrote:
>> > >
>> > > Thanks for working on this. I've heard people asking about something
>> > > like this from a number of different fronts on top of the obvious use
>> > > case in geoarrow | other geospatial libraries. I think a minimal piece
>> > > of Arrow that other packages could depend on without needing to bring
>> > > in all of arrow would be super valuable in building the bridges we
>> > > want across other systems.
>> > >
>> > > Do you have any (design) documentation that describes the scope of
>> > > what you're thinking? I know there have been others floating around
>> > > [1] [2] that were in a similar spirit.
>> > >
>> > > A few more questions I hope will spark more conversation: How do the
>> > > header files you linked in [3] overlap with these other efforts? Are
>> > > those headers something we could|should "just" PR into apache/arrow
>> > > and write up how to use them? If not what is the work to make them so
>> > > that they could be (the answer of course could be design something
>> > > else entirely and PR that!)?
>> > >
>> > > [1] https://github.com/paleolimbot/narrow
>> > > [2] https://paleolimbot.github.io/narrow/articles/why-narrow.html
>> > > [3] https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp
>> > >
>> > > -Jon
>> > >
>> > > On Wed, May 25, 2022 at 9:29 AM Dewey Dunnington <de...@voltrondata.com> wrote:
>> > > >
>> > > > I'm writing to gauge interest in a set of helpers in C and/or C++ for
>> > > > reading/exporting Arrow C Data interface structures.
>> > > > My use-case is building Arrow geospatial support in R [1], and while
>> > > > the set of helpers I've been using [2] has served the purpose of me
>> > > > writing about the opportunities for Arrow + geospatial [3], I would
>> > > > like to rewrite the prototype based on something developed by/with
>> > > > the Arrow community.
>> > > >
>> > > > Does a set of C/C++ helpers for Arrow C Data interface structures
>> > > > already exist? *Should* it exist?
>> > > >
>> > > > If it doesn't, what should the name/scope of that library be? The
>> > > > names 'nanoarrow', 'narrow', 'sparrow', and 'arrow-hpp' have all
>> > > > surfaced in my limited discussion of this so far. For the purpose of
>> > > > starting the discussion, I'll posit that the library should include
>> > > > helpers to allocate/destroy C Data interface structures, a schema
>> > > > metadata encoder/decoder, validation of a schema/array pair, and
>> > > > something like the ArrayBuilder C++ class.
>> > > >
>> > > > [1] https://lists.apache.org/thread/yb7p9wpg3k128njskhwj9j788opb67g7
>> > > > [2] https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp
>> > > > [3] https://docs.google.com/document/d/1A6e3XCerjhXVFHBDaoAlBBNFb2HG4RB9SVRpuBru7E4/edit?usp=sharing
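
For readers following the thread, here is a rough sketch in C of what "helpers to allocate/destroy C Data interface structures" with simple error handling might look like. The `struct ArrowSchema` layout is the one defined by the Arrow C Data Interface. Everything else is an assumption made for illustration: the names `example_schema_init`, `example_schema_release`, and `example_strdup` are hypothetical, the errno-style return codes are one possible convention, and none of this is taken from the PR or the design document.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Layout as defined by the Arrow C Data Interface. */
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

/* Hypothetical helper: heap copy of a C string (strdup is not ISO C). */
static char* example_strdup(const char* s) {
  size_t n = strlen(s) + 1;
  char* out = (char*)malloc(n);
  if (out != NULL) memcpy(out, s, n);
  return out;
}

/* Hypothetical release callback: frees the strings owned by a schema created
 * with example_schema_init(), then marks the struct as released by setting
 * release to NULL, as the C Data Interface requires. */
static void example_schema_release(struct ArrowSchema* schema) {
  if (schema == NULL || schema->release == NULL) return;
  free((void*)schema->format);
  free((void*)schema->name);
  schema->format = NULL;
  schema->name = NULL;
  schema->release = NULL;
}

/* Hypothetical initializer for a schema with no children, dictionary, or
 * metadata. Returns 0 on success, or EINVAL / ENOMEM on failure (an assumed
 * convention). On failure the schema is left with release == NULL. */
static int example_schema_init(struct ArrowSchema* schema, const char* format,
                               const char* name) {
  if (schema == NULL || format == NULL) return EINVAL;

  schema->format = NULL;
  schema->name = NULL;
  schema->metadata = NULL;
  schema->flags = 0;
  schema->n_children = 0;
  schema->children = NULL;
  schema->dictionary = NULL;
  schema->release = NULL;
  schema->private_data = NULL;

  char* format_copy = example_strdup(format);
  char* name_copy = example_strdup(name != NULL ? name : "");
  if (format_copy == NULL || name_copy == NULL) {
    free(format_copy);
    free(name_copy);
    return ENOMEM;
  }

  schema->format = format_copy;
  schema->name = name_copy;
  schema->release = &example_schema_release;
  return 0;
}
```

A caller would declare a `struct ArrowSchema` on the stack, call `example_schema_init(&schema, "i", "col")` (where `"i"` is the format string for int32), check that the return value is 0, and later call `schema.release(&schema)`. Having the helper own copies of the strings keeps the release callback self-contained, at the cost of two small allocations per schema.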