Hi all, I drafted a second PR [1] with a design for storing the information parsed from a struct ArrowSchema (i.e., parsing the format string into usable C structures). There are some unsolved problems that could use a fresh perspective, so all comments are welcome!
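As a rough illustration of what "parsing the format string into usable C structures" could look like, here is a minimal sketch. It is not the design in the PR: the names ParseStatus, ParsedTypeId, ParsedFormat, parse_int32() and parse_format() are all hypothetical, and only a handful of format strings from the Arrow C Data Interface specification ("i", "g", "u", "+l", "+s", "w:N" and "d:P,S") are handled. Everything else is reported as unsupported.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical status codes; not taken from the PR. */
enum ParseStatus { PARSE_OK = 0, PARSE_INVALID = 1, PARSE_UNSUPPORTED = 2 };

/* Hypothetical type identifiers covering a small subset of Arrow types. */
enum ParsedTypeId {
  PARSED_TYPE_INT32,
  PARSED_TYPE_DOUBLE,
  PARSED_TYPE_STRING,
  PARSED_TYPE_LIST,
  PARSED_TYPE_STRUCT,
  PARSED_TYPE_FIXED_SIZE_BINARY,
  PARSED_TYPE_DECIMAL128
};

/* Hypothetical container for the parameters encoded in a format string. */
struct ParsedFormat {
  enum ParsedTypeId type_id;
  int32_t fixed_size;        /* byte width for "w:N", otherwise 0 */
  int32_t decimal_precision; /* P in "d:P,S", otherwise 0 */
  int32_t decimal_scale;     /* S in "d:P,S", otherwise 0 */
};

/* Read a base-10 int32 at *cursor and advance *cursor past it. */
static int parse_int32(const char** cursor, int32_t* out) {
  char* end = NULL;
  errno = 0;
  long value = strtol(*cursor, &end, 10);
  if (end == *cursor || errno != 0 || value < INT32_MIN || value > INT32_MAX) {
    return PARSE_INVALID;
  }
  *out = (int32_t)value;
  *cursor = end;
  return PARSE_OK;
}

/* Fill `out` from a format string such as ArrowSchema.format. */
int parse_format(const char* format, struct ParsedFormat* out) {
  if (format == NULL || out == NULL) return PARSE_INVALID;
  memset(out, 0, sizeof(*out));

  /* Types with no parameters are matched exactly. */
  if (strcmp(format, "i") == 0) { out->type_id = PARSED_TYPE_INT32; return PARSE_OK; }
  if (strcmp(format, "g") == 0) { out->type_id = PARSED_TYPE_DOUBLE; return PARSE_OK; }
  if (strcmp(format, "u") == 0) { out->type_id = PARSED_TYPE_STRING; return PARSE_OK; }
  if (strcmp(format, "+l") == 0) { out->type_id = PARSED_TYPE_LIST; return PARSE_OK; }
  if (strcmp(format, "+s") == 0) { out->type_id = PARSED_TYPE_STRUCT; return PARSE_OK; }

  /* "w:N": fixed-size binary with a positive byte width N. */
  if (strncmp(format, "w:", 2) == 0) {
    const char* cursor = format + 2;
    out->type_id = PARSED_TYPE_FIXED_SIZE_BINARY;
    if (parse_int32(&cursor, &out->fixed_size) != PARSE_OK) return PARSE_INVALID;
    if (*cursor != '\0' || out->fixed_size <= 0) return PARSE_INVALID;
    return PARSE_OK;
  }

  /* "d:P,S": decimal128 with precision P and scale S. */
  if (strncmp(format, "d:", 2) == 0) {
    const char* cursor = format + 2;
    out->type_id = PARSED_TYPE_DECIMAL128;
    if (parse_int32(&cursor, &out->decimal_precision) != PARSE_OK) return PARSE_INVALID;
    if (*cursor != ',') return PARSE_INVALID;
    cursor++;
    if (parse_int32(&cursor, &out->decimal_scale) != PARSE_OK) return PARSE_INVALID;
    /* The "d:P,S,bitwidth" form is deliberately left out of this sketch. */
    return *cursor == '\0' ? PARSE_OK : PARSE_UNSUPPORTED;
  }

  return PARSE_UNSUPPORTED;
}
```

A caller would pass schema->format and then switch on type_id. Returning a status code rather than aborting is one possible choice here, since format strings arrive from other libraries and cannot be trusted to be well formed.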
[1] https://github.com/paleolimbot/arrow-c/pull/5

On Fri, Jun 10, 2022 at 12:27 PM Dewey Dunnington <de...@voltrondata.com> wrote:

> Hi all,
>
> As promised, I converted the design document [1] into an initial PR [2].
> Rather than draft the whole header, I started with README + implementations
> + testing for error handling and schema allocation (depending on feedback,
> next week I will draft another reviewable chunk).
>
> Also feel free to suggest another place to put this if one exists (the
> choice to put it in its own repo was based on informal feedback that
> perhaps that might be the best way to go).
>
> [1] https://docs.google.com/document/d/11n7ICVZO8exZ-z3GRlI26VLzKPXlYlEz5xjLl1y0ujU/edit?usp=sharing
> [2] https://github.com/paleolimbot/arrow-c/pull/1/files
>
> On Fri, Jun 3, 2022 at 12:41 PM Dewey Dunnington <de...@voltrondata.com> wrote:
>
>> Hi all,
>>
>> Based on the points raised above and a few adventures implementing some
>> of this in related projects, I put together a brief design document
>> proposing a scope and structure to perhaps solidify a few of these
>> discussions:
>> https://docs.google.com/document/d/11n7ICVZO8exZ-z3GRlI26VLzKPXlYlEz5xjLl1y0ujU/edit?usp=sharing
>>
>> Any and all should feel free to add, rewrite, or propose a new
>> structure...I wrote many of the pieces for argument's sake or because
>> that's how I'd implemented them before.
>>
>> Next week I will phrase it as a skeleton header (like the one in the
>> excellent ADBC design discussions) depending on feedback to keep the
>> discussion going!
>>
>> Cheers,
>>
>> -dewey
>>
>> On Fri, Jun 3, 2022 at 9:57 AM Hannes Mühleisen <han...@duckdblabs.com> wrote:
>>
>>> Hello List,
>>>
>>> we at DuckDB are happy users of the Arrow C Data Interface and use it to
>>> feed SQL queries and also use it to provide query results in Arrow format
>>> again. It is particularly appealing to us that the interface is merely a
>>> (C) header file that we just ship with our source code [1]. Internally, our
>>> implementation then constructs DuckDB internal vectors from the Arrow
>>> format [2] or vice-versa [3].
>>>
>>> As you can see from [2, 3] there is some complexity in getting the
>>> conversion right, especially for more complex data types like nested types
>>> (list, strings). A lightweight, dependency-free library to help
>>> constructing those would certainly be appreciated. What would also help a
>>> lot is validation code, Arrow structures are very delicate and one wrong
>>> pointer can lead to disaster (which is then blamed on us), so a way to
>>> verify the structures in said lightweight library would be very helpful.
>>>
>>> Best from Amsterdam, and Quack
>>>
>>> Hannes
>>>
>>> [1] https://github.com/duckdb/duckdb/blob/master/src/include/duckdb/common/arrow.hpp
>>> [2] https://github.com/duckdb/duckdb/blob/master/src/function/table/arrow.cpp
>>> [3] https://github.com/duckdb/duckdb/blob/master/src/common/types/data_chunk.cpp
>>>
>>> On Fri, Jun 03, 2022 at 15:34:42, Jonathan Keane <jke...@gmail.com> wrote:
>>>
>>> > cc Hannes Mühleisen from DuckDB Labs
>>> >
>>> > -Jon
>>> >
>>> > On Tue, May 31, 2022 at 5:03 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>> >
>>> > I'm also supportive of having a small vendorable C/C++ "Arrow
>>> > middleware" that provides:
>>> >
>>> > * Schemas and types
>>> > * Columnar data structures and minimal APIs to build them and iterate over them
>>> > * C data interface
>>> > * Minimal validation (at the level of Validate but not ValidateFull)
>>> >
>>> > I don't think it's going to be practical to try to refactor parts of
>>> > the existing Arrow C++ core to be vendorable since there are many
>>> > features / requirements (e.g. an extensible buffer and device API)
>>> > that these C++ classes include that aren't needed in this
>>> > limited-feature middleware library.
>>> >
>>> > This also relates to the "Improving Arrow's database support" project
>>> > that David Li raised some time ago [1]. If we want to encourage
>>> > database driver libraries to add new APIs that emit the Arrow C
>>> > interface, we need to make it easier to generate the C interface
>>> > without requiring a new library dependency.
>>> >
>>> > [1]: https://lists.apache.org/thread/gnz1kz2rj3rb8rh8qz7l0mv8lvzq254w
>>> >
>>> > On Mon, May 30, 2022 at 11:31 AM Jonathan Keane <jke...@gmail.com> wrote:
>>> > >
>>> > > Thanks for working on this. I've heard people asking about something
>>> > > like this from a number of different fronts on top of the obvious use
>>> > > case in geoarrow | other geospatial libraries. I think a minimal piece
>>> > > of Arrow that other packages could depend on without needing to bring
>>> > > in all of arrow would be super valuable in building the bridges we
>>> > > want across other systems.
>>> > >
>>> > > Do you have any (design) documentation that describes the scope of
>>> > > what you're thinking? I know there have been others floating around
>>> > > [1] [2] that were in a similar spirit.
>>> > >
>>> > > A few more questions I hope will spark more conversation: How do the
>>> > > header files you linked in [3] overlap with these other efforts? Are
>>> > > those headers something we could|should "just" PR into apache/arrow
>>> > > and write up how to use them? If not what is the work to make them so
>>> > > that they could be (the answer of course could be design something
>>> > > else entirely and PR that!)?
>>> > >
>>> > > [1] https://github.com/paleolimbot/narrow
>>> > > [2] https://paleolimbot.github.io/narrow/articles/why-narrow.html
>>> > > [3] https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp
>>> > >
>>> > > -Jon
>>> > >
>>> > > -Jon
>>> > >
>>> > > On Wed, May 25, 2022 at 9:29 AM Dewey Dunnington <de...@voltrondata.com> wrote:
>>> > > >
>>> > > > I'm writing to gauge interest in a set of helpers in C and/or C++ for
>>> > > > reading/exporting Arrow C Data interface structures. My use-case is
>>> > > > building Arrow geospatial support in R [1], and while the set of helpers
>>> > > > I've been using [2] has served the purpose of me writing about the
>>> > > > opportunities for Arrow + geospatial [3], I would like to rewrite the
>>> > > > prototype based on something developed by/with the Arrow community.
>>> > > >
>>> > > > Does a set of C/C++ helpers for Arrow C Data interface structures already
>>> > > > exist? *Should* it exist?
>>> > > >
>>> > > > If it doesn't, what should the name/scope of that library be? The names
>>> > > > 'nanoarrow', 'narrow', 'sparrow', and 'arrow-hpp' have all surfaced in my
>>> > > > limited discussion of this so far. For the purpose of starting the
>>> > > > discussion, I'll posit that the library should include helpers to
>>> > > > allocate/destroy C Data interface structures, a schema metadata
>>> > > > encoder/decoder, validation of a schema/array pair, and something like the
>>> > > > ArrayBuilder C++ class.
>>> > > >
>>> > > > [1] https://lists.apache.org/thread/yb7p9wpg3k128njskhwj9j788opb67g7
>>> > > > [2] https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp
>>> > > > [3] https://docs.google.com/document/d/1A6e3XCerjhXVFHBDaoAlBBNFb2HG4RB9SVRpuBru7E4/edit?usp=sharing