Re: [DISCUSS] Splitting out the Arrow format directory
Agreed, and I hope I didn't come off as flippant with respect to performance. I was hoping to convey that focusing on performance before we have the semantics and high-level design nailed down is not time well spent. I think the current design doesn't depend on the format, which is a good thing: we can pick the format that best suits the needs of the community, and since performance is a big part of Arrow, that likely means picking a format that is also geared towards performance.

On Fri, Aug 13, 2021 at 2:57 PM Keith Kraus wrote:

> In other processing engines I've seen situations, somewhat commonly, where the time to build the compute graph becomes non-negligible and even more expensive than doing the computation itself. I've even seen situations where attempts were made to build a graph iteratively while executing, in order to overlap the cost of building the graph with the compute execution.
>
> There's been a huge amount of effort put into optimizing critical kernel components, like the hash table implementation, in order to make Arrow the most performant analytical library possible. Architecting and designing the IR implementation without performance in mind from the beginning could put us in a difficult situation later that we'd have to invest considerably more effort to work our way out of.
Re: [DISCUSS] Splitting out the Arrow format directory
> Personally, I do not care about the speed of IR processing right now. Any non-trivial (and probably trivial too) computation done by an IR consumer will dwarf the cost of IR processing. Of course, we shouldn't prematurely pessimize either, but there's no reason to spend time worrying about IR processing performance in my opinion (yet).

In other processing engines I've seen situations, somewhat commonly, where the time to build the compute graph becomes non-negligible and even more expensive than doing the computation itself. I've even seen situations where attempts were made to build a graph iteratively while executing, in order to overlap the cost of building the graph with the compute execution.

There's been a huge amount of effort put into optimizing critical kernel components, like the hash table implementation, in order to make Arrow the most performant analytical library possible. Architecting and designing the IR implementation without performance in mind from the beginning could put us in a difficult situation later that we'd have to invest considerably more effort to work our way out of.
Re: [DISCUSS] Splitting out the Arrow format directory
I believe you would need a JSON-compatible version of the type system (including binary values) because you'd need to at least encode literals. However, I don't think that creating a human-readable encoding of the Arrow type system is a bad thing in and of itself. We have tickets, and occasionally get questions, asking for a JSON format, so this could at least be a step in that direction. I don't think you'd need to add support for arrays/batches/tables. Note that the C++ implementation has a JSON format that is used for testing purposes (though I do not believe it is comprehensive).

I think we could add two (potentially conflicting) requirements:

* Low barrier to entry for consumers
* Low barrier to entry for producers

JSON/YAML seem to lower the barrier to entry for producers. Some producers may not even be working with Arrow data (e.g. could one go from SQL-literal -> JSON-literal, skipping an intermediate Arrow-literal step?). I think we've also dismissed Antoine's earlier point, which I found the most compelling: handling flatbuffers adds one more step that people have to integrate into their build systems.

Flatbuffers, on the other hand, lowers the barrier to entry for consumers. A consumer is likely already going to have flatbuffers support built in so that they can read/write IPC files. If we adopt JSON then the consumer will have to add support for a new file format (or at least part of one).
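For illustration, a JSON-compatible encoding of a type plus a literal might look like the sketch below. This is a hypothetical shape, not taken from any Arrow specification or from the C++ testing format; only the idea (a type descriptor alongside an encoded value) reflects the discussion above.

```python
import json

# Hypothetical sketch of a JSON encoding of an Arrow-like type descriptor
# plus a typed literal -- the minimum an IR would need to carry. The field
# names here ("name", "unit", "timezone", "value") are illustrative only.
literal = {
    "type": {"name": "timestamp", "unit": "ms", "timezone": "UTC"},
    "value": 1628864220000,
}

encoded = json.dumps(literal, sort_keys=True)  # produce the wire form
decoded = json.loads(encoded)                  # any JSON library can consume it

assert decoded == literal
```

A producer that never touches Arrow data (e.g. a SQL front end) could emit this shape directly, which is the "low barrier to entry for producers" point above.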
Re: [DISCUSS] Splitting out the Arrow format directory
> I just thought of one other requirement: the format needs to support arbitrary byte sequences.

Can you clarify why this is needed? Is it that custom_metadata maps should allow byte sequences as values?
Re: [DISCUSS] Splitting out the Arrow format directory
On Fri, Aug 13, 2021 at 11:43 AM Antoine Pitrou wrote:

> But the code executed by machines is written by humans. I think that's mostly where the contention resides: is it easy to code, in any given language, the routines required to produce or consume the IR?

Definitely not for flatbuffers, which is IMO annoying to use in any language except C++, and it's borderline annoying there too. Protobuf is similar (less annoying in Rust, but still annoying in Python and C++ IMO), though I think any binary format is going to be less human-friendly, by construction.

If we were to use something like JSON or msgpack, can someone sketch out the interaction between the IR and the rest of Arrow's type system? Would we need a JSON-encoded-Arrow-type -> in-memory representation for an Arrow type in a given language?

I just thought of one other requirement: the format needs to support arbitrary byte sequences. JSON doesn't support untransformed byte sequences, though it's not uncommon to base64-encode a byte sequence. IMO that adds an unnecessary layer of complexity, which is another tradeoff to consider.
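To make the byte-sequence tradeoff concrete, here is a minimal sketch of the extra encode/decode layer base64 imposes when a byte sequence (say, a binary literal or a custom_metadata value) has to travel through JSON. The key name is hypothetical.

```python
import base64
import json

# JSON cannot carry raw bytes, so a byte sequence must be transformed;
# base64 is the usual choice. Note the extra layer on both sides.
raw = b"\x00\xff\xfe arrow"

# Producer side: bytes -> base64 text -> JSON.
payload = json.dumps({"binary_literal": base64.b64encode(raw).decode("ascii")})

# Consumer side: JSON -> base64 text -> bytes.
roundtripped = base64.b64decode(json.loads(payload)["binary_literal"])
assert roundtripped == raw
```

Every consumer has to know which string fields are "really" bytes, which is exactly the unnecessary layer of complexity mentioned above.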
Re: [DISCUSS] Splitting out the Arrow format directory
On 13/08/2021 at 17:35, Phillip Cloud wrote:

>> I.e. make the ability to read and write by humans be more important than speed of validation.
>
> I think I differ on whether the IR should be easy to read and write by humans. IR is going to be predominantly read and written by machines, though of course we will need a way to inspect it for debugging.

But the code executed by machines is written by humans. I think that's mostly where the contention resides: is it easy to code, in any given language, the routines required to produce or consume the IR?
Re: [DISCUSS] Splitting out the Arrow format directory
On Fri, Aug 13, 2021 at 8:03 AM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:

> Hi,
>
> > The requirements for the compute IR as I see it are:
> >
> > * Implementations in IR producer and consumer languages.
> > * Strongly typed or the ability to easily validate a payload
>
> What about:
>
> 1. easy to read and write by a large number of programming languages

Personally, I do not care about the speed of IR processing right now. Any non-trivial (and probably trivial too) computation done by an IR consumer will dwarf the cost of IR processing. Of course, we shouldn't prematurely pessimize either, but there's no reason to spend time worrying about IR processing performance in my opinion (yet).

> 2. easy to read and write by humans

I think this is where I differ. Would you accept "easy to transform into something that can be read and written by humans"? For example, you can turn a flatbuffer blob into its JSON equivalent using a few command line flags passed to flatc. That way, the IR can be flatbuffers, but if at any point someone wants to look at something other than a meaningless blob of bytes, they can.

> 3. fast to validate by a large number of programming languages

I guess it depends on what "fast" means here, as well as the programming language and implementation of the validator. In my view, this falls under "let's not worry about performance yet". To that point, I think a structured format like protobuf or flatbuffers lets us punt on performance for now. A counter-argument might be "if we're punting on performance, then why not pick the one that's easiest to debug?" My only answer to that is reuse of existing flatbuffers types, which requires some work (at some point) to figure out how to distribute the generated code. With JSON/TOML/YAML we would have to build that ourselves. Maybe it's not a lot of effort, but my inclination is to write more CI code rather than library code, if that's an option :)

> I.e. make the ability to read and write by humans be more important than speed of validation.

I think I differ on whether the IR should be easy to read and write by humans. IR is going to be predominantly read and written by machines, though of course we will need a way to inspect it for debugging.

> In this order, JSON/toml/yaml are preferred because they are supported by more languages and are more human-readable, even if slower to validate.
>
> My understanding is that for an async experience, we need the ability to `.await` at any `read_X` call, so that if the `read_X` requests more bytes than are buffered, the `read_X(...).await` triggers a new (async) request to fill the buffer (which puts the future in a Pending state). When a library does not offer an async version of `read_X`, any `read_X` can force a request to fill the buffer, which then blocks the thread. One way around this is to wrap those blocking calls in async (e.g. via tokio::spawn_blocking). However, this forces users to use that runtime, or to create a new independent thread pool for their own async work. Neither is great for low-level libraries.

I think I'm still missing something here. You can asynchronously read arbitrary byte sequences from a wide variety of IO sources and then parse the bytes into the desired format; I don't follow why that isn't sufficient to take advantage of async. A library like tonic, for example, doesn't require that prost implement async APIs (I still don't know what that would mean for an in-memory format), yet tonic takes full advantage of async; in fact, I think it's _only_ async. I could understand the desire for a library to provide something like a capital-S Stream where the bytes are consumed asynchronously. Is that what you're after here?

> E.g. thrift does not offer async -> parquet-format-rs does not offer async -> parquet does not offer async -> datafusion wraps all parquet "IO-bounded and CPU-bounded operations" in spawn_blocking or something equivalent.
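The "read bytes asynchronously, parse synchronously" pattern described above can be sketched in a few lines. This is an illustration in Python's asyncio rather than Rust, and the IR payload shape is hypothetical; the point is that only the byte transfer is async, so the (de)serialization library needs no async API at all.

```python
import asyncio
import json

# The IO side is async; the parse of the fully buffered bytes is synchronous.
# No async support is required from the serialization format's library.
async def read_payload(data: bytes) -> dict:
    reader = asyncio.StreamReader()
    reader.feed_data(data)   # stands in for bytes arriving from a socket/file
    reader.feed_eof()
    raw = await reader.read()  # asynchronous byte transfer
    return json.loads(raw)     # synchronous parse of the buffered bytes

# Hypothetical IR node, JSON-encoded for the sake of the example.
plan = asyncio.run(read_payload(b'{"op": "filter", "input": "t0"}'))
assert plan["op"] == "filter"
```

This mirrors the tonic/prost split: the transport is async while the message codec itself is not.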
Re: [DISCUSS] Splitting out the Arrow format directory
On Fri, Aug 13, 2021 at 2:03 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:

> What about:
>
> 1. easy to read and write by a large number of programming languages
> 2. easy to read and write by humans
> 3. fast to validate by a large number of programming languages
>
> I.e. make the ability to read and write by humans be more important than speed of validation.
>
> In this order, JSON/toml/yaml are preferred because they are supported by more languages and are more human-readable, even if slower to validate.

I am not sure that using JSON would make the IR "faster to validate", because the validation I believe we care more about is checking that the IR is consistent with the specification. When you use Flatbuffers, the schema verifier is built into the library. With JSON, many implementations must determine for themselves whether the data is incorrectly constructed (there are, of course, libraries and frameworks available nowadays which help with enforcing JSON schemas). I think it would be fine to have a JSON alternative format for the IR, but as the canonical/primary representation I believe it would make for a net-higher implementation burden (to make something really robust, at least) for IR users.
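To illustrate the implementation burden mentioned above: with JSON as the primary format, each consumer must hand-roll (or pull in a JSON-Schema library for) the structural checks that a Flatbuffers verifier performs out of the box. A minimal hand-rolled check for a hypothetical literal node might look like this; the IR shape is invented for the example.

```python
import json

# Hand-rolled structural validation of a hypothetical JSON IR literal node.
# Every consumer would need some version of this, per node type.
def validate_literal(payload: str) -> dict:
    node = json.loads(payload)
    if not isinstance(node, dict):
        raise ValueError("IR node must be an object")
    if "type" not in node or "value" not in node:
        raise ValueError("literal requires 'type' and 'value' fields")
    if not isinstance(node["type"], str):
        raise ValueError("'type' must be a string")
    return node

ok = validate_literal('{"type": "int32", "value": 42}')
assert ok["value"] == 42

try:
    validate_literal('{"value": 42}')   # missing "type": should be rejected
except ValueError as err:
    assert "requires" in str(err)
```

Multiply this by every node kind in the IR and the "net-higher implementation burden" becomes clear.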
Re: [DISCUSS] Splitting out the Arrow format directory
Hi,

The requirements for the compute IR as I see it are:

> * Implementations in IR producer and consumer languages.
> * Strongly typed or the ability to easily validate a payload

What about:

1. easy to read and write by a large number of programming languages
2. easy to read and write by humans
3. fast to validate by a large number of programming languages

I.e. make the ability to read and write by humans more important than speed of validation. In this order, JSON/toml/yaml are preferred because they are supported by more languages and are more human-readable, even if slower to validate.

-

My understanding is that for an async experience, we need the ability to `.await` at any `read_X` call, so that if the `read_X` requests more bytes than are buffered, the `read_X(...).await` triggers a new (async) request to fill the buffer (which puts the future in a Pending state). When a library does not offer an async version of `read_X`, any `read_X` can force a request to fill the buffer, which then blocks the thread. One way around this is to wrap those blocking calls in async (e.g. via tokio::spawn_blocking). However, this forces users to use that runtime, or to create a new independent thread pool for their own async work. Neither is great for low-level libraries.

E.g. thrift does not offer async -> parquet-format-rs does not offer async -> parquet does not offer async -> datafusion wraps all parquet "IO-bounded and CPU-bounded operations" in spawn_blocking or something equivalent.

Best,
Jorge

On Thu, Aug 12, 2021 at 10:03 PM Phillip Cloud wrote:

> On Thu, Aug 12, 2021 at 1:03 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>
> > I agree with Antoine that we should weigh the pros and cons of flatbuffers (or protobuf or thrift for that matter) over a more human-friendly, simpler format like JSON or MsgPack. I also struggle a bit to reason with the complexity of using flatbuffers for this.
>
> Ultimately I think different representations of the format will emerge if compute IR is successful, and people will implement JSON/proto/thrift/etc versions of the IR.
>
> The requirements for the compute IR as I see it are:
>
> * Implementations in IR producer and consumer languages.
> * Strongly typed or the ability to easily validate a payload
>
> It seems like Protobuf, Flatbuffers and JSON all meet the criteria here. Beyond that, there's precedent in the codebase for flatbuffers (which is just to say that flatbuffers is the devil we know).
>
> Can people list other concrete requirements for the format? A non-requirement might be that there be _idiomatic_ implementations for every language arrow supports, for example.
>
> I think without agreement on requirements we won't ever arrive at consensus. The compute IR spec itself doesn't really depend on the specific choice of format, but we need to get some consensus on the format.
>
> > E.g. there is no async support for thrift, flatbuffers nor protobuf in Rust, which e.g. means that we can't read either parquet or arrow IPC async atm. These problems are usually easier to work around in simpler formats.
>
> Can you elaborate a bit on the lack of async support here and what it would mean for a particular in-memory representation to support async, and why that prevents reading a parquet file using async?
>
> Looking at JSON as an example, most libraries in the Rust ecosystem use serde and serde_json to serialize and deserialize JSON, and any async concerns occur at the level of a client/server library like warp (or some transitive dependency thereof like Hyper).
>
> Are you referring to something like the functionality implemented in tokio-serde-json? If so, I think you could probably build something for these other formats assuming they have serde support (flatbuffers notably does _not_, partially because of its incessant need to own everything), since tokio_serde is doing most of the work in tokio-serde-json. In any case, I don't think it's a requirement for the compute IR that there be a streaming transport implementation for the format.
>
> > Best,
> > Jorge
> >
> > On Thu, Aug 12, 2021 at 2:43 PM Antoine Pitrou wrote:
> >
> > > On 12/08/2021 at 15:05, Wes McKinney wrote:
> > > > It seems that one adjacent problem here is how to make it simpler for third parties (especially ones that act as front end interfaces) to build and serialize/deserialize the IR structures with some kind of ready-to-go middleware library, written in a language like C++.
> > >
> > > A C++ library sounds a bit complicated to deal with for Java, Rust, Go, etc. developers.
> > >
> > > I'm not sure which design decision and set of compromises would make the most sense. But this is why I'm asking the question "why not JSON?" (+ JSON-Schema if you want to ease validation by third parties).
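The tokio::spawn_blocking workaround Jorge describes has a direct analogue in Python's asyncio, which makes the shape of the problem easy to see: a sync-only reader is pushed onto a worker thread so it does not block the event loop, at the cost of tying the caller to a particular runtime strategy. The payload here is hypothetical.

```python
import asyncio
import json

# Stand-in for a sync-only call (e.g. a thrift or parquet read): it can block
# the calling thread while it fills its buffer.
def blocking_read() -> bytes:
    return b'{"columns": ["a", "b"]}'

async def main() -> dict:
    # Analogue of tokio::spawn_blocking: offload the blocking call to a
    # worker thread so the event loop stays responsive.
    raw = await asyncio.to_thread(blocking_read)
    return json.loads(raw)

schema = asyncio.run(main())
assert schema["columns"] == ["a", "b"]
```

This works, but the library (or its caller) now owns a thread-pool decision, which is Jorge's objection for low-level libraries.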
Re: [DISCUSS] Splitting out the Arrow format directory
On Thu, Aug 12, 2021 at 1:03 PM Jorge Cardoso Leitão < jorgecarlei...@gmail.com> wrote: > I agree with Antoine that we should weigh the pros and cons of flatbuffers > (or protobuf or thrift for that matter) over a more human-friendly, > simpler, format like json or MsgPack. I also struggle a bit to reason with > the complexity of using flatbuffers for this. > Ultimately I think different representations of the format will emerge if compute IR is successful, and people will implement JSON/proto/thrift/etc versions of the IR. The requirements for the compute IR as I see it are: * Implementations in IR producer and consumer languages. * Strongly typed or the ability to easily validate a payload It seems like Protobuf, Flatbuffers and JSON all meet the criteria here. Beyond that, there's precedent in the codebase for flatbuffers (which is just to say that flatbuffers is the devil we know). Can people list other concrete requirements for the format? A non-requirement might be that there be _idiomatic_ implementations for every language arrow supports, for example. I think without agreement on requirements we won't ever arrive at consensus. The compute IR spec itself doesn't really depend on the specific choice of format, but we need to get some consensus on the format. > E.g. there is no async support for thrift, flatbuffers nor protobuf in > Rust, which e.g. means that we can't read neither parquet nor arrow IPC > async atm. These problems are usually easier to work around in simpler > formats. > Can you elaborate a bit on the lack of async support here and what it would mean for a particular in-memory representation to support async, and why that prevents reading a parquet file using async? Looking at JSON as an example, most libraries in the Rust ecosystem use serde and serde_json to serialize and deserialize JSON, and any async concerns occur at the level of a client/server library like warp (or some transitive dependency thereof like Hyper). 
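One wrinkle raised earlier in the thread for any text format: IR literals can contain arbitrary byte sequences, which JSON cannot carry directly, so a JSON encoding would need a convention such as base64. A sketch of the producer/consumer round trip in Python; the node shape and type tag are hypothetical, not part of any Arrow spec:

```python
import base64
import json

# Hypothetical JSON encoding of a binary literal: raw bytes are
# base64-encoded on the producer side and decoded by the consumer.
raw = b"\x00\xff\xfe arbitrary bytes"
node = {"kind": "literal", "type": "binary",
        "value": base64.b64encode(raw).decode("ascii")}

wire = json.dumps(node)      # what a producer would emit
decoded = json.loads(wire)   # what a consumer would parse
assert base64.b64decode(decoded["value"]) == raw
```

This is independent of any async concern: the transport just ships the serialized text, as with the serde case described above.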
Are you referring to something like the functionality implemented in tokio-serde-json? If so, I think you could probably build something for these other formats assuming they have serde support (flatbuffers notably does _not_, partially because of its incessant need to own everything), since tokio_serde is doing most of the work in tokio-serde-json. In any case, I don't think it's a requirement for the compute IR that there be a streaming transport implementation for the format. > > Best, > Jorge > > > > On Thu, Aug 12, 2021 at 2:43 PM Antoine Pitrou wrote: > > > > > Le 12/08/2021 à 15:05, Wes McKinney a écrit : > > > It seems that one adjacent problem here is how to make it simpler for > > > third parties (especially ones that act as front end interfaces) to > > > build and serialize/deserialize the IR structures with some kind of > > > ready-to-go middleware library, written in a language like C++. > > > > A C++ library sounds a bit complicated to deal with for Java, Rust, Go, > > etc. developers. > > > > I'm not sure which design decision and set of compromises would make the > > most sense. But this is why I'm asking the question "why not JSON?" (+ > > JSON-Schema if you want to ease validation by third parties). > > > > (note I have already mentioned MsgPack, but only in the case a binary > > encoding is really required; it doesn't have any other advantage that I > > know of over JSON, and it's less ubiquitous) > > > > Regards > > > > Antoine. > > > > > > > To do that, one would need the equivalent of arrow/type.h and related > > > Flatbuffers schema serialization code that lives in arrow/ipc. If you > > > want to be able to completely and accurately serialize Schemas, you > > > need quite a bit of code now. > > > > > > One possible approach (and not go crazy) would be to: > > > > > > * Move arrow/types.h and its dependencies into a standalone C++ > > > library that can be vendored into the main apache/arrow C++ library. 
I > > > don't know how onerous arrow/types.h's transitive dependencies / > > > interactions are at this point (there's a lot of stuff going on in > > > type.cc [1] now) > > > * Make the namespaces exported by this library configurable, so any > > > library can vendor the Arrow types / IR builder APIs privately into > > > their project > > > * Maintain this "Arrow types and ComputeIR library" as an always > > > zero-dependency library to facilitate vendoring > > > * Lightweight bindings in languages we care about (like Python or R or > > > GLib/Ruby) could be built to the IR builder middleware library > > > > > > This seems like what is more at issue compared with rather projects > > > are copying the Flatbuffers files out of their project from > > > apache/arrow or apache/arrow-format. > > > > > > [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc > > > > > > On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb > > wrote: > > >> > > >> I support the idea of an independent repo that has the arrow > flatbuffers > > >>
Re: [DISCUSS] Splitting out the Arrow format directory
I agree with Antoine that we should weigh the pros and cons of flatbuffers (or protobuf or thrift for that matter) over a more human-friendly, simpler format like json or MsgPack. I also struggle a bit to reason with the complexity of using flatbuffers for this. E.g. there is no async support for thrift, flatbuffers, or protobuf in Rust, which means that we can't read either parquet or arrow IPC asynchronously at the moment. These problems are usually easier to work around in simpler formats. Best, Jorge On Thu, Aug 12, 2021 at 2:43 PM Antoine Pitrou wrote: > > Le 12/08/2021 à 15:05, Wes McKinney a écrit : > > It seems that one adjacent problem here is how to make it simpler for > > third parties (especially ones that act as front end interfaces) to > > build and serialize/deserialize the IR structures with some kind of > > ready-to-go middleware library, written in a language like C++. > > A C++ library sounds a bit complicated to deal with for Java, Rust, Go, > etc. developers. > > I'm not sure which design decision and set of compromises would make the > most sense. But this is why I'm asking the question "why not JSON?" (+ > JSON-Schema if you want to ease validation by third parties). > > (note I have already mentioned MsgPack, but only in the case a binary > encoding is really required; it doesn't have any other advantage that I > know of over JSON, and it's less ubiquitous) > > Regards > > Antoine. > > > > To do that, one would need the equivalent of arrow/type.h and related > > Flatbuffers schema serialization code that lives in arrow/ipc. If you > > want to be able to completely and accurately serialize Schemas, you > > need quite a bit of code now. > > > > One possible approach (and not go crazy) would be to: > > > > * Move arrow/types.h and its dependencies into a standalone C++ > > library that can be vendored into the main apache/arrow C++ library. 
I > > don't know how onerous arrow/types.h's transitive dependencies / > > interactions are at this point (there's a lot of stuff going on in > > type.cc [1] now) > > * Make the namespaces exported by this library configurable, so any > > library can vendor the Arrow types / IR builder APIs privately into > > their project > > * Maintain this "Arrow types and ComputeIR library" as an always > > zero-dependency library to facilitate vendoring > > * Lightweight bindings in languages we care about (like Python or R or > > GLib/Ruby) could be built to the IR builder middleware library > > > > This seems like what is more at issue compared with rather projects > > are copying the Flatbuffers files out of their project from > > apache/arrow or apache/arrow-format. > > > > [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc > > > > On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb > wrote: > >> > >> I support the idea of an independent repo that has the arrow flatbuffers > >> format definition files. > >> > >> My rationale is that the Rust implementation has a copy of the `format` > >> directory [1] and potential drift worries me (a bit). Having a single > >> source of truth for the format that is not part of the large mono repo > >> would be a good thing. > >> > >> Andrew > >> > >> [1] https://github.com/apache/arrow-rs/tree/master/format > >> > >> On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud > wrote: > >> > >>> Hi all, > >>> > >>> I'd like to bring up an idea from a recent thread ([1]) about moving > the > >>> `format/` directory out of the primary apache/arrow repository. > >>> > >>> I understand from that thread there are some concerns about using > >>> submodules, > >>> and I definitely sympathize with those concerns. 
> >>> > >>> In talking with David Li (disclaimer: we work together at Voltron > Data), he > >>> has > >>> a great idea that I think makes everyone happy: an > `apache/arrow-format` > >>> repository that is the official mirror for the flatbuffers IDL, that > >>> library > >>> authors should use as the source of truth. > >>> > >>> It doesn't require a submodule, yet it also allows external projects > the > >>> ability to access the IDL without having to interact with the main > arrow > >>> repository and is backwards compatible to boot. > >>> > >>> In this scenario, repositories that are currently copying in the > >>> flatbuffers > >>> IDL can migrate to this repository at their leisure. > >>> > >>> My motivation for this was around sharing data structures for the > compute > >>> IR > >>> proposal ([2]). > >>> > >>> I can think of at least two ways for IR producers and consumers of all > >>> languages to share the flatbuffers IDL: > >>> > >>> 1. A set of bindings built in some language that other languages can > >>> integrate > >>> with, likely C++, that allows library users to build IR using an > API. > >>> > >>> The primary downside to this is that we'd have to deal with > >>> building another library while working out any kinks in the IR design > and > >>> I'd > >>> rather avoid that in the initial phases of this
Re: [DISCUSS] Splitting out the Arrow format directory
Le 12/08/2021 à 15:05, Wes McKinney a écrit : It seems that one adjacent problem here is how to make it simpler for third parties (especially ones that act as front end interfaces) to build and serialize/deserialize the IR structures with some kind of ready-to-go middleware library, written in a language like C++. A C++ library sounds a bit complicated to deal with for Java, Rust, Go, etc. developers. I'm not sure which design decision and set of compromises would make the most sense. But this is why I'm asking the question "why not JSON?" (+ JSON-Schema if you want to ease validation by third parties). (note I have already mentioned MsgPack, but only in the case a binary encoding is really required; it doesn't have any other advantage that I know of over JSON, and it's less ubiquitous) Regards Antoine. To do that, one would need the equivalent of arrow/type.h and related Flatbuffers schema serialization code that lives in arrow/ipc. If you want to be able to completely and accurately serialize Schemas, you need quite a bit of code now. One possible approach (and not go crazy) would be to: * Move arrow/types.h and its dependencies into a standalone C++ library that can be vendored into the main apache/arrow C++ library. I don't know how onerous arrow/types.h's transitive dependencies / interactions are at this point (there's a lot of stuff going on in type.cc [1] now) * Make the namespaces exported by this library configurable, so any library can vendor the Arrow types / IR builder APIs privately into their project * Maintain this "Arrow types and ComputeIR library" as an always zero-dependency library to facilitate vendoring * Lightweight bindings in languages we care about (like Python or R or GLib/Ruby) could be built to the IR builder middleware library This seems like what is more at issue compared with whether projects are copying the Flatbuffers files into their projects from apache/arrow or apache/arrow-format. 
[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb wrote: I support the idea of an independent repo that has the arrow flatbuffers format definition files. My rationale is that the Rust implementation has a copy of the `format` directory [1] and potential drift worries me (a bit). Having a single source of truth for the format that is not part of the large mono repo would be a good thing. Andrew [1] https://github.com/apache/arrow-rs/tree/master/format On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud wrote: Hi all, I'd like to bring up an idea from a recent thread ([1]) about moving the `format/` directory out of the primary apache/arrow repository. I understand from that thread there are some concerns about using submodules, and I definitely sympathize with those concerns. In talking with David Li (disclaimer: we work together at Voltron Data), he has a great idea that I think makes everyone happy: an `apache/arrow-format` repository that is the official mirror for the flatbuffers IDL, that library authors should use as the source of truth. It doesn't require a submodule, yet it also allows external projects the ability to access the IDL without having to interact with the main arrow repository and is backwards compatible to boot. In this scenario, repositories that are currently copying in the flatbuffers IDL can migrate to this repository at their leisure. My motivation for this was around sharing data structures for the compute IR proposal ([2]). I can think of at least two ways for IR producers and consumers of all languages to share the flatbuffers IDL: 1. A set of bindings built in some language that other languages can integrate with, likely C++, that allows library users to build IR using an API. The primary downside to this is that we'd have to deal with building another library while working out any kinks in the IR design and I'd rather avoid that in the initial phases of this project. 
The benefit is that IR components don't interact much with `flatbuffers` or `flatc` directly. 2. A single location where the format lives, that doesn't require depending on a large multi-language repository to access a handful of files. I think the downside to this is that there's a bit of additional infrastructure to automate copying in `arrow-format`. The benefit there is that producers and consumers can immediately start getting value from compute IR without having to wait for development of a new API. One counter-proposal might be to just put the compute IR IDL in a separate repo, but that isn't tenable because the compute IR needs arrow's type information contained in `Schema.fbs`. I was hoping to avoid conflating the discussion about bindings vs direct flatbuffer usage (at least initially just supporting one, I predict we'll need both ultimately) with the decision about whether to split out the format directory, but it's a good example of a choice for which splitting out
Re: [DISCUSS] Splitting out the Arrow format directory
On Thu, Aug 12, 2021 at 3:16 PM Neal Richardson wrote: > > > Maintain this "Arrow types and ComputeIR library" as an always > zero-dependency library to facilitate vendoring > > Would/should this hypothetical zero-dep, vendorable library also include > the IPC format? Or if you want to interact with IPC in that case, the C > data interface is the best/only option? No, to do anything with the IPC format would pull in arrow::Buffer, arrow::Array, and many other inextricable components which are used with the IPC read/write implementation. > Or if you want to interact with IPC in that case, the C data interface is the > best/only option? I'm not clear on what you mean since the C data interface is only for data interchange at function call sites in-process, and not for serialization (interprocess). > On Thu, Aug 12, 2021 at 9:06 AM Wes McKinney wrote: > > > It seems that one adjacent problem here is how to make it simpler for > > third parties (especially ones that act as front end interfaces) to > > build and serialize/deserialize the IR structures with some kind of > > ready-to-go middleware library, written in a language like C++. > > > > To do that, one would need the equivalent of arrow/type.h and related > > Flatbuffers schema serialization code that lives in arrow/ipc. If you > > want to be able to completely and accurately serialize Schemas, you > > need quite a bit of code now. > > > > One possible approach (and not go crazy) would be to: > > > > * Move arrow/types.h and its dependencies into a standalone C++ > > library that can be vendored into the main apache/arrow C++ library. 
I > > don't know how onerous arrow/types.h's transitive dependencies / > > interactions are at this point (there's a lot of stuff going on in > > type.cc [1] now) > > * Make the namespaces exported by this library configurable, so any > > library can vendor the Arrow types / IR builder APIs privately into > > their project > > * Maintain this "Arrow types and ComputeIR library" as an always > > zero-dependency library to facilitate vendoring > > * Lightweight bindings in languages we care about (like Python or R or > > GLib/Ruby) could be built to the IR builder middleware library > > > > This seems like what is more at issue compared with rather projects > > are copying the Flatbuffers files out of their project from > > apache/arrow or apache/arrow-format. > > > > [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc > > > > On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb wrote: > > > > > > I support the idea of an independent repo that has the arrow flatbuffers > > > format definition files. > > > > > > My rationale is that the Rust implementation has a copy of the `format` > > > directory [1] and potential drift worries me (a bit). Having a single > > > source of truth for the format that is not part of the large mono repo > > > would be a good thing. > > > > > > Andrew > > > > > > [1] https://github.com/apache/arrow-rs/tree/master/format > > > > > > On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud wrote: > > > > > > > Hi all, > > > > > > > > I'd like to bring up an idea from a recent thread ([1]) about moving > > the > > > > `format/` directory out of the primary apache/arrow repository. > > > > > > > > I understand from that thread there are some concerns about using > > > > submodules, > > > > and I definitely sympathize with those concerns. 
> > > > > > > > In talking with David Li (disclaimer: we work together at Voltron > > Data), he > > > > has > > > > a great idea that I think makes everyone happy: an > > `apache/arrow-format` > > > > repository that is the official mirror for the flatbuffers IDL, that > > > > library > > > > authors should use as the source of truth. > > > > > > > > It doesn't require a submodule, yet it also allows external projects > > the > > > > ability to access the IDL without having to interact with the main > > arrow > > > > repository and is backwards compatible to boot. > > > > > > > > In this scenario, repositories that are currently copying in the > > > > flatbuffers > > > > IDL can migrate to this repository at their leisure. > > > > > > > > My motivation for this was around sharing data structures for the > > compute > > > > IR > > > > proposal ([2]). > > > > > > > > I can think of at least two ways for IR producers and consumers of all > > > > languages to share the flatbuffers IDL: > > > > > > > > 1. A set of bindings built in some language that other languages can > > > > integrate > > > >with, likely C++, that allows library users to build IR using an > > API. > > > > > > > > The primary downside to this is that we'd have to deal with > > > > building another library while working out any kinks in the IR design > > and > > > > I'd > > > > rather avoid that in the initial phases of this project. > > > > > > > > The benefit is that IR components don't interact much with > > `flatbuffers` or > > > > `flatc` directly. > > > > > > > > 2. A single location
Re: [DISCUSS] Splitting out the Arrow format directory
> Maintain this "Arrow types and ComputeIR library" as an always zero-dependency library to facilitate vendoring Would/should this hypothetical zero-dep, vendorable library also include the IPC format? Or if you want to interact with IPC in that case, the C data interface is the best/only option? On Thu, Aug 12, 2021 at 9:06 AM Wes McKinney wrote: > It seems that one adjacent problem here is how to make it simpler for > third parties (especially ones that act as front end interfaces) to > build and serialize/deserialize the IR structures with some kind of > ready-to-go middleware library, written in a language like C++. > > To do that, one would need the equivalent of arrow/type.h and related > Flatbuffers schema serialization code that lives in arrow/ipc. If you > want to be able to completely and accurately serialize Schemas, you > need quite a bit of code now. > > One possible approach (and not go crazy) would be to: > > * Move arrow/types.h and its dependencies into a standalone C++ > library that can be vendored into the main apache/arrow C++ library. I > don't know how onerous arrow/types.h's transitive dependencies / > interactions are at this point (there's a lot of stuff going on in > type.cc [1] now) > * Make the namespaces exported by this library configurable, so any > library can vendor the Arrow types / IR builder APIs privately into > their project > * Maintain this "Arrow types and ComputeIR library" as an always > zero-dependency library to facilitate vendoring > * Lightweight bindings in languages we care about (like Python or R or > GLib/Ruby) could be built to the IR builder middleware library > > This seems like what is more at issue compared with rather projects > are copying the Flatbuffers files out of their project from > apache/arrow or apache/arrow-format. 
> > [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc > > On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb wrote: > > > > I support the idea of an independent repo that has the arrow flatbuffers > > format definition files. > > > > My rationale is that the Rust implementation has a copy of the `format` > > directory [1] and potential drift worries me (a bit). Having a single > > source of truth for the format that is not part of the large mono repo > > would be a good thing. > > > > Andrew > > > > [1] https://github.com/apache/arrow-rs/tree/master/format > > > > On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud wrote: > > > > > Hi all, > > > > > > I'd like to bring up an idea from a recent thread ([1]) about moving > the > > > `format/` directory out of the primary apache/arrow repository. > > > > > > I understand from that thread there are some concerns about using > > > submodules, > > > and I definitely sympathize with those concerns. > > > > > > In talking with David Li (disclaimer: we work together at Voltron > Data), he > > > has > > > a great idea that I think makes everyone happy: an > `apache/arrow-format` > > > repository that is the official mirror for the flatbuffers IDL, that > > > library > > > authors should use as the source of truth. > > > > > > It doesn't require a submodule, yet it also allows external projects > the > > > ability to access the IDL without having to interact with the main > arrow > > > repository and is backwards compatible to boot. > > > > > > In this scenario, repositories that are currently copying in the > > > flatbuffers > > > IDL can migrate to this repository at their leisure. > > > > > > My motivation for this was around sharing data structures for the > compute > > > IR > > > proposal ([2]). > > > > > > I can think of at least two ways for IR producers and consumers of all > > > languages to share the flatbuffers IDL: > > > > > > 1. 
A set of bindings built in some language that other languages can > > > integrate > > >with, likely C++, that allows library users to build IR using an > API. > > > > > > The primary downside to this is that we'd have to deal with > > > building another library while working out any kinks in the IR design > and > > > I'd > > > rather avoid that in the initial phases of this project. > > > > > > The benefit is that IR components don't interact much with > `flatbuffers` or > > > `flatc` directly. > > > > > > 2. A single location where the format lives, that doesn't require > depending > > > on > > >a large multi-language repository to access a handful of files. > > > > > > I think the downside to this is that there's a bit of additional > > > infrastructure > > > to automate copying in `arrow-format`. > > > > > > The benefit there is that producers and consumers can immediately start > > > getting > > > value from compute IR without having to wait for development of a new > API. > > > > > > One counter-proposal might be to just put the compute IR IDL in a > separate > > > repo, > > > but that isn't tenable because the compute IR needs arrow's type > > > information > > > contained in `Schema.fbs`. > > > > > > I was hoping to avoid
Re: [DISCUSS] Splitting out the Arrow format directory
On Thu, Aug 12, 2021 at 9:06 AM Wes McKinney wrote: > It seems that one adjacent problem here is how to make it simpler for > third parties (especially ones that act as front end interfaces) to > build and serialize/deserialize the IR structures with some kind of > ready-to-go middleware library, written in a language like C++. > > To do that, one would need the equivalent of arrow/type.h and related > Flatbuffers schema serialization code that lives in arrow/ipc. If you > want to be able to completely and accurately serialize Schemas, you > need quite a bit of code now. > > One possible approach (and not go crazy) would be to: > > * Move arrow/types.h and its dependencies into a standalone C++ > library that can be vendored into the main apache/arrow C++ library. I > don't know how onerous arrow/types.h's transitive dependencies / > interactions are at this point (there's a lot of stuff going on in > type.cc [1] now) > * Make the namespaces exported by this library configurable, so any > library can vendor the Arrow types / IR builder APIs privately into > their project > * Maintain this "Arrow types and ComputeIR library" as an always > zero-dependency library to facilitate vendoring > * Lightweight bindings in languages we care about (like Python or R or > GLib/Ruby) could be built to the IR builder middleware library > > This seems like what is more at issue compared with rather projects > are copying the Flatbuffers files out of their project from > apache/arrow or apache/arrow-format. I was hoping we could avoid doing something like this until there's a clear need for it, in the interest of not spending a huge amount of time on adjacent dependency-management work. My thinking is that the primary effort should be around solidifying the IR design: keeping it possible for folks to test, but not spending a bunch of time up front building a middleware library. 
I think the use case of simplifying external consumption of the arrow format might even deserve its own dedicated mailing list thread. > > [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc > > On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb wrote: > > > > I support the idea of an independent repo that has the arrow flatbuffers > > format definition files. > > > > My rationale is that the Rust implementation has a copy of the `format` > > directory [1] and potential drift worries me (a bit). Having a single > > source of truth for the format that is not part of the large mono repo > > would be a good thing. > > > > Andrew > > > > [1] https://github.com/apache/arrow-rs/tree/master/format > > > > On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud wrote: > > > > > Hi all, > > > > > > I'd like to bring up an idea from a recent thread ([1]) about moving > the > > > `format/` directory out of the primary apache/arrow repository. > > > > > > I understand from that thread there are some concerns about using > > > submodules, > > > and I definitely sympathize with those concerns. > > > > > > In talking with David Li (disclaimer: we work together at Voltron > Data), he > > > has > > > a great idea that I think makes everyone happy: an > `apache/arrow-format` > > > repository that is the official mirror for the flatbuffers IDL, that > > > library > > > authors should use as the source of truth. > > > > > > It doesn't require a submodule, yet it also allows external projects > the > > > ability to access the IDL without having to interact with the main > arrow > > > repository and is backwards compatible to boot. > > > > > > In this scenario, repositories that are currently copying in the > > > flatbuffers > > > IDL can migrate to this repository at their leisure. > > > > > > My motivation for this was around sharing data structures for the > compute > > > IR > > > proposal ([2]). 
> > > > > > I can think of at least two ways for IR producers and consumers of all > > > languages to share the flatbuffers IDL: > > > > > > 1. A set of bindings built in some language that other languages can > > > integrate > > >with, likely C++, that allows library users to build IR using an > API. > > > > > > The primary downside to this is that we'd have to deal with > > > building another library while working out any kinks in the IR design > and > > > I'd > > > rather avoid that in the initial phases of this project. > > > > > > The benefit is that IR components don't interact much with > `flatbuffers` or > > > `flatc` directly. > > > > > > 2. A single location where the format lives, that doesn't require > depending > > > on > > >a large multi-language repository to access a handful of files. > > > > > > I think the downside to this is that there's a bit of additional > > > infrastructure > > > to automate copying in `arrow-format`. > > > > > > The benefit there is that producers and consumers can immediately start > > > getting > > > value from compute IR without having to wait for development of a new > API. > > > > > > One
Re: [DISCUSS] Splitting out the Arrow format directory
It seems that one adjacent problem here is how to make it simpler for third parties (especially ones that act as front end interfaces) to build and serialize/deserialize the IR structures with some kind of ready-to-go middleware library, written in a language like C++. To do that, one would need the equivalent of arrow/type.h and related Flatbuffers schema serialization code that lives in arrow/ipc. If you want to be able to completely and accurately serialize Schemas, you need quite a bit of code now. One possible approach (and not go crazy) would be to: * Move arrow/types.h and its dependencies into a standalone C++ library that can be vendored into the main apache/arrow C++ library. I don't know how onerous arrow/types.h's transitive dependencies / interactions are at this point (there's a lot of stuff going on in type.cc [1] now) * Make the namespaces exported by this library configurable, so any library can vendor the Arrow types / IR builder APIs privately into their project * Maintain this "Arrow types and ComputeIR library" as an always zero-dependency library to facilitate vendoring * Lightweight bindings in languages we care about (like Python or R or GLib/Ruby) could be built to the IR builder middleware library This seems like what is more at issue compared with whether projects are copying the Flatbuffers files into their projects from apache/arrow or apache/arrow-format. [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb wrote: > > I support the idea of an independent repo that has the arrow flatbuffers > format definition files. > > My rationale is that the Rust implementation has a copy of the `format` > directory [1] and potential drift worries me (a bit). Having a single > source of truth for the format that is not part of the large mono repo > would be a good thing. 
> > Andrew > > [1] https://github.com/apache/arrow-rs/tree/master/format > > On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud wrote: > > > Hi all, > > > > I'd like to bring up an idea from a recent thread ([1]) about moving the > > `format/` directory out of the primary apache/arrow repository. > > > > I understand from that thread there are some concerns about using > > submodules, > > and I definitely sympathize with those concerns. > > > > In talking with David Li (disclaimer: we work together at Voltron Data), he > > has > > a great idea that I think makes everyone happy: an `apache/arrow-format` > > repository that is the official mirror for the flatbuffers IDL, that > > library > > authors should use as the source of truth. > > > > It doesn't require a submodule, yet it also allows external projects the > > ability to access the IDL without having to interact with the main arrow > > repository and is backwards compatible to boot. > > > > In this scenario, repositories that are currently copying in the > > flatbuffers > > IDL can migrate to this repository at their leisure. > > > > My motivation for this was around sharing data structures for the compute > > IR > > proposal ([2]). > > > > I can think of at least two ways for IR producers and consumers of all > > languages to share the flatbuffers IDL: > > > > 1. A set of bindings built in some language that other languages can > > integrate > > with, likely C++, that allows library users to build IR using an API. > > > > The primary downside to this is that we'd have to deal with > > building another library while working out any kinks in the IR design and > > I'd > > rather avoid that in the initial phases of this project. > > > > The benefit is that IR components don't interact much with `flatbuffers` or > > `flatc` directly. > > > > 2. A single location where the format lives, that doesn't require depending > > on > > a large multi-language repository to access a handful of files. 
> > > > I think the downside to this is that there's a bit of additional > > infrastructure > > to automate copying in `arrow-format`. > > > > The benefit there is that producers and consumers can immediately start > > getting > > value from compute IR without having to wait for development of a new API. > > > > One counter-proposal might be to just put the compute IR IDL in a separate > > repo, > > but that isn't tenable because the compute IR needs arrow's type > > information > > contained in `Schema.fbs`. > > > > I was hoping to avoid conflating the discussion about bindings vs direct > > flatbuffer usage (at least initially just supporting one, I predict we'll > > need > > both ultimately) with the decision about whether to split out the format > > directory, but it's a good example of a choice for which splitting out the > > format directory would be well-served. > > > > I'll note that this doesn't block anything on the compute IR side, just > > wanted > > to surface this in a parallel thread and see what folks think. > > > > [1]: > > > >
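To make Wes's middleware idea above a bit more concrete: a purely hypothetical sketch of the kind of idiomatic surface that "lightweight bindings" to a vendorable Arrow-types/IR-builder library might expose from Python. Every name below (`DataType`, `Field`, `int8`, `list_`, `schema`) is invented for illustration; none of this is an actual Arrow API.

```python
# Hypothetical sketch of an idiomatic binding surface over an
# "Arrow types / IR builder" middleware library. Illustrative only.
from dataclasses import dataclass, field as dc_field
from typing import List


@dataclass(frozen=True)
class DataType:
    """A type node; nested types carry their child fields."""
    name: str
    children: List["Field"] = dc_field(default_factory=list)


@dataclass(frozen=True)
class Field:
    """A named, typed, possibly-nullable field, as in Schema.fbs."""
    name: str
    type: DataType
    nullable: bool = True


def int8() -> DataType:
    return DataType("int8")


def list_(item: Field) -> DataType:
    # A list type wraps exactly one child field (its value type).
    return DataType("list", children=[item])


def schema(fields: List[Field]) -> List[Field]:
    return list(fields)


s = schema([
    Field("id", int8(), nullable=False),
    Field("tags", list_(Field("item", DataType("utf8")))),
])
```

The point of such a layer is exactly Wes's: front ends build `Field`/`DataType` values through a small API and never touch `flatbuffers` or `flatc` directly.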
Re: [DISCUSS] Splitting out the Arrow format directory
I support the idea of an independent repo that has the arrow flatbuffers format definition files. My rationale is that the Rust implementation has a copy of the `format` directory [1] and potential drift worries me (a bit). Having a single source of truth for the format that is not part of the large mono repo would be a good thing. Andrew [1] https://github.com/apache/arrow-rs/tree/master/format On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud wrote: > Hi all, > > I'd like to bring up an idea from a recent thread ([1]) about moving the > `format/` directory out of the primary apache/arrow repository. > > I understand from that thread there are some concerns about using > submodules, > and I definitely sympathize with those concerns. > > In talking with David Li (disclaimer: we work together at Voltron Data), he > has > a great idea that I think makes everyone happy: an `apache/arrow-format` > repository that is the official mirror for the flatbuffers IDL, that > library > authors should use as the source of truth. > > It doesn't require a submodule, yet it also allows external projects the > ability to access the IDL without having to interact with the main arrow > repository and is backwards compatible to boot. > > In this scenario, repositories that are currently copying in the > flatbuffers > IDL can migrate to this repository at their leisure. > > My motivation for this was around sharing data structures for the compute > IR > proposal ([2]). > > I can think of at least two ways for IR producers and consumers of all > languages to share the flatbuffers IDL: > > 1. A set of bindings built in some language that other languages can > integrate > with, likely C++, that allows library users to build IR using an API. > > The primary downside to this is that we'd have to deal with > building another library while working out any kinks in the IR design and > I'd > rather avoid that in the initial phases of this project. 
> > The benefit is that IR components don't interact much with `flatbuffers` or > `flatc` directly. > > 2. A single location where the format lives, that doesn't require depending > on > a large multi-language repository to access a handful of files. > > I think the downside to this is that there's a bit of additional > infrastructure > to automate copying in `arrow-format`. > > The benefit there is that producers and consumers can immediately start > getting > value from compute IR without having to wait for development of a new API. > > One counter-proposal might be to just put the compute IR IDL in a separate > repo, > but that isn't tenable because the compute IR needs arrow's type > information > contained in `Schema.fbs`. > > I was hoping to avoid conflating the discussion about bindings vs direct > flatbuffer usage (at least initially just supporting one, I predict we'll > need > both ultimately) with the decision about whether to split out the format > directory, but it's a good example of a choice for which splitting out the > format directory would be well-served. > > I'll note that this doesn't block anything on the compute IR side, just > wanted > to surface this in a parallel thread and see what folks think. > > [1]: > > https://lists.apache.org/thread.html/rcebfcb4c5d0b7752fcdda6587871c2f94661b8c4e35119f0bcfb883b%40%3Cdev.arrow.apache.org%3E > [2]: > > https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.ie0ne0gm762l >
Re: [DISCUSS] Splitting out the Arrow format directory
On Wed, Aug 11, 2021, 19:05 Weston Pace wrote: > >> The benefit is that IR components don't interact much with > `flatbuffers` or > >> `flatc` directly. > >> > [...] > >> > >> One counter-proposal might be to just put the compute IR IDL in a > separate > >> repo, > >> but that isn't tenable because the compute IR needs arrow's type > information > >> contained in `Schema.fbs`. > > > This argument seems predicated on the hypothesis that the compute IR will > > use Flatbuffers. Is it set in stone? > > +1 for the original proposal (mirror repo for specs). I don't think > we have to figure out the IR format. It makes sense for all language > independent specs to be in a single place regardless of format. If IR > picked JSON I would still argue the JSON schemas for IR belong in the > same repository as the Arrow columnar format flatbuffers files. It > makes it clear what is spec and what is implementation / toolkit. > Especially since a mirror repo should be pretty low maintenance. > That's a good point. I hadn't considered that point of view, but I think you're right that specs, regardless of wire format, should remain together. > On Wed, Aug 11, 2021 at 11:34 AM Antoine Pitrou > wrote: > > > > > > Le 11/08/2021 à 23:06, Phillip Cloud a écrit : > > > On Wed, Aug 11, 2021 at 4:22 PM Antoine Pitrou > wrote: > > > > > >> Le 11/08/2021 à 22:16, Phillip Cloud a écrit : > > >>> > > >>> Yeah, that is a drawback here, though I don't see needing to run > flatc > > >> as a > > >>> major downside given the upside > > >>> of not having to write additional code to move between formats. > > >> > > >> That's only an advantage if you already know how to read the Arrow IPC > > >> format (and, yes, in this case you already run `flatc`). Some > projects > > >> probably don't care about Arrow IPC (Dask, for example). > > > > > > > > > I don't think it's about the IPC though, at least for the compute IR > use > > > case. > > > Am I missing something there? 
> > > > If you're not handling the Arrow IPC format, then you probably don't > > have an encoder/decoder for Schema.fbs, so the "upside of not having to > > write additional code to move between formats" doesn't exist (unless I'm > > misunderstanding your point?). > > > > > I do think a downside of not using something like JSON or msgpack is > > > that schema validation must be implemented by both the producer and the > > > consumer. > > > That means we'd have a number of other consequential decisions to make: > > > > > > * Do we provide the validation library? > > > * If not, do all the languages arrow supports have high quality > libraries > > > for validating schemas? > > > * If so, then we have to implement/maintain/release/bugfix that. > > > > This is true. However, Flatbuffers doesn't validate much on its own, > > either, because its IDL is not expressive enough. For example, > > `Schema.fbs` allows you to declare a INT8 field with children, a LIST > > field without any children, a non-nullable NULL field... > > > > (also, there's JSON Schema: https://json-schema.org/) > > > > Regards > > > > Antoine. >
Re: [DISCUSS] Splitting out the Arrow format directory
>> The benefit is that IR components don't interact much with `flatbuffers` or >> `flatc` directly. >> [...] >> >> One counter-proposal might be to just put the compute IR IDL in a separate >> repo, >> but that isn't tenable because the compute IR needs arrow's type information >> contained in `Schema.fbs`. > This argument seems predicated on the hypothesis that the compute IR will > use Flatbuffers. Is it set in stone? +1 for the original proposal (mirror repo for specs). I don't think we have to figure out the IR format. It makes sense for all language independent specs to be in a single place regardless of format. If IR picked JSON I would still argue the JSON schemas for IR belong in the same repository as the Arrow columnar format flatbuffers files. It makes it clear what is spec and what is implementation / toolkit. Especially since a mirror repo should be pretty low maintenance. On Wed, Aug 11, 2021 at 11:34 AM Antoine Pitrou wrote: > > > Le 11/08/2021 à 23:06, Phillip Cloud a écrit : > > On Wed, Aug 11, 2021 at 4:22 PM Antoine Pitrou wrote: > > > >> Le 11/08/2021 à 22:16, Phillip Cloud a écrit : > >>> > >>> Yeah, that is a drawback here, though I don't see needing to run flatc > >> as a > >>> major downside given the upside > >>> of not having to write additional code to move between formats. > >> > >> That's only an advantage if you already know how to read the Arrow IPC > >> format (and, yes, in this case you already run `flatc`). Some projects > >> probably don't care about Arrow IPC (Dask, for example). > > > > > > I don't think it's about the IPC though, at least for the compute IR use > > case. > > Am I missing something there? > > If you're not handling the Arrow IPC format, then you probably don't > have an encoder/decoder for Schema.fbs, so the "upside of not having to > write additional code to move between formats" doesn't exist (unless I'm > misunderstanding your point?). 
> > > I do think a downside of not using something like JSON or msgpack is > > that schema validation must be implemented by both the producer and the > > consumer. > > That means we'd have a number of other consequential decisions to make: > > > > * Do we provide the validation library? > > * If not, do all the languages arrow supports have high quality libraries > > for validating schemas? > > * If so, then we have to implement/maintain/release/bugfix that. > > This is true. However, Flatbuffers doesn't validate much on its own, > either, because its IDL is not expressive enough. For example, > `Schema.fbs` allows you to declare a INT8 field with children, a LIST > field without any children, a non-nullable NULL field... > > (also, there's JSON Schema: https://json-schema.org/) > > Regards > > Antoine.
Re: [DISCUSS] Splitting out the Arrow format directory
Le 11/08/2021 à 23:06, Phillip Cloud a écrit : > On Wed, Aug 11, 2021 at 4:22 PM Antoine Pitrou wrote: >> Le 11/08/2021 à 22:16, Phillip Cloud a écrit : >>> Yeah, that is a drawback here, though I don't see needing to run flatc as a major downside given the upside of not having to write additional code to move between formats. >> That's only an advantage if you already know how to read the Arrow IPC format (and, yes, in this case you already run `flatc`). Some projects probably don't care about Arrow IPC (Dask, for example). > I don't think it's about the IPC though, at least for the compute IR use case. Am I missing something there? If you're not handling the Arrow IPC format, then you probably don't have an encoder/decoder for Schema.fbs, so the "upside of not having to write additional code to move between formats" doesn't exist (unless I'm misunderstanding your point?). > I do think a downside of not using something like JSON or msgpack is that schema validation must be implemented by both the producer and the consumer. That means we'd have a number of other consequential decisions to make: > * Do we provide the validation library? > * If not, do all the languages arrow supports have high quality libraries for validating schemas? > * If so, then we have to implement/maintain/release/bugfix that. This is true. However, Flatbuffers doesn't validate much on its own, either, because its IDL is not expressive enough. For example, `Schema.fbs` allows you to declare an INT8 field with children, a LIST field without any children, a non-nullable NULL field... (also, there's JSON Schema: https://json-schema.org/) Regards Antoine.
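Antoine's examples of constraints the Flatbuffers IDL cannot express would, in practice, become semantic checks that every producer and consumer implements itself. A minimal illustrative sketch — the flat `(type_name, nullable, children)` shape is a simplified stand-in for illustration, not the real generated `Schema.fbs` classes:

```python
# Semantic checks that the Flatbuffers IDL itself cannot enforce.
# The field representation is a simplified stand-in, not a real Arrow API.
def validate_field(name, type_name, nullable, children):
    errors = []
    # Primitive types like int8 carry no child fields.
    if type_name == "int8" and children:
        errors.append(f"{name}: int8 field must not have children")
    # A list field must have exactly one child (its value type).
    if type_name == "list" and len(children) != 1:
        errors.append(f"{name}: list field must have exactly one child")
    # A null field is all-null, so declaring it non-nullable is contradictory.
    if type_name == "null" and not nullable:
        errors.append(f"{name}: null field must be nullable")
    return errors


# Each of the examples above trips a check the IDL would happily accept:
assert validate_field("a", "int8", True, ["child"]) != []
assert validate_field("b", "list", True, []) != []
assert validate_field("c", "null", False, []) != []
assert validate_field("d", "int8", True, []) == []
```

This is the kind of code that exists in every Arrow implementation regardless of serialization format, which is part of why the IDL's weak validation cuts both ways in this discussion.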
Re: [DISCUSS] Splitting out the Arrow format directory
On Wed, Aug 11, 2021 at 4:21 PM David Li wrote: > If the worry is public distribution (i.e. requiring all downstream > projects to also run flatc in their builds) we could perhaps ship a package > that just consists of the generated code (though that's definitely more > packaging burden, and won't help when you're doing development against > in-progress or unreleased changes). > > -David > Arrow need not take on yet another packaging burden here: library authors can run flatc during development and release cycles, and ship that code alongside (whatever that means for the specific language) their library code. End users of, say, ibis never need to think about having flatc around. > > On Wed, Aug 11, 2021, at 16:16, Phillip Cloud wrote: > > On Wed, Aug 11, 2021 at 4:05 PM Antoine Pitrou > wrote: > > > > > > > > Le 11/08/2021 à 22:02, Phillip Cloud a écrit : > > > > On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou > > > wrote: > > > > > > > >> > > > >> Le 11/08/2021 à 21:56, Phillip Cloud a écrit : > > > >>> I can see how that might be a bit circular. Let me start from the > > > >>> perspective of requirements. We want to be able to reuse the > arrow's > > > >> types > > > >>> and schema, without having to write additional code to move back > and > > > >> forth > > > >>> between compute IR and not-compute-IR. I think that leaves only > > > >> flatbuffers > > > >>> as an option. > > > >> > > > >> If that's the case then agreed (well, you can always embed as a raw > > > >> bytestring in other formats, but that wouldn't be pretty). > > > >> > > > >> I just wonder what the complexity of using Flatbuffers is for e.g. > > > Python. > > > >> > > > > > > > > IMO the complexity isn't high, but the generated code is definitely > not > > > > idiomatic ( > > > > https://google.github.io/flatbuffers/flatbuffers_guide_tutorial.html > ) > > > > > > Wow. And you also have to integrate `flatc` in your build chain? 
> > > > > > > Yeah, that is a drawback here, though I don't see needing to run flatc > as a > > major downside given the upside > > of not having to write additional code to move between formats. > > > > Is there something particularly onerous about needing to run a codegen > step > > in a build process > > (other than it being build-step number 1000 in a death by 1000 > build-steps > > scenario)? > > > > > > > > > > IMHO that compares poorly to JSON or MsgPack, for example. > > > > > > Regards > > > > > > Antoine. > > > > > >
Re: [DISCUSS] Splitting out the Arrow format directory
On Wed, Aug 11, 2021 at 4:22 PM Antoine Pitrou wrote: > > Le 11/08/2021 à 22:16, Phillip Cloud a écrit : > > > > Yeah, that is a drawback here, though I don't see needing to run flatc > as a > > major downside given the upside > > of not having to write additional code to move between formats. > > That's only an advantage if you already know how to read the Arrow IPC > format (and, yes, in this case you already run `flatc`). Some projects > probably don't care about Arrow IPC (Dask, for example). I don't think it's about the IPC though, at least for the compute IR use case. Am I missing something there? I do think a downside of not using something like JSON or msgpack is that schema validation must be implemented by both the producer and the consumer. That means we'd have a number of other consequential decisions to make: * Do we provide the validation library? * If not, do all the languages arrow supports have high quality libraries for validating schemas? * If so, then we have to implement/maintain/release/bugfix that. This isn't the case with fb or protos since they have done the work to produce valid schemas by definition. > > > Is there something particularly onerous about needing to run a codegen > step > > in a build process > > (other than it being build-step number 1000 in a death by 1000 > build-steps > > scenario)? > > Most Python packages (except perhaps Numpy, Pandas, PyArrow...) have a > very simple build configuration. Adding an external command in the mix > (that needs a non-standard dependency) isn't trivial. > I don't find this too compelling. One language's lack of modern dependency management tooling and refusal to make it easy to run external tools during that process doesn't seem like a strong reason to rule out flatbuffers here. I want to support everyone as best we can, but any choice we make here will have some tradeoffs. 
I see not being able to share the exact same schema and type information as a huge downside relative to the cost of having to run a binary during a build process. To be clear, users should _definitely_ not be running flatc, it's only library authors that should be running it as part of a development/build/release cycle. > > Regards > > Antoine. >
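For contrast, if the IR were JSON, the producer- and consumer-side validation Phillip describes might start out as hand-rolled checks like the sketch below (a real project would presumably reach for JSON Schema instead, as mentioned elsewhere in the thread). The message shape here is invented purely for illustration, not a proposed IR encoding:

```python
import json

# Invented strawman shape for an IR literal node; illustrative only.
REQUIRED_KEYS = ("kind", "type", "value")


def validate_literal(doc):
    """Return a list of validation errors (empty means valid)."""
    errors = [f"missing '{k}'" for k in REQUIRED_KEYS if k not in doc]
    if doc.get("kind", "literal") != "literal":
        errors.append("'kind' must be 'literal'")
    return errors


ok = json.loads('{"kind": "literal", "type": "int64", "value": 42}')
bad = json.loads('{"kind": "literal"}')
assert validate_literal(ok) == []
assert validate_literal(bad) == ["missing 'type'", "missing 'value'"]
```

With flatbuffers, by contrast, structural validity comes from the generated code, and only the semantic rules (see the type-constraint discussion above) need hand-written checks.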
Re: [DISCUSS] Splitting out the Arrow format directory
Le 11/08/2021 à 22:20, David Li a écrit : > If the worry is public distribution (i.e. requiring all downstream projects to also run flatc in their builds) we could perhaps ship a package that just consists of the generated code (though that's definitely more packaging burden, and won't help when you're doing development against in-progress or unreleased changes). Yes, we can do that. And in this case, we can even probably hide the Flatbuffers objects behind a more idiomatic API (such as nested dicts in Python). Regards Antoine.
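Antoine's "nested dicts" idea could look something like the sketch below. `FakeField` merely mimics the accessor style of flatbuffers-generated Python classes (the real generated API differs); only the wrapping idea is the point:

```python
# Sketch: wrapping flatbuffers-style accessor objects as nested dicts.
# FakeField is a stand-in for a generated Schema.fbs class, not the real thing.
class FakeField:
    def __init__(self, name, type_name, children=()):
        self._name, self._type, self._children = name, type_name, children

    # Generated flatbuffers Python code exposes data via methods like these.
    def Name(self):
        return self._name

    def TypeName(self):
        return self._type

    def ChildrenLength(self):
        return len(self._children)

    def Children(self, i):
        return self._children[i]


def field_to_dict(f):
    """Recursively convert an accessor-style field into a nested dict."""
    return {
        "name": f.Name(),
        "type": f.TypeName(),
        "children": [field_to_dict(f.Children(i))
                     for i in range(f.ChildrenLength())],
    }


root = FakeField("tags", "list", (FakeField("item", "utf8"),))
assert field_to_dict(root) == {
    "name": "tags",
    "type": "list",
    "children": [{"name": "item", "type": "utf8", "children": []}],
}
```

A shipped "generated code" package could bundle exactly this kind of thin converter so most users never touch the flatbuffers object API directly.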
Re: [DISCUSS] Splitting out the Arrow format directory
Le 11/08/2021 à 22:16, Phillip Cloud a écrit : > Yeah, that is a drawback here, though I don't see needing to run flatc as a major downside given the upside of not having to write additional code to move between formats. That's only an advantage if you already know how to read the Arrow IPC format (and, yes, in this case you already run `flatc`). Some projects probably don't care about Arrow IPC (Dask, for example). > Is there something particularly onerous about needing to run a codegen step in a build process (other than it being build-step number 1000 in a death by 1000 build-steps scenario)? Most Python packages (except perhaps Numpy, Pandas, PyArrow...) have a very simple build configuration. Adding an external command in the mix (that needs a non-standard dependency) isn't trivial. Regards Antoine.
Re: [DISCUSS] Splitting out the Arrow format directory
If the worry is public distribution (i.e. requiring all downstream projects to also run flatc in their builds) we could perhaps ship a package that just consists of the generated code (though that's definitely more packaging burden, and won't help when you're doing development against in-progress or unreleased changes). -David On Wed, Aug 11, 2021, at 16:16, Phillip Cloud wrote: > On Wed, Aug 11, 2021 at 4:05 PM Antoine Pitrou wrote: > > > > > Le 11/08/2021 à 22:02, Phillip Cloud a écrit : > > > On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou > > wrote: > > > > > >> > > >> Le 11/08/2021 à 21:56, Phillip Cloud a écrit : > > >>> I can see how that might be a bit circular. Let me start from the > > >>> perspective of requirements. We want to be able to reuse the arrow's > > >> types > > >>> and schema, without having to write additional code to move back and > > >> forth > > >>> between compute IR and not-compute-IR. I think that leaves only > > >> flatbuffers > > >>> as an option. > > >> > > >> If that's the case then agreed (well, you can always embed as a raw > > >> bytestring in other formats, but that wouldn't be pretty). > > >> > > >> I just wonder what the complexity of using Flatbuffers is for e.g. > > Python. > > >> > > > > > > IMO the complexity isn't high, but the generated code is definitely not > > > idiomatic ( > > > https://google.github.io/flatbuffers/flatbuffers_guide_tutorial.html) > > > > Wow. And you also have to integrate `flatc` in your build chain? > > > > Yeah, that is a drawback here, though I don't see needing to run flatc as a > major downside given the upside > of not having to write additional code to move between formats. > > Is there something particularly onerous about needing to run a codegen step > in a build process > (other than it being build-step number 1000 in a death by 1000 build-steps > scenario)? > > > > > > IMHO that compares poorly to JSON or MsgPack, for example. > > > > Regards > > > > Antoine. > > >
Re: [DISCUSS] Splitting out the Arrow format directory
On Wed, Aug 11, 2021 at 4:05 PM Antoine Pitrou wrote: > > Le 11/08/2021 à 22:02, Phillip Cloud a écrit : > > On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou > wrote: > > > >> > >> Le 11/08/2021 à 21:56, Phillip Cloud a écrit : > >>> I can see how that might be a bit circular. Let me start from the > >>> perspective of requirements. We want to be able to reuse the arrow's > >> types > >>> and schema, without having to write additional code to move back and > >> forth > >>> between compute IR and not-compute-IR. I think that leaves only > >> flatbuffers > >>> as an option. > >> > >> If that's the case then agreed (well, you can always embed as a raw > >> bytestring in other formats, but that wouldn't be pretty). > >> > >> I just wonder what the complexity of using Flatbuffers is for e.g. > Python. > >> > > > > IMO the complexity isn't high, but the generated code is definitely not > > idiomatic ( > > https://google.github.io/flatbuffers/flatbuffers_guide_tutorial.html) > > Wow. And you also have to integrate `flatc` in your build chain? > Yeah, that is a drawback here, though I don't see needing to run flatc as a major downside given the upside of not having to write additional code to move between formats. Is there something particularly onerous about needing to run a codegen step in a build process (other than it being build-step number 1000 in a death by 1000 build-steps scenario)? > > IMHO that compares poorly to JSON or MsgPack, for example. > > Regards > > Antoine. >
Re: [DISCUSS] Splitting out the Arrow format directory
Le 11/08/2021 à 22:02, Phillip Cloud a écrit : > On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou wrote: >> Le 11/08/2021 à 21:56, Phillip Cloud a écrit : >>> I can see how that might be a bit circular. Let me start from the perspective of requirements. We want to be able to reuse the arrow's types and schema, without having to write additional code to move back and forth between compute IR and not-compute-IR. I think that leaves only flatbuffers as an option. >> If that's the case then agreed (well, you can always embed as a raw bytestring in other formats, but that wouldn't be pretty). >> I just wonder what the complexity of using Flatbuffers is for e.g. Python. > IMO the complexity isn't high, but the generated code is definitely not idiomatic ( https://google.github.io/flatbuffers/flatbuffers_guide_tutorial.html) Wow. And you also have to integrate `flatc` in your build chain? IMHO that compares poorly to JSON or MsgPack, for example. Regards Antoine.
Re: [DISCUSS] Splitting out the Arrow format directory
On Wed, Aug 11, 2021 at 3:58 PM Antoine Pitrou wrote: > > Le 11/08/2021 à 21:56, Phillip Cloud a écrit : > > I can see how that might be a bit circular. Let me start from the > > perspective of requirements. We want to be able to reuse the arrow's > types > > and schema, without having to write additional code to move back and > forth > > between compute IR and not-compute-IR. I think that leaves only > flatbuffers > > as an option. > > If that's the case then agreed (well, you can always embed as a raw > bytestring in other formats, but that wouldn't be pretty). > > I just wonder what the complexity of using Flatbuffers is for e.g. Python. > IMO the complexity isn't high, but the generated code is definitely not idiomatic ( https://google.github.io/flatbuffers/flatbuffers_guide_tutorial.html) > > Regards > > Antoine. >
Re: [DISCUSS] Splitting out the Arrow format directory
Le 11/08/2021 à 21:56, Phillip Cloud a écrit : > I can see how that might be a bit circular. Let me start from the perspective of requirements. We want to be able to reuse the arrow's types and schema, without having to write additional code to move back and forth between compute IR and not-compute-IR. I think that leaves only flatbuffers as an option. If that's the case then agreed (well, you can always embed as a raw bytestring in other formats, but that wouldn't be pretty). I just wonder what the complexity of using Flatbuffers is for e.g. Python. Regards Antoine.
Re: [DISCUSS] Splitting out the Arrow format directory
I can see how that might be a bit circular. Let me start from the perspective of requirements. We want to be able to reuse the arrow's types and schema, without having to write additional code to move back and forth between compute IR and not-compute-IR. I think that leaves only flatbuffers as an option. On Wed, Aug 11, 2021 at 3:52 PM Phillip Cloud wrote: > On Wed, Aug 11, 2021 at 3:51 PM Antoine Pitrou wrote: >> >> >> Le 11/08/2021 à 21:39, Phillip Cloud a écrit : >> > The benefit is that IR components don't interact much with >> `flatbuffers` or >> > `flatc` directly. >> > >> [...] >> > >> > One counter-proposal might be to just put the compute IR IDL in a >> separate >> > repo, >> > but that isn't tenable because the compute IR needs arrow's type >> information >> > contained in `Schema.fbs`. >> >> This argument seems predicated on the hypothesis that the compute IR will >> use Flatbuffers. Is it set in stone? >> > > It's not set in stone, but so far it's the leading contender due to the > need to share elements of Schema.fbs. > > >> >> Regards >> >> Antoine. >> >
Re: [DISCUSS] Splitting out the Arrow format directory
On Wed, Aug 11, 2021 at 3:51 PM Antoine Pitrou wrote: > > > Le 11/08/2021 à 21:39, Phillip Cloud a écrit : > > The benefit is that IR components don't interact much with `flatbuffers` > or > > `flatc` directly. > > > [...] > > > > One counter-proposal might be to just put the compute IR IDL in a > separate > > repo, > > but that isn't tenable because the compute IR needs arrow's type > information > > contained in `Schema.fbs`. > > This argument seems predicated on the hypothesis that the compute IR will > use Flatbuffers. Is it set in stone? > It's not set in stone, but so far it's the leading contender due to the need to share elements of Schema.fbs. > > Regards > > Antoine. >
Re: [DISCUSS] Splitting out the Arrow format directory
Le 11/08/2021 à 21:39, Phillip Cloud a écrit : > The benefit is that IR components don't interact much with `flatbuffers` or > `flatc` directly. > [...] > One counter-proposal might be to just put the compute IR IDL in a separate repo, but that isn't tenable because the compute IR needs arrow's type information contained in `Schema.fbs`. This argument seems predicated on the hypothesis that the compute IR will use Flatbuffers. Is it set in stone? Regards Antoine.
[DISCUSS] Splitting out the Arrow format directory
Hi all, I'd like to bring up an idea from a recent thread ([1]) about moving the `format/` directory out of the primary apache/arrow repository. I understand from that thread there are some concerns about using submodules, and I definitely sympathize with those concerns. In talking with David Li (disclaimer: we work together at Voltron Data), he has a great idea that I think makes everyone happy: an `apache/arrow-format` repository that is the official mirror for the flatbuffers IDL, that library authors should use as the source of truth. It doesn't require a submodule, yet it also allows external projects the ability to access the IDL without having to interact with the main arrow repository and is backwards compatible to boot. In this scenario, repositories that are currently copying in the flatbuffers IDL can migrate to this repository at their leisure. My motivation for this was around sharing data structures for the compute IR proposal ([2]). I can think of at least two ways for IR producers and consumers of all languages to share the flatbuffers IDL: 1. A set of bindings built in some language that other languages can integrate with, likely C++, that allows library users to build IR using an API. The primary downside to this is that we'd have to deal with building another library while working out any kinks in the IR design and I'd rather avoid that in the initial phases of this project. The benefit is that IR components don't interact much with `flatbuffers` or `flatc` directly. 2. A single location where the format lives, that doesn't require depending on a large multi-language repository to access a handful of files. I think the downside to this is that there's a bit of additional infrastructure to automate copying in `arrow-format`. The benefit there is that producers and consumers can immediately start getting value from compute IR without having to wait for development of a new API. 
One counter-proposal might be to just put the compute IR IDL in a separate repo, but that isn't tenable because the compute IR needs arrow's type information contained in `Schema.fbs`. I was hoping to avoid conflating the discussion about bindings vs direct flatbuffer usage (at least initially just supporting one, I predict we'll need both ultimately) with the decision about whether to split out the format directory, but it's a good example of a choice for which splitting out the format directory would be well-served. I'll note that this doesn't block anything on the compute IR side, just wanted to surface this in a parallel thread and see what folks think. [1]: https://lists.apache.org/thread.html/rcebfcb4c5d0b7752fcdda6587871c2f94661b8c4e35119f0bcfb883b%40%3Cdev.arrow.apache.org%3E [2]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.ie0ne0gm762l