Hi,

The main benefit I see in a standard for queries is not in the
serialization format but in its semantics.

IMO one of the main reasons there is no standard for queries at the
protobuf level is that human-readability vastly outweighs serialization:
queries are at most a megabyte in size, and SQL (or JSON or TOML) is often
enough, and much easier to read, share, learn, etc.

With that said, IMO a core challenge of SQL today is that every engine
implements small semantic variations. These often derive from the many
tradeoffs each engine has to make, which are in turn often a consequence of
how it represents data in memory. Because Arrow imposes a fixed in-memory
format, those tradeoffs are more likely to be similar across
implementations, and thus semantic parity is easier to spec. (The
counterargument is that different engines have different goals and use
cases, and these induce different querying patterns and thus different
tradeoffs.)

The use case I see here would be to align on a SQL dialect and require
implementations to return the same *result* over a given IPC file; i.e.
SQL statement + IPC files = IPC file would hold true for every
implementation.

Best,
Jorge

On Thu, Mar 18, 2021 at 5:22 PM Andrew Lamb <al...@influxdata.com> wrote:

> Any higher level physical execution plan most likely needs a way to
> represent expressions. Thus focusing initially on a standard for
> expressions might be a good way to add value but keep the scope of the
> effort reasonable
>
> On Thu, Mar 18, 2021 at 11:49 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > I think there might be discussion on two levels of computation, physical
> > query execution plans, and potentially something "lower level"? When this
> > has come up in the past, I was a little skeptical of constraining every SDK
> > to use the same description, so I agree with Wes's point about keeping any
> > spec open in the short term. Ballista as an opt-in model, does sound like
> > possibly the right approach.
> >
> > I might be misunderstanding, but I think Weld [1] is another project
> > targeting the lower level components?
> >
> > Also, I think there was a little bit of effort to come up with a common
> > expression representation within C++, but got stalled on whether to use the
> > Gandiva expression descriptions or Flatbuffers, I can't seem to find the
> > thread/JIRA/discussion on this. I'll try to look some more this evening.
> >
> > [1] https://github.com/weld-project/weld
> >
> > On Thu, Mar 18, 2021 at 7:53 AM Jed Brown <j...@jedbrown.org> wrote:
> >
> > > I'm interested in providing some path to make this extensible. To pick an
> > > example, suppose the user wants to compute the first k principal
> > > components. We've talked [1] about the possibility of incorporating richer
> > > communication semantics in Ballista (a la MPI sub-communicators) and
> > > numerical algorithms such as PCA would benefit. Those specific algorithms
> > > wouldn't belong in Arrow or Ballista core, but I think there's an
> > > opportunity for plugins to offer this sort of capability and it would be
> > > lovely if the language-independent protocol could call them. Do you see a
> > > good way to do this via ballista.proto?
> > >
> > > [1] https://github.com/ballista-compute/ballista/issues/303
> > >
> > > Andy Grove <andygrov...@gmail.com> writes:
> > >
> > > > Hi Paddy,
> > > >
> > > > Thanks for raising this.
> > > >
> > > > Ballista defines computations using protobuf [1] to describe logical and
> > > > physical query plans, which consist of operators and expressions. It is
> > > > actually based on the Gandiva protobuf [2] for describing expressions.
> > > >
> > > > I see a lot of value in standardizing some of this across implementations.
> > > > Ballista is essentially becoming a distributed scheduler for Arrow and can
> > > > work with any implementation that supports this protobuf definition of
> > > > query plans.
> > > >
> > > > It would also make it easier to embed C++ in Rust, or Rust in C++, having
> > > > this common IR, so I would be all for having something like this as an
> > > > Arrow specification.
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
> > > >
> > > > [1] https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto
> > > > [2] https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto
> > > >
> > > > On Thu, Mar 18, 2021 at 7:40 AM paddy horan <paddyho...@hotmail.com> wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> I do not have a computer science background so I may not be asking this in
> > > >> the correct way or using the correct terminology but I wonder if we can
> > > >> achieve some level of standardization when describing computation over
> > > >> Arrow data.
> > > >>
> > > >> At the moment on the Rust side DataFusion clearly has a way to describe
> > > >> computation, I believe that Ballista adds the ability to serialize this to
> > > >> allow distributed computation. On the C++ side work is starting on a
> > > >> similar query engine and we already have Gandiva. Is there an opportunity
> > > >> to define a kind of IR for computation over Arrow data that could be
> > > >> adopted across implementations?
> > > >>
> > > >> In this case DataFusion could easily incorporate Gandiva to generate
> > > >> optimized compute kernels if they were using the same IR to describe
> > > >> computation. Applications built on Arrow could "describe" computation in
> > > >> any language and take advantage of innovations across the community, adding
> > > >> this to Arrow's zero copy data sharing could be a game changer in my mind.
> > > >> I'm not someone who knows enough to drive this forward but I obviously
> > > >> would like to get involved. For some time I was playing around with using
> > > >> TVM's relay IR [1] and applying it to Arrow data.
> > > >>
> > > >> As the Arrow memory format has now matured I feel like this could be the
> > > >> next step. Is there any plan for this kind of work or are we going to
> > > >> allow sub-projects to "go their own way"?
> > > >>
> > > >> Thanks,
> > > >> Paddy
> > > >>
> > > >> [1] - Introduction to Relay IR - tvm 0.8.dev0 documentation (apache.org)
> > > >> <https://tvm.apache.org/docs/dev/relay_intro.html>
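
As a rough illustration of the "SQL statement + IPC files = IPC file" idea
from the top of this message, here is a minimal Python sketch of what a
conformance check could look like. Everything in it is an assumption made
for illustration: a plain list of row tuples stands in for a decoded IPC
file, an "engine" is any callable taking a SQL string and an input table,
and the names (sqlite_engine, true_division_engine, nonconforming) are
hypothetical. SQLite is used only because it ships with Python; the toy
second engine shows the kind of small semantic variation (integer vs.
floating-point division) that a shared spec would have to pin down.

```python
import sqlite3
from collections import Counter
from typing import Callable, Dict, List, Tuple

Row = Tuple[object, ...]
Table = List[Row]                       # stand-in for a decoded IPC file
Engine = Callable[[str, Table], Table]  # (sql, input table "t") -> result


def sqlite_engine(sql: str, rows: Table) -> Table:
    """Run `sql` against a single-column integer table t(x) using SQLite."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE t (x INTEGER)")
        con.executemany("INSERT INTO t VALUES (?)", rows)
        return con.execute(sql).fetchall()
    finally:
        con.close()


def true_division_engine(sql: str, rows: Table) -> Table:
    """Toy engine: understands one query only and divides as floats."""
    assert sql == "SELECT x / 2 FROM t"
    return [(x / 2,) for (x,) in rows]


def same_result(a: Table, b: Table) -> bool:
    """Compare as multisets of rows: without ORDER BY, row order is unspecified.

    Note this uses Python equality, so 4 == 4.0; a real check would also
    compare the result schema (column names and types).
    """
    return Counter(a) == Counter(b)


def nonconforming(engines: Dict[str, Engine], sql: str, rows: Table) -> List[str]:
    """Names of engines whose result differs from the first engine's result."""
    names = list(engines)
    reference = engines[names[0]](sql, rows)
    return [n for n in names[1:]
            if not same_result(engines[n](sql, rows), reference)]


if __name__ == "__main__":
    engines = {"sqlite": sqlite_engine, "true_division": true_division_engine}
    print(nonconforming(engines, "SELECT x / 2 FROM t", [(7,), (8,)]))
```

With the inputs 7 and 8, SQLite performs integer division on 7 / 2 while
the toy engine does not, so the toy engine is reported as nonconforming. A
real harness would read and write actual IPC files and compare schemas as
well as values.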