Wes,

Let me see if I understand. I think there are two issues:

1. Ensuring interoperability through conformance, and having people understand what Arrow is and what it is not.
2. Having users adopt the reference implementations and surrounding libraries.

For 1, I agree we should have a way of measuring things here. I think being able to document the requirements of our test suite and have it generate a report on the features supported would go a long way toward letting users understand the quality of both internal and external implementations. It seems like there is still a lot of misunderstanding of what Arrow is and how it relates to other technologies. An example of this is a recent Julia thread [1], which seems to have both some misinformed commentary and potentially some points that we could improve upon as a community. Hopefully, some of this will be helped by separately versioning the specification and the libraries post 1.0.0.

For 2, I agree that having people adopt our code (and hopefully contribute back) is the ideal situation. I think there are likely a few challenges here:

* How amenable the existing libraries are to embedding. Wes, you've started other threads on how to make this adoption easier on the C++ side.
* How much of a value proposition there is in the reference libraries. Arrow has seen good adoption in Python due to its support for Parquet and Feather. I assume that as the dataset and other projects get fleshed out this will lead to further adoption. Conformance to the specification is a feature as well, but I would guess it's less important to many of the end users of pyarrow, who see it as a way of integrating with other non-Arrow technologies.
* Technical limitations of the specification (for example, some processing engines do need alternative encodings like RLE).

Am I understanding your points?

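To make the report idea under point 1 a bit more concrete, here is a rough sketch of what a generated feature matrix could look like. Everything in it is an assumption for illustration only: the feature names, the implementation names (`impl_a`, `impl_b`), the results, and the `feature_report` helper are hypothetical and do not describe the existing integration tests or any real Arrow implementation.

```python
from typing import Dict, List, Optional

# Hypothetical feature names, purely illustrative.
FEATURES: List[str] = ["primitive_types", "nested_types", "dictionary_encoding"]

# Hypothetical per-implementation test outcomes (True = passed, False = failed).
# A missing key means the feature was never exercised for that implementation.
RESULTS: Dict[str, Dict[str, bool]] = {
    "impl_a": {"primitive_types": True, "nested_types": True, "dictionary_encoding": False},
    "impl_b": {"primitive_types": True, "nested_types": False},
}


def feature_report(features: List[str], results: Dict[str, Dict[str, bool]]) -> str:
    """Render a plain-text matrix: one row per feature, one column per implementation."""
    impls = sorted(results)
    width = max(len(f) for f in features)
    # Header row: blank corner cell followed by right-aligned implementation names.
    lines = [" ".join([" " * width] + [f"{impl:>8}" for impl in impls])]
    for feature in features:
        cells = []
        for impl in impls:
            status: Optional[bool] = results[impl].get(feature)
            # Distinguish "not tested" from "tested and failed".
            label = "untested" if status is None else ("yes" if status else "no")
            cells.append(f"{label:>8}")
        lines.append(" ".join([f"{feature:<{width}}"] + cells))
    return "\n".join(lines)


print(feature_report(FEATURES, RESULTS))
```

The one design choice worth noting is that "untested" is kept separate from "no", so a gap in test coverage is not mistaken for a known incompatibility.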
Thanks,
Micah

[1] https://discourse.julialang.org/t/arrow-feather-and-parquet/28739

On Tue, Sep 17, 2019 at 6:00 PM Wes McKinney <wesmck...@gmail.com> wrote:

> On Tue, Sep 17, 2019 at 7:09 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >
> > > Let's take an example:
> > >
> > > * Dremio can execute SQL and uses Arrow as its native runtime format
> > > * Apache Spark can execute SQL and offers UDF support with Arrow
> > > format, i.e. so using Arrow for IO
> > >
> > > Both of these projects can say that they "use Apache Arrow", but the
> > > extent to which Arrow is a key ingredient may not be obvious to the
> > > average onlooker. To have more "Arrow-native" systems seems like one
> > > of the missions of the project.
> > >
> >
> > I'm not following you here. Are you suggesting that these systems are
> > Arrow-native or not Arrow-native? Or that one is and the other is not? What
> > does Arrow-native mean to you?
> >
> > Do you think there is enough problems around this right now that we need to
> > do something? It seems like you're concerned about people claiming they are
> > using Arrow when they aren't quite. Right now, it seems like the community
> > mostly benefits from people saying they are using Arrow. Have you seen
> > situations where users/consumers were frustrated because something was
> > Arrow but not really Arrow?
>
> I think it's good that using Arrow in some way has become a mark of
> quality for systems.
>
> My argument is mostly about brand quality control. Early on in Apache
> Arrow, some people who learned about the project asked me, essentially
> "what's the point of developing reference implementations if everyone
> 'just follows the specification'?". Even now people have said similar
> to me in the context of our occasional difficulties scaling our build
> and packaging, i.e. "why are you making your life so difficult
> building all this systems software, if the specification is all you
> really need to use Arrow?"
>
> In an extreme case, Apache Arrow could be a single Markdown document
> in a git repository describing the Arrow protocol and that's it.
>
> As a project insider who's been overseeing the development of the
> reference implementations, the prospect of a proliferation of
> implementations lacking in integration tests with each other terrifies
> me. This has already happened with the Parquet format in some ways.
>
> One of the raison d'etres of the project is interoperability. I would
> like for people to see "Arrow" and understand what they're getting, or
> at least be advised about where a project falls short of
> interoperability.
>
> - Wes
>