Re: [DISCUSS] Improving Arrow columnar implementation guidelines for third parties

Wes McKinney Thu, 19 Sep 2019 08:40:40 -0700

On Thu, Sep 19, 2019 at 2:01 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Wes,
> Let me see if I understand. I think there are two issues:
> 1.  Ensuring conformance and interoperability, and having people
> understand what Arrow is and what it is not.
> 2.  Having users adopt reference implementations and surrounding libraries.
>
> For 1, I agree we should have a way of measuring things here.  I think
> being able to document the requirements of our test-suite and have it
> generate a report on features supported would go a long way to letting
> users understand the quality of both internal/external implementations.  It
> seems like there is still a lot of misunderstanding of what Arrow is and
> how it relates to other technologies.  An example of this is a recent Julia
> thread [1], which seems to have both some misinformed commentary and
> potentially some points that we could improve upon as a community.
> Hopefully, some of this will be helped by separately versioning the
> specification and the libraries post 1.0.0.

Thanks for the pointer to the thread. I've been trying for a couple of
years to engage with the Julia community.

The bottom line is that I think it's important to highlight that
compatibility or interoperability will not be achieved by hand-waving.
There are a couple of things we can do:

* In our "implementation guidelines", indicate the procedure for third
party implementations to validate themselves against the reference
implementations
* Recommend that third party implementations advertise their feature
coverage and degree of compatibility/integration testing
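
As a rough illustration of the second point, a third-party implementation
could publish a per-feature summary generated from its integration test
results. The sketch below is only that: the feature names, the
FeatureResult structure, and the coverage_report helper are hypothetical
and are not part of any existing Arrow tooling.

```python
from dataclasses import dataclass


@dataclass
class FeatureResult:
    # Hypothetical record of how one format feature fared in integration tests.
    feature: str
    passed: int  # cases that round-tripped against a reference implementation
    total: int   # cases attempted (0 means the feature is not implemented)


def coverage_report(results):
    """Return one plain-text line per feature plus an overall summary line."""
    lines = []
    supported = 0
    for r in results:
        if r.total == 0:
            status = "not implemented"
        elif r.passed == r.total:
            status = "supported"
            supported += 1
        else:
            status = f"partial ({r.passed}/{r.total})"
        lines.append(f"{r.feature:<12} {status}")
    lines.append(f"fully supported: {supported}/{len(results)}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative numbers only; real values would come from a test run.
    example = [
        FeatureResult("primitive", 4, 4),
        FeatureResult("nested", 1, 3),
        FeatureResult("decimal", 0, 0),
    ]
    print(coverage_report(example))
```

A report along these lines would let a user see at a glance which parts of
the format an implementation has actually validated, rather than relying on
a general claim of compatibility.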

> For 2, I agree that having people adopt our code (and hopefully contribute
> back) is the ideal situation.  I think there are likely a few challenges
> here:
> *  How amenable are existing libraries to embedding. Wes, you've started
> other threads on how to make this adoption easier on the C++ side.

Yes, unfortunately I think some people have gotten the mistaken
impression that the complexity involved with creating and deploying a
comprehensive build of everything we have in the project is being
foisted onto each developer or user of the project, no matter how
limited their use of the Arrow format. If that becomes the rationale
for creating third party implementations, that is very sad indeed.

This can be corrected by better developer documentation to clarify the
different "routes":

* Minimal builds, for people who just want to use the columnar format
and protocol. I think having a "zero dependency" out of the box C++
build would help address this (we can discuss this more in the
separate thread). The current out of the box experience may be a bit
off-putting to some users because a number of optional components are
being built by default, see https://github.com/apache/arrow/pull/5431
* Comprehensive builds, for people who are contributing to the Apache
project and need to be able to build everything

I think we have invested our time in documenting the latter at the
expense of the former.

> * How much of a value proposition there is in the reference libraries.
> Arrow has seen good adoption in Python due to its support for Parquet and
> Feather.  I assume as the dataset and other projects get fleshed out this
> will lead to further adoption.  Conformance to the specification is a
> feature as well, but I would guess it's less important to many of the end
> users of pyarrow who see it as a way of integrating with other non-arrow
> technologies.

We're at a significant documentation and communication deficit
relative to the development we've completed in the project. As much as
I'd like to push personally on new features, I'm going to make time to
write more documentation and blog posts to help communicate the value
of the work we've done.

> * Technical limitation of the specification (for example some processing
> engines do need alternative encodings like RLE).

I agree, and so I think it's important that we pursue the "encoded
record batch" proposal and get something codified in the near future.
Once the 0.15.0 release is behind us I hope to take a closer look and
help drive that forward.

> Am I understanding your points?
>
> Thanks,
> Micah
>
> [1] https://discourse.julialang.org/t/arrow-feather-and-parquet/28739
>
> On Tue, Sep 17, 2019 at 6:00 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > On Tue, Sep 17, 2019 at 7:09 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > >
> > > >
> > > > Let's take an example:
> > > >
> > > > * Dremio can execute SQL and uses Arrow as its native runtime format
> > > > * Apache Spark can execute SQL and offers UDF support with Arrow
> > > > format, i.e. so using Arrow for IO
> > > >
> > > > Both of these projects can say that they "use Apache Arrow", but the
> > > > extent to which Arrow is a key ingredient may not be obvious to the
> > > > average onlooker. To have more "Arrow-native" systems seems like one
> > > > of the missions of the project.
> > > >
> > >
> > > I'm not following you here. Are you suggesting that these systems are
> > > Arrow-native or not Arrow-native? Or that one is and the other is not?
> > > What does Arrow-native mean to you?
> > >
> > > Do you think there are enough problems around this right now that we
> > > need to do something? It seems like you're concerned about people
> > > claiming they are using Arrow when they aren't quite. Right now, it
> > > seems like the community mostly benefits from people saying they are
> > > using Arrow. Have you seen situations where users/consumers were
> > > frustrated because something was Arrow but not really Arrow?
> >
> > I think it's good that using Arrow in some way has become a mark of
> > quality for systems.
> >
> > My argument is mostly about brand quality control. Early on in Apache
> > Arrow, some people who learned about the project asked me, essentially
> > "what's the point of developing reference implementations if everyone
> > 'just follows the specification'?". Even now people have said similar
> > things to me in the context of our occasional difficulties scaling our build
> > and packaging, i.e. "why are you making your life so difficult
> > building all this systems software, if the specification is all you
> > really need to use Arrow?"
> >
> > In an extreme case, Apache Arrow could be a single Markdown document
> > in a git repository describing the Arrow protocol and that's it.
> >
> > As a project insider who's been overseeing the development of the
> > reference implementations, the prospect of a proliferation of
> > implementations lacking in integration tests with each other terrifies
> > me. This has already happened with the Parquet format in some ways.
> >
> > One of the raisons d'être of the project is interoperability. I would
> > like for people to see "Arrow" and understand what they're getting, or
> > at least be advised about where a project falls short of
> > interoperability.
> >
> > - Wes
> >
