Re: [DISCUSS] Improving Arrow columnar implementation guidelines for third parties

Micah Kornfield Thu, 19 Sep 2019 00:01:48 -0700

Wes,
Let me see if I understand. I think there are two issues:
1.  Ensuring conformance for the sake of interoperability, and having people
understand what Arrow is and what it is not.
2.  Having users adopt reference implementations and surrounding libraries.

For 1, I agree we should have a way of measuring things here.  I think
being able to document the requirements of our test-suite and have it
generate a report on features supported would go a long way to letting
users understand the quality of both internal and external implementations.  It
seems like there is still a lot of misunderstanding of what Arrow is and
how it relates to other technologies.  An example of this is a recent Julia
thread [1], which seems to have both some misinformed commentary and
potentially some points that we could improve upon as a community.
Hopefully, some of this will be helped by separately versioning the
specification and the libraries post 1.0.0.

For 2, I agree that having people adopt our code (and hopefully contribute
back) is the ideal situation.  I think there are likely a few challenges
here:
* How amenable the existing libraries are to embedding. Wes, you've started
other threads on how to make this adoption easier on the C++ side.
* How much of a value proposition there is in the reference libraries.
Arrow has seen good adoption in Python due to its support for Parquet and
Feather.  I assume that as the dataset and other projects get fleshed out, this
will lead to further adoption.  Conformance to the specification is a
feature as well, but I would guess it's less important to many of the end
users of pyarrow, who see it as a way of integrating with other non-Arrow
technologies.
* Technical limitations of the specification (for example, some processing
engines do need alternative encodings like RLE).

Am I understanding your points?

Thanks,
Micah

[1] https://discourse.julialang.org/t/arrow-feather-and-parquet/28739

On Tue, Sep 17, 2019 at 6:00 PM Wes McKinney <wesmck...@gmail.com> wrote:

> On Tue, Sep 17, 2019 at 7:09 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >
> > >
> > > Let's take an example:
> > >
> > > * Dremio can execute SQL and uses Arrow as its native runtime format
> > > * Apache Spark can execute SQL and offers UDF support with Arrow
> > > format, i.e. using Arrow for IO
> > >
> > > Both of these projects can say that they "use Apache Arrow", but the
> > > extent to which Arrow is a key ingredient may not be obvious to the
> > > average onlooker. To have more "Arrow-native" systems seems like one
> > > of the missions of the project.
> > >
> >
> > I'm not following you here. Are you suggesting that these systems are
> > Arrow-native or not Arrow-native? Or that one is and the other is not?
> > What does Arrow-native mean to you?
> >
> > Do you think there are enough problems around this right now that we need
> > to do something? It seems like you're concerned about people claiming they
> > are using Arrow when they aren't quite. Right now, it seems like the
> > community mostly benefits from people saying they are using Arrow. Have you
> > seen situations where users/consumers were frustrated because something was
> > Arrow but not really Arrow?
>
> I think it's good that using Arrow in some way has become a mark of
> quality for systems.
>
> My argument is mostly about brand quality control. Early on in Apache
> Arrow, some people who learned about the project asked me, essentially
> "what's the point of developing reference implementations if everyone
> 'just follows the specification'?". Even now, people have said similar
> things to me in the context of our occasional difficulties scaling our build
> and packaging, i.e. "why are you making your life so difficult
> building all this systems software, if the specification is all you
> really need to use Arrow?"
>
> In an extreme case, Apache Arrow could be a single Markdown document
> in a git repository describing the Arrow protocol and that's it.
>
> As a project insider who's been overseeing the development of the
> reference implementations, the prospect of a proliferation of
> implementations lacking in integration tests with each other terrifies
> me. This has already happened with the Parquet format in some ways.
>
> One of the raisons d'être of the project is interoperability. I would
> like for people to see "Arrow" and understand what they're getting, or
> at least be advised about where a project falls short of
> interoperability.
>
> - Wes
>
