Hi,

> Whatever its technical faults may be, projects that rely on arrow (such as
> anything based on DataFusion, like my own) need to be supported as they
> have made the bet on Rust Arrow.
>

1.X versioning in Apache Arrow was never meant to represent the stability of
its individual libraries, only the stability of the C++/Python
implementations and the spec. It is a misconception that the Rust
implementation is stable and/or ready for production; its version is aligned
with Apache Arrow's general versioning simply for historical reasons.
Requiring arrow2 to also be marked as stable is imo just dragging this
misconception along.

As the primary developer of arrow2 and a contributor of some of the major
pieces of arrow-rs, I am saying that:

* arrow-rs does not have a stable API: it requires large incompatible
changes to even make it *safe*
* arrow2 does not have a stable API: it requires incompatible changes to
improve UX, performance, and functionality
* using arrow2 core API results in faster, safer, and less error-prone code

The main difference is that arrow-rs requires API changes to its core
modules (buffer, bytes, etc), while arrow2 requires changes to its
peripheral modules (compute and IO). This is why imo we can make arrow2
available: the expected changes will only break a small surface of the public
API and, while incompatible, are easy to address.

Which is the gist of my proposal:

   - Arrow2 starts its releases on crates.io at 0.1
   - A major release (e.g. 0.16.2 -> 1.0.0):
      - must be voted
      - may be backward incompatible
   - Minor releases (e.g. 0.16.1 -> 0.17.0):
      - must be voted
      - may be backward incompatible
      - may have new features
   - Patch releases (e.g. 0.16.1 -> 0.16.2):
      - may be voted
      - must be backward compatible
      - may have new features
   - Minor releases may have a maintenance period (e.g. 3+ months) over
   which we guarantee security patches and feature backports.
   - Major releases have a maintenance period over which we guarantee
   security patches and feature backports according to semver 2.0.
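
To make the rules above concrete, here is a minimal sketch in Rust of how a
version bump could be classified under this proposal. The names (`Version`,
`Bump`, `classify`, `may_break`, `requires_vote`) are hypothetical and not part
of any crate, and the sketch assumes the patch rule reads as "must be backward
compatible" and treats "may be voted" as "a vote is not required".

```rust
/// A plain `major.minor.patch` version (no pre-release or build tags).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Version {
    pub major: u64,
    pub minor: u64,
    pub patch: u64,
}

/// Kind of release implied by moving from one version to a later one.
#[derive(Debug, PartialEq, Eq)]
pub enum Bump {
    Major,
    Minor,
    Patch,
}

impl Version {
    /// Parses strings such as "0.16.2"; returns `None` for anything else.
    pub fn parse(s: &str) -> Option<Version> {
        let mut parts = s.split('.').map(|p| p.parse::<u64>().ok());
        let version = Version {
            major: parts.next()??,
            minor: parts.next()??,
            patch: parts.next()??,
        };
        // Reject trailing components such as "1.2.3.4".
        if parts.next().is_some() {
            None
        } else {
            Some(version)
        }
    }
}

/// Classifies the step `from -> to`; `None` if `to` is not newer than `from`.
pub fn classify(from: Version, to: Version) -> Option<Bump> {
    if to.major > from.major {
        Some(Bump::Major)
    } else if to.major == from.major && to.minor > from.minor {
        Some(Bump::Minor)
    } else if to.major == from.major && to.minor == from.minor && to.patch > from.patch {
        Some(Bump::Patch)
    } else {
        None
    }
}

/// Under the proposal, major and minor releases may be backward incompatible;
/// patch releases may not (assumed reading of the patch rule).
pub fn may_break(bump: &Bump) -> bool {
    matches!(bump, Bump::Major | Bump::Minor)
}

/// Under the proposal, major and minor releases must be voted; a vote on a
/// patch release is optional.
pub fn requires_vote(bump: &Bump) -> bool {
    matches!(bump, Bump::Major | Bump::Minor)
}
```

With these definitions, 0.16.1 -> 0.16.2 classifies as a patch release (no
vote required, no breakage allowed), while 0.16.1 -> 0.17.0 and
0.16.2 -> 1.0.0 both require a vote and may break downstream code.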

So that:

   - it aligns expectations wrt the current state of the Rust
   implementation
   - it offers support to downstream dependencies that require longer-term
   stability
   - it offers room for developers to improve its API, scrutinize security,
   etc.

If we do indeed have an expectation of stability over its whole public
surface, then I suggest that we keep arrow2 in the experimental repo as it
is today.

Btw, this is why some in the Rust community recommend using smaller crates:
so that versioning is not bound to a large public API surface and can thus
more easily be applied to smaller surfaces. There is of course a tradeoff
with maintenance of CI and releases.

Best,
Jorge

On Sat, Jul 17, 2021 at 1:59 PM Andrew Lamb <al...@influxdata.com> wrote:

> What if we released "beta" [1] versions of arrow on cargo at whatever pace
> was necessary? That way dependent crates could opt in to bleeding edge
> functionality / APIs.
>
> There is tension between full technical freedom to change APIs and the
> needs of downstream projects for a more stable API.
>
> Whatever its technical faults may be, projects that rely on arrow (such as
> anything based on DataFusion, like my own) need to be supported as they
> have made the bet on Rust Arrow. I don't think we can abandon maintenance
> on the existing codebase until we have a successor ready.
>
> Andrew
>
> p.s. I personally very much like Adam's suggestion for "Arrow 6.0 in Oct
> 2021 be based on arrow2" but that is predicated on wanting to have arrow2
> widely used by downstreams at that point.
>
> [1]
>
> https://stackoverflow.com/questions/46373028/how-to-release-a-beta-version-of-a-crate-for-limited-public-testing
>
>
> On Sat, Jul 17, 2021 at 5:56 AM Adam Lippai <a...@rigo.sk> wrote:
>
> > 5.0 is being released right now, which means from timing perspective this
> > is the worst moment for arrow2, indeed. You'd need to wait the full 3
> > months. On the other hand does releasing a 6.0 beta based on arrow2 on Aug
> > 1st, rc on Sept 1st and releasing the stable on Oct 1st sound like a bad
> > plan?
> >
> > I don't think a 6.0-beta release would be confusing and dedicating most of
> > the 5.0->6.0 cycle to this change doesn't sound excessive.
> >
> > I think this approach wouldn't result in extra work (backporting the
> > important changes to 5.1,5.2 release). It only shows the magnitude of this
> > change, the work would be done by you anyways, this would just make it
> > clear this is a huge effort.
> >
> > Best regards,
> > Adam Lippai
> >
> > On Sat, Jul 17, 2021, 11:31 Jorge Cardoso Leitão <jorgecarlei...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > Arrow2 and parquet2 have passed the IP clearance vote and are ready to be
> > > merged to apache/* repos.
> > >
> > > My plan is to merge them and PR to both of them to the latest updates on my
> > > own repo, so that I can temporarily (and hopefully permanently) archive the
> > > versions of my account and move development to apache/*.
> > >
> > > Most of the work happening in arrow-rs is backward compatible or simple to
> > > deprecate. However, this situation is different in arrow2 and parquet2. A
> > > release cadence of a major every 3 months is prohibitive at the pace that I
> > > am plowing through.
> > >
> > > The core API (types, alloc, buffer, bitmap, array, mutable array) is imo
> > > stable and not prone to change much, but the non-core API (namely IO and
> > > compute) is prone to change. Examples:
> > >
> > > * Add Scalar API to allow dynamic casting over the aggregate kernels and
> > > parquet statistics
> > > * move compute/ from the arrow crate into a separate crate
> > > * move io/ from the arrow crate into a separate crate
> > > * add option to select encoding based on DataType and field name when
> > > writing to parquet
> > >
> > > (I will create issues for them in the experimental repos for proper
> > > visibility and discussion).
> > >
> > > This situation is usually addressed via the 0.X model in semver 2 (in
> > > Python fastAPI <https://fastapi.tiangolo.com/> is a predominant example
> > > that uses it, and almost all in Rust also uses it). However, there are a
> > > couple of blockers in this context:
> > >
> > > 1. We do not allow releases of experimental repos to avoid confusion over
> > > which is *the* official package.
> > > 2. arrow-rs is at version 5, and some dependencies like IOx/Influx seem to
> > > prefer a slower release cadence of breaking changes.
> > >
> > > On the other hand, other parts of the community do not care about this
> > > aspect. Polars for example, the fastest DataFrame in H2O benchmarks,
> > > currently maintains an arrow2 branch that is faster and safer than master
> > > [1], and will be releasing the Python binaries from the arrow2 branch. We
> > > would like to release the Rust API also based on arrow2, which requires it
> > > to be in Cargo.
> > >
> > > The best “hack” that I can come up with given the constraints above is to
> > > release arrow2 and parquet2 in cargo.io from my personal account so that
> > > dependents can release to cargo while still making it obvious that they are
> > > not the official release. However, this is obviously not ideal.
> > >
> > > Any suggestions?
> > >
> > > [1] https://github.com/pola-rs/polars/pull/922
> > >
> > > Best,
> > > Jorge
> > >
> >
>
