Hi Paddy,

> What do you think about moving Arrow2 into the main Arrow repo where it
> is only enabled via an "experimental" feature flag?

AFAIK this is already possible:
* add `arrow2 = { version = "0.2.0", optional = true }` to Cargo.toml
* add `#[cfg(feature = "arrow2")]\npub use arrow2;\n` to lib.rs

We do this kind of thing to expose APIs from non-arrow crates, such as parts
of the parquet-format-rs crate, and it is generally the way to go when a
crate wants to expose a third-party API.
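
A minimal sketch of the feature-gated re-export described above, assuming
the optional dependency is declared as in the first bullet (the Cargo.toml
part is shown as a comment so the sketch stays in one language). The
`arrow2_enabled` helper is hypothetical and only there to make the effect of
the flag observable; the re-export uses `pub use`, the usual way to expose an
external crate.

```rust
// Cargo.toml of the exposing crate (assumed layout):
//
// [dependencies]
// arrow2 = { version = "0.2.0", optional = true }
//
// An optional dependency implicitly defines a Cargo feature of the same
// name, so downstream users opt in with `features = ["arrow2"]`.

// lib.rs: re-export the third-party crate only when the feature is active.
// Without the feature, this item is compiled out entirely.
#[cfg(feature = "arrow2")]
pub use arrow2;

/// Hypothetical helper: reports whether the re-export was compiled in.
pub fn arrow2_enabled() -> bool {
    cfg!(feature = "arrow2")
}
```

With the flag active, a downstream crate would reach the API through the
exposing crate's path (e.g. `arrow::arrow2::...`), and both crates end up
compiled into the final binary.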

I would not recommend doing this, though: by exposing arrow2 from arrow, we
double the compilation time and binary size of all dependents that
activate the flag. Furthermore, there are users of arrow2 that do not need
the arrow crate, which this model would not support.

AFAIK, where development happens is unrelated to this aspect; Rust enables
this by design.

> but also this would be a clear signal that Arrow2 is <1.0.
> the experimental flag will be a clear signal to the existing Arrow
> community that Arrow2 is the future but that it is <1.0

arrow2 is already <1.0 <https://crates.io/crates/arrow2>. My argument is
that arrow/arrow-flight/parquet are not versioned according to Rust
community standards: it is a de facto practice in Rust to delay major
releases until the API is stable. See e.g. Tokio's blog post about their 1.0
<https://tokio.rs/blog/2020-12-tokio-1-0> ("[...] we commit to holding
back on a Tokio 2.0 release for at least 3 years."). The 10 most downloaded
crates:

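* rand <https://crates.io/crates/rand> (0.8.4)
* syn <https://crates.io/crates/syn> (1.0.74)
* libc <https://crates.io/crates/libc> (0.2.98)
* rand_core <https://crates.io/crates/rand_core> (0.6.3)
* quote <https://crates.io/crates/quote> (1.0.9)
* unicode-xid <https://crates.io/crates/unicode-xid> (0.2.2)
* proc-macro2 <https://crates.io/crates/proc-macro2> (1.0.28)
* cfg-if <https://crates.io/crates/cfg-if> (1.0.0)
* serde <https://crates.io/crates/serde> (1.0.126)
* bitflags <https://crates.io/crates/bitflags> (1.2.1)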

These are small crates with a small scope, but even larger projects share
the same pattern:

* crossbeam <https://crates.io/crates/crossbeam> (0.8.1)
* rocket <https://crates.io/crates/rocket> (0.5)
* polars <https://crates.io/crates/polars> (0.14.8)
* tower <https://crates.io/crates/tower> (0.4.8)
* Tokio <https://crates.io/crates/tokio> (1.9.0)
* hyper <https://crates.io/crates/hyper> (0.14.11)

Crates that arrow depends on
<https://github.com/apache/arrow-rs/blob/master/arrow/Cargo.toml> and that
DataFusion depends on
<https://github.com/apache/arrow-datafusion/blob/master/datafusion/Cargo.toml>
all share the same pattern: 0.X, then 1.X once their API is stable, and 2.X
only when they needed a large change in the API. This contrasts with Apache
Arrow's releases, where we are now at 5.0 (and we have yet to arrive at a
safe design).
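
To make the versioning point concrete, here is a rough sketch (not taken
from Cargo's source) of how Cargo's default caret requirements treat version
bumps: a change in the leftmost non-zero component is what signals a breaking
release. The `Version` struct and the `parse` and `caret_compatible` helpers
are hypothetical names for illustration, and pre-release tags are ignored.

```rust
/// A plain (major, minor, patch) triple; derived ordering is lexicographic.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Version {
    major: u64,
    minor: u64,
    patch: u64,
}

/// Parses "X.Y.Z". Returns None for anything else (e.g. "0.5" or "1.0.0-rc1").
fn parse(s: &str) -> Option<Version> {
    let mut parts = s.split('.').map(|p| p.parse::<u64>().ok());
    let version = Version {
        major: parts.next()??,
        minor: parts.next()??,
        patch: parts.next()??,
    };
    // Reject trailing components such as "1.2.3.4".
    if parts.next().is_some() {
        return None;
    }
    Some(version)
}

/// True if `new` satisfies a caret requirement on `old` (`^old`), i.e. an
/// upgrade from `old` to `new` is treated as non-breaking:
///   ^1.2.3 accepts >=1.2.3, <2.0.0
///   ^0.2.3 accepts >=0.2.3, <0.3.0
///   ^0.0.3 accepts only 0.0.3
fn caret_compatible(old: Version, new: Version) -> bool {
    if new < old {
        return false;
    }
    if old.major > 0 {
        new.major == old.major
    } else if old.minor > 0 {
        new.major == 0 && new.minor == old.minor
    } else {
        new == old
    }
}
```

Under these rules 0.14.8 to 0.14.11 is a compatible upgrade, while 0.8.4 to
0.9.0 and 4.0.0 to 5.0.0 are both breaking. That is why a 0.X crate can change
its API on each minor release, whereas every new major release of a >=1.0
crate tells users to expect breakage.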

> existing users will be well supported in this transition

How so? IMO people will either PR to the arrow/arrow2 code base or they
won't. This is largely independent of where the development of either arrow2
or arrow happens; people google the crate, click on the repository link and
file an issue or open a PR.

> In general, I think the longer that development proceeds in separate
> repos the harder it will be to eventually merge the two in a way that
> supports existing users.

How so? I may be mistaken, but API design is unrelated to which repo the
development happens in: it is primarily driven by who is designing it and
by what or whom they are inspired. Both arrow and parquet's crate design
are inspired by the C++ implementation and have gradually been migrated to
"idiomatic" Rust, as "idiomatic" is becoming better defined in Rust.
Arrow2 is inspired by the current crate and the pains of using it in
DataFusion. Datafuse, a fork of datafusion, recently migrated to arrow2
<https://github.com/datafuselabs/datafuse/pull/1239>: +1,947 −3,484, which
shows that the crate is capturing important patterns from the arrow crate
and exposing ones that are useful / result in less code for the same or
higher performance.

On the opposite side, merging the development of crates under the same repo
leads to: more triaging of PRs; more work for releases and changelogs;
tagging based on crates; multiple READMEs in subpaths of the repo; curation
of the CI to accommodate this; a workspace with many crates, each with its
own set of dependencies, increasing compilation and development time; mixed
commit logs; difficulties in reverts and cherry-picks; and more difficulty
finding things in the repo. See e.g. how tokio-rs does it:
https://github.com/tokio-rs, even for small crates like bytes
<https://github.com/tokio-rs/bytes>.

Best,
Jorge

On Tue, Aug 3, 2021 at 3:13 PM paddy horan <paddyho...@hotmail.com> wrote:

> Hi Jorge,
>
> What do you think about moving Arrow2 into the main Arrow repo where it is
> only enabled via an "experimental" feature flag?  This would allow
> development of Arrow2 to proceed in the main repo but also this would be a
> clear signal that Arrow2 is <1.0.  When we feel ready (i.e. Arrow2 is 1.0)
> we can release it in the next main release with Arrow2 being the default
> and move the existing implementation behind a "legacy" feature flag.
>
> Here is why I think this might work well:
>  - People contributing to the Arrow project will naturally contribute to
> Arrow2.  At the moment, some people will still contribute to Arrow instead
> of Arrow2 just by virtue of it being the "official" implementation.
> However, if both are in one repo people will want to contribute to the
> "future", i.e. Arrow2.
>  - the experimental flag will be a clear signal to the existing Arrow
> community that Arrow2 is the future but that it is <1.0
>  - existing users will be well supported in this transition
>  - In general, I think the longer that development proceeds in separate
> repos the harder it will be to eventually merge the two in a way that
> supports existing users.
>
> Do you think this would work?
>
> Paddy
>
> -----Original Message-----
> From: Jorge Cardoso Leitão <jorgecarlei...@gmail.com>
> Sent: Monday, August 2, 2021 1:59 PM
> To: dev@arrow.apache.org
> Subject: Re: [Discuss] [Rust] Arrow2/parquet2 going foward
>
> Hi,
>
> Sorry for the delay.
>
> If there is a path towards an official release under a <1.0.0 versioning
> schema aligned with the rest of the Rust ecosystem and in line with the
> stability of the API, then IMO we should move all development to within
> Apache experimental asap (I can handle this and the likely IP clearance
> round). If we require a release >=1.X.Y to it and/or a schedule, then I
> prefer to keep expectations aligned and postpone any movement.
>
> Under the move situation, I was thinking of something as follows:
>
> * gradually stop maintaining "arrow" in crates, offering a maintenance
> window over which we release patches (*)
> * work towards achieving feature parity on arrow2/parquet2 on the
> experimental repos.
> * keep releasing arrow2/parquet2 under a 0.X model during the step above
> (**)
> * migrate to arrow-rs and archive experimentals (***)
> * break arrow2 in smaller crates so that we can version the APIs at a
> different cadence
> * once a crate reaches some stability (this is always opinionated, but it
> is fine), we bump it to 1.0 and announce a maintenance plan ala tokio
> <https://tokio.rs/blog/2020-12-tokio-1-0>.
>
> (*) e.g. "we will continue to patch the arrow crate up to at least 6
> months starting after the first release of arrow2 that supports
> a) nested parquet read and write
> b) union array (including IPC integration tests)
> c) map array (including IPC integration tests)"
>
> (**) officially or un-officially (I would suggest officially so that we
> can acknowledge everyone's work on it, but no strong feelings)
>
> (***) something like:
> 1. place arrow2 on top of a clear arrow repo so that the full contribution
> history up to that point is preserved
> 2. make arrow-rs the home of arrow2 (i.e. we start releasing arrow2 from
> arrow-rs) and archive the experimental repos; create arrow-rs-parquet or
> something for parquet2.
>
> In summary, the core pain point for me is the current versioning of arrow,
> which I feel is incompatible with my goals for arrow2 and the ecosystem I
> envision it supporting :)
>
> Best,
> Jorge
>
> On Fri, Jul 30, 2021 at 8:44 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > I think it would also be fine to push "beta" arrow2 crates out of a
> > repo under apache/ so long as they are not marked on crates.io as
> > being Apache-official releases. There's a possible slippery slope
> > there, but as long as we are on a path to formalizing the releases I
> > think it is okay.
> >
> > On Fri, Jul 30, 2021 at 1:07 PM Andrew Lamb <al...@influxdata.com> wrote:
> >
> > > Jorge -- do you feel like we have a resolution on what to do with
> > > arrow2 in the near term?
> > >
> > > The current state of affairs seems to me that arrow2 is released
> > > from https://github.com/jorgecarleitao/arrow2 to crates.io (which is
> > > fine). Are you happy with keeping development in the jorgecarleitao
> > > repo where you will retain maximal control and flexibility until it is
> > > ready to start integrating?
> > >
> > > Or would you prefer to put it into one of the apache repos and
> > > subject its development and release to the normal Arrow governance
> > > model (tarball, vote, etc)?
> > >
> > > Since you are the primary author/architect I think you should have a
> > > substantial say at this stage.
> > >
> > > Andrew
> > >
> > >
> > > On Tue, Jul 27, 2021 at 7:16 PM Andrew Lamb <al...@influxdata.com> wrote:
> > >
> > > > I would be happy with this approach. Thank you for the suggestion
> > > >
> > > > This hybrid approach of both arrow and arrow2 in the same repo
> > > > seems better to me than separate repos.
> > > >
> > > > What I really care about is ensuring we don't have two crates/APIs
> > > > indefinitely -- as long as we are continually making progress
> > > > towards unification that is what is important to me.
> > > >
> > > > Andrew
> > > >
> > > > On Tue, Jul 27, 2021 at 1:40 PM Andy Grove <andygrov...@gmail.com> wrote:
> > > >
> > > >> Apologies for being late to this discussion.
> > > >>
> > > > >> There is a hybrid option to consider here where we add the arrow2
> > > > >> code into the arrow crate as a separate module, so we release one
> > > > >> crate containing the "old" API (which we can mark as deprecated) as
> > > > >> well as the new API. Java did a similar thing a long time ago with
> > > > >> "java.io" versus "java.nio" (new IO).
> > > >>
> > > >> I agree that the versioning wouldn't be ideal, but this seems
> > > >> like it might be a pragmatic compromise?
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Andy.
> > > >>
> > > >>
> > > > >> On Tue, Jul 20, 2021 at 5:41 AM Andrew Lamb <al...@influxdata.com> wrote:
> > > >>
> > > > >> > What I meant is that when you decide arrow2 is suitable for
> > > > >> > release to existing arrow users, I stand ready to help you
> > > > >> > incorporate it into arrow.
> > > > >> >
> > > > >> > All the feedback I have heard so far from the rest of the
> > > > >> > community is that we are ready. One might even say we are anxious
> > > > >> > to do so :)
> > > >> >
> > > >> > Andrew
> > > >> >
> > > >>
> > > >
> > >
> >
>
