Re: [Discuss] [Rust] Arrow2/parquet2 going forward

2021-08-06 Thread Andrew Lamb
I agree with you both.

Users would love to have a project with multi-year maintenance with a
completely stable backwards compatible API (aka what tokio has promised)
that does everything they need.

However, building such software is (very) costly, both initially and then
much more so for the ongoing maintenance. Until there is a need
(demonstrated by the willingness to pay the cost) from users of Rust/Arrow
for such maintenance, I don't see how to make it happen.

Evidence of the lack of demand for longer 'supported' releases, in my mind:
no one I know of has asked for, let alone volunteered to help create, an
arrow-rs maintenance release (e.g. 4.4.1) with just bug fixes. We have
all the process set up to make it happen, but no one cares yet.

I agree with Adam that there is middle ground here and I don't see any
insurmountable incompatibilities in release versions or processes.

Andrew

On Fri, Aug 6, 2021 at 5:31 AM Adam Lippai wrote:

> Hi,
>
> Thanks for the detailed answer.
>
> In contrast to my previous email, my opinionated part:
>
> Generally I like the idea of smaller crates, it helps with a lot of stuff
> (different targets, build time), but those benefits can be achieved by
> feature gates too.
> The upside would be out-of-sync crate releases.
>
> Maintenance is important, historically speaking I've seen it solved for
> open source by private companies offering it as a paid service.
> You are right that currently only 3 months of support is provided for free,
> but personally I don't see that as an issue.
> There are professional libraries and software with close to 100% market
> share in their field which support the last or last two versions only
> (Chrome, OS-es, compilers).
> I find it hard to imagine we'd want to do it *better*, that sounds like an
> illusion, but I'd like to be wrong on this one :)
> Professionally speaking, when picking projects, having Apache (or other)
> governance and community is more important for the businesses I worked
> with, than the release schedule or API stability / versioning.
>
>
> Based on the above and that there are about a dozen active Rust arrow
> contributors, any promise for reliable maintenance over years would be a
> lie in my eyes.
> DataFusion, Polars, odbc2parquet and others had issues with the changes
> being too slow, not too fast.
>
> I'm a big advocate of middle grounds and I still believe that your efforts
> and ideal setup are compatible with arrow-rs; nobody would stop you creating
> a 5.23.0 release next to the 6.1.0 if you'd want to backport anything and
> nobody would stop you cutting an out-of-schedule 6.2 or even 7.0 release if
> it's to ensure security. The frequent Apache release process - which we
> were afraid of - was smooth so far, with surprisingly nice support from
> members of different languages / implementations.
>
> Also I believe that any plan you'd have turning arrow2 into arrow-rs 6.0
> would be more than welcome on a public vote, along with the technical
> changes you propose (e.g. cutting a separate arrow-io crate).
>
>
> At least 6 key members showed their excitement for your changes in this
> thread and even more on Slack/GitHub ;)
>
> Best regards,
> Adam Lippai
>
> On Fri, Aug 6, 2021 at 10:07 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > Thanks for your input.
> >
> > Every time there is a new major release, all new development shifts towards
> > that new API and users of previous APIs are left behind. It is not just a
> > matter of SemVer and size of version numbers, there is a whole development
> > shift to be on top of the new API.
> >
> > I disagree that software that has a major release every 3 months and no
> > maintenance window over previous versions is stable. I alluded to the Tokio
> > example because Tokio 1.0 recently became the runtime of rust-based AWS
> > lambda functions [1]; this commitment is only possible by enforcing API
> > stability and maintenance beyond a 3 month period (at least 3 years in
> > their case).
> >
> > Also, imo the current major version number is not meaningless: divided by
> > the software age, it constitutes the historical release pattern and is
> > usually a good predictor of the pattern used in future releases.
> >
> > The evidence is that we haven't been able to support any version for any
> > period of time; recently, Andrew has been doing amazing work at supporting
> > the latest version for a period of 3 months. I.e. an application that
> > depends on `arrow = ^5.0` has a support window of 3 months. Given that we
> > have not backported any security fixes to previous versions, it is
> > reasonable to assume that security patches are also applied within a 3
> > month period only.
> >
> > As a contributor to arrow2, I would rather not have arrow2 under Apache Arrow
> > than have to release it under its current versioning and scheduling (this
> > is similar to some of Julia's concerns). As a contributor to the Apache
> > Arr

Re: [Discuss] [Rust] Arrow2/parquet2 going forward

2021-08-06 Thread Adam Lippai
Hi,

Thanks for the detailed answer.

In contrast to my previous email, my opinionated part:

Generally I like the idea of smaller crates, it helps with a lot of stuff
(different targets, build time), but those benefits can be achieved by
feature gates too.
The upside would be out-of-sync crate releases.

Maintenance is important, historically speaking I've seen it solved for
open source by private companies offering it as a paid service.
You are right that currently only 3 months of support is provided for free,
but personally I don't see that as an issue.
There are professional libraries and software with close to 100% market
share in their field which support the last or last two versions only
(Chrome, OS-es, compilers).
I find it hard to imagine we'd want to do it *better*, that sounds like an
illusion, but I'd like to be wrong on this one :)
Professionally speaking, when picking projects, having Apache (or other)
governance and community is more important for the businesses I worked
with, than the release schedule or API stability / versioning.


Based on the above and that there are about a dozen active Rust arrow
contributors, any promise for reliable maintenance over years would be a
lie in my eyes.
DataFusion, Polars, odbc2parquet and others had issues with the changes
being too slow, not too fast.

I'm a big advocate of middle grounds and I still believe that your efforts
and ideal setup are compatible with arrow-rs; nobody would stop you creating
a 5.23.0 release next to the 6.1.0 if you'd want to backport anything and
nobody would stop you cutting an out-of-schedule 6.2 or even 7.0 release if
it's to ensure security. The frequent Apache release process - which we
were afraid of - was smooth so far, with surprisingly nice support from
members of different languages / implementations.

Also I believe that any plan you'd have turning arrow2 into arrow-rs 6.0
would be more than welcome on a public vote, along with the technical
changes you propose (e.g. cutting a separate arrow-io crate).


At least 6 key members showed their excitement for your changes in this
thread and even more on Slack/GitHub ;)

Best regards,
Adam Lippai

On Fri, Aug 6, 2021 at 10:07 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> Thanks for your input.
>
> Every time there is a new major release, all new development shifts towards
> that new API and users of previous APIs are left behind. It is not just a
> matter of SemVer and size of version numbers, there is a whole development
> shift to be on top of the new API.
>
> I disagree that software that has a major release every 3 months and no
> maintenance window over previous versions is stable. I alluded to the Tokio
> example because Tokio 1.0 recently became the runtime of rust-based AWS
> lambda functions [1]; this commitment is only possible by enforcing API
> stability and maintenance beyond a 3 month period (at least 3 years in
> their case).
>
> Also, imo the current major version number is not meaningless: divided by
> the software age, it constitutes the historical release pattern and is
> usually a good predictor of the pattern used in future releases.
>
> The evidence is that we haven't been able to support any version for any
> period of time; recently, Andrew has been doing amazing work at supporting
> the latest version for a period of 3 months. I.e. an application that
> depends on `arrow = ^5.0` has a support window of 3 months. Given that we
> have not backported any security fixes to previous versions, it is
> reasonable to assume that security patches are also applied within a 3
> month period only.
>
> As a contributor to arrow2, I would rather not have arrow2 under Apache Arrow
> than have to release it under its current versioning and scheduling (this
> is similar to some of Julia's concerns). As a contributor to Apache
> Arrow, I currently cannot guarantee a maintenance window over arrow-rs for
> any period of time because it is unsafe by design and I do not have the
> motivation to fix it. As both, I am confident that the core arrow2 will
> soon reach a point where we can live with and develop on top of it for at
> least a year. This is not true of the whole API surface, though: there are
> APIs that we will need to change more often until stability can be
> promised.
>
> So, I am requesting that we tie the discussion of arrow2 to how it will be
> released.
>
> Could a middle ground be somewhere along the lines of splitting the crate
> into smaller crates that are versioned independently? I.e. continue to
> release `arrow` under the same versioning and cadence, and create 3 new
> crates, arrow-core, arrow-compute, and arrow-io (see also [2]) that would
> have their own versioning at 0.X until stability is achieved, based on
> arrow2's code base. The migration of the `arrow` crate to arrow2's API
> would be to re-export from the smaller crates (e.g. `pub use
> arrow_core::array`).
>
> [1] https://crates.io/crates/lambda_runtim

Re: [Discuss] [Rust] Arrow2/parquet2 going forward

2021-08-06 Thread Jorge Cardoso Leitão
Hi,

Thanks for your input.

Every time there is a new major release, all new development shifts towards
that new API and users of previous APIs are left behind. It is not just a
matter of SemVer and size of version numbers, there is a whole development
shift to be on top of the new API.

I disagree that software that has a major release every 3 months and no
maintenance window over previous versions is stable. I alluded to the Tokio
example because Tokio 1.0 recently became the runtime of rust-based AWS
lambda functions [1]; this commitment is only possible by enforcing API
stability and maintenance beyond a 3 month period (at least 3 years in
their case).

Also, imo the current major version number is not meaningless: divided by
the software age, it constitutes the historical release pattern and is
usually a good predictor of the pattern used in future releases.

The evidence is that we haven't been able to support any version for any
period of time; recently, Andrew has been doing amazing work at supporting
the latest version for a period of 3 months. I.e. an application that
depends on `arrow = ^5.0` has a support window of 3 months. Given that we
have not backported any security fixes to previous versions, it is
reasonable to assume that security patches are also applied within a 3
month period only.
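
To make the `arrow = ^5.0` point concrete, here is a minimal sketch of how a
caret requirement selects versions. It is a simplified reimplementation for
illustration only, not Cargo's resolver: `Version` and `caret_matches` are
hypothetical names, and pre-release tags and partial versions are ignored.

```rust
/// A (major, minor, patch) version triple. Hypothetical helper type,
/// not the `semver` crate's `Version`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Version {
    pub major: u64,
    pub minor: u64,
    pub patch: u64,
}

impl Version {
    pub const fn new(major: u64, minor: u64, patch: u64) -> Self {
        Version { major, minor, patch }
    }
}

/// Returns true if `candidate` satisfies the caret requirement `^base`.
/// Simplified rules (assumption: full triples, no pre-release tags):
/// - major > 0: same major, and candidate >= base
/// - 0.minor with minor > 0: same 0.minor, and candidate >= base
/// - 0.0.patch: exact match only
pub fn caret_matches(base: Version, candidate: Version) -> bool {
    if base.major > 0 {
        candidate.major == base.major && candidate >= base
    } else if base.minor > 0 {
        candidate.major == 0 && candidate.minor == base.minor && candidate >= base
    } else {
        candidate == base
    }
}
```

Under these rules `caret_matches(Version::new(5, 0, 0), Version::new(5, 3, 0))`
holds while the same requirement rejects `6.0.0`, so a fix that only lands in
6.x never reaches an application pinned to `^5.0`.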

As a contributor to arrow2, I would rather not have arrow2 under Apache Arrow
than have to release it under its current versioning and scheduling (this
is similar to some of Julia's concerns). As a contributor to Apache
Arrow, I currently cannot guarantee a maintenance window over arrow-rs for
any period of time because it is unsafe by design and I do not have the
motivation to fix it. As both, I am confident that the core arrow2 will
soon reach a point where we can live with and develop on top of it for at
least a year. This is not true of the whole API surface, though: there are
APIs that we will need to change more often until stability can be promised.

So, I am requesting that we tie the discussion of arrow2 to how it will be
released.

Could a middle ground be somewhere along the lines of splitting the crate
into smaller crates that are versioned independently? I.e. continue to
release `arrow` under the same versioning and cadence, and create 3 new
crates, arrow-core, arrow-compute, and arrow-io (see also [2]) that would
have their own versioning at 0.X until stability is achieved, based on
arrow2's code base. The migration of the `arrow` crate to arrow2's API
would be to re-export from the smaller crates (e.g. `pub use
arrow_core::array`).
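
A rough single-file sketch of that re-export idea, with modules standing in
for the separate crates. Everything inside `arrow_core` here (the toy
`Int32Array` and its `null_count`) is made up for illustration and is not
arrow2's actual API; only the `pub use` facade pattern is the point.

```rust
// Stand-in for the proposed `arrow-core` crate (hypothetical contents).
pub mod arrow_core {
    pub mod array {
        /// Toy array type, for illustration only.
        #[derive(Debug, PartialEq)]
        pub struct Int32Array(pub Vec<Option<i32>>);

        impl Int32Array {
            /// Number of null (None) slots in the array.
            pub fn null_count(&self) -> usize {
                self.0.iter().filter(|v| v.is_none()).count()
            }
        }
    }
}

// Stand-in for the existing `arrow` crate: it owns no array code itself
// and only re-exports the module from the smaller crate.
pub mod arrow {
    pub use super::arrow_core::array;
}
```

With this layout `arrow::array::Int32Array` and
`arrow_core::array::Int32Array` are the same type, so users of `arrow` keep
their import paths while the smaller crate carries its own version number.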

[1] https://crates.io/crates/lambda_runtime/0.3.1/dependencies
[2] https://github.com/jorgecarleitao/arrow2/issues/257

Best,
Jorge


On Thu, Aug 5, 2021 at 11:53 PM Adam Lippai wrote:

> Not taking sides, just two technical notes below.
>
> Semver.org clearly defines (
> https://semver.org/#how-do-i-know-when-to-release-100) the versions >1.0.0.
> * If it's used in production, it's 1.0.0.
> * If it provides an API others depend on then it's 1.0.0.
> * If you intend to keep backward compatibility, it's 1.0.0.
> TL;DR: 1.0.0 represents the version from which point on we guarantee that
> non-production releases are marked (alpha, beta, rc) and that breaking (API)
> changes, i.e. backwards-incompatible changes, result in a major version bump.
> This we already do, 4x per year.
>
> The second fact is that arrow2 uses the arrow name, but it doesn't have
> Apache governance. It's not released from GitHub.com/apache, there are no
> formal releases, there are no votes. This is not correct or fair usage of
> the brand (on the same level as DataFuse, or db-benchmark calling a custom
> R implementation arrow) even if it's "unofficial". My understanding is that
> arrow2 can be an unofficial implementation with a different name or an
> arrow-rs experiment with the intention to merge the code, but not both.
>
> I think both issues could be solved and I really value and like the arrow2
> work so far. That's the right way. I hope we'll see it in prod either way
> as soon as it's ready.
>
> Best regards,
> Adam Lippai
>
> On Wed, Aug 4, 2021, 08:25 QP Hou wrote:
>
> > Just my two cents.
> >
> > I think we all have the same goal here, which is to accelerate the
> > transitioning of arrow to arrow2 as the official arrow rust
> > implementation.
> >
> > In my opinion, the biggest gain we can get from merging two projects
> > into one repo is to have some kind of a policy to enforce that every
> > new feature/test added to the current arrow implementation also needs
> > to be added to the arrow2 implementation. This way, we can make sure
> > the gap between arrow and arrow2 is closing on every iteration.
> > Without this, I tend to agree with Jorge that merging two repos would
> > add more overhead to his work and slow him down.
> >
> > For those who want to contribute to arrow2 to accelerate the
> > transition, I don't think they would have a problem sending PRs to the
> > arrow2 repo. For th