Re: [Discuss] [Rust] Arrow2/parquet2 going foward
I agree with you both.

Users would love to have a project with multi-year maintenance and a
completely stable, backwards-compatible API (aka what tokio has promised)
that does everything they need. However, building such software is (very)
costly, both initially and then much more so for the ongoing maintenance.
Until there is a need (demonstrated by the willingness to pay the cost)
from users of Rust/Arrow for such maintenance, I don't see how to make it
happen.

Evidence of the lack of demand for longer 'supported' releases, in my
mind: no one I know of has asked for, let alone volunteered to help
create, an arrow-rs maintenance release (e.g. 4.4.1) with just bug fixes.
We have all the process set up to make it happen, but no one cares yet.

I agree with Adam that there is middle ground here and I don't see any
insurmountable incompatibilities in release versions or processes.

Andrew
Re: [Discuss] [Rust] Arrow2/parquet2 going foward
Hi,

Thanks for the detailed answer.

In contrast to my previous email, my opinionated part:

Generally I like the idea of smaller crates. It helps with a lot of stuff
(different targets, build time), but those benefits can be achieved by
feature gates too. The added upside of separate crates would be
out-of-sync crate releases.

Maintenance is important. Historically speaking, I've seen it solved for
open source by private companies offering it as a paid service. You are
right that currently only 3 months of support is provided for free, but
personally I don't see that as an issue. There are professional libraries
and software with close to 100% market share in their field which support
the last or last two versions only (Chrome, OS-es, compilers). I find it
hard to imagine we'd want to do it *better*; that sounds to be an
illusion, but I'd like to be wrong on this one :)
Professionally speaking, when picking projects, having Apache (or other)
governance and community is more important for the businesses I worked
with than the release schedule or API stability / versioning.

Based on the above, and given that there are about a dozen active Rust
arrow contributors, any promise of reliable maintenance over years would
be a lie in my eyes. DataFusion, Polars, odbc2parquet and others had
issues with the changes being too slow, not too fast.

I'm a big advocate of middle grounds and I still believe that your
efforts and ideal setup are compatible with arrow-rs. Nobody would stop
you creating a 5.23.0 release next to the 6.1.0 if you'd want to backport
anything, and nobody would stop you cutting an out-of-schedule 6.2 or even
7.0 release if it's to ensure security. The frequent Apache release
process, which we were afraid of, has been smooth so far, with
surprisingly nice support from members of different languages /
implementations.

Also, I believe that any plan you'd have for turning arrow2 into arrow-rs
6.0 would be more than welcome on a public vote, along with the technical
changes you propose (e.g. cutting a separate arrow-io crate).

At least 6 key members showed their excitement for your changes in this
thread, and even more on Slack/GitHub ;)

Best regards,
Adam Lippai
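As an illustration of the feature-gate alternative to smaller crates mentioned in this message, here is a minimal sketch. The module names and the `compute` feature are hypothetical and are not taken from arrow-rs or arrow2; in a real crate the feature would be declared under `[features]` in Cargo.toml and switched on with `--features compute`.

```rust
// Always compiled: the core data structures (hypothetical module).
pub mod array {
    pub fn len(values: &[i32]) -> usize {
        values.len()
    }
}

// Compiled only when the hypothetical `compute` feature is enabled.
#[cfg(feature = "compute")]
pub mod compute {
    pub fn sum(values: &[i32]) -> i32 {
        values.iter().sum()
    }
}

/// Lists the modules that were compiled into this build.
pub fn enabled_modules() -> Vec<&'static str> {
    let mut modules = vec!["array"];
    if cfg!(feature = "compute") {
        modules.push("compute");
    }
    modules
}

fn main() {
    let values = [1, 2, 3];
    println!("len = {}", array::len(&values));

    // This statement only exists in builds with the feature enabled.
    #[cfg(feature = "compute")]
    println!("sum = {}", compute::sum(&values));

    println!("enabled modules: {:?}", enabled_modules());
}
```

Feature gates trim build time and target-specific code in this way, but every gated module still ships under the single version number of the one crate, which is the trade-off against independently released crates.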
Re: [Discuss] [Rust] Arrow2/parquet2 going foward
Hi,

Thanks for your input.

Every time there is a new major release, all new development shifts
towards that new API and users of previous APIs are left behind. It is
not just a matter of SemVer and the size of version numbers; there is a
whole development shift to be on top of the new API.

I disagree that software that has a major release every 3 months and no
maintenance window over previous versions is stable. I alluded to the
Tokio example because Tokio 1.0 recently became the runtime of Rust-based
AWS Lambda functions [1]; this commitment is only possible by enforcing
API stability and maintenance beyond a 3-month period (at least 3 years in
their case).

Also, imo the current major version number is not meaningless: divided by
the software age, it constitutes the historical release pattern and is
usually a good predictor of the pattern used in future releases.

The evidence is that we haven't been able to support any version for any
period of time; recently, Andrew has been doing amazing work at supporting
the latest version for a period of 3 months. I.e. an application that
depends on `arrow = ^5.0` has a support window of 3 months. Given that we
have not backported any security fixes to previous versions, it is
reasonable to assume that security patches are also applied within a
3-month period only.

As a contributor to arrow2, I would rather not have arrow2 under Apache
Arrow than have to release it under its current versioning and scheduling
(this is similar to some of Julia's concerns). As a contributor to Apache
Arrow, I currently cannot guarantee a maintenance window over arrow-rs for
any period of time because it is unsafe by design and I do not have the
motivation to fix it. As both, I am confident that the core of arrow2 will
soon reach a point where we can live with it and develop on top of it for
at least a year. This is not true of the whole API surface, though: there
are APIs that we will need to change more often until stability can be
promised.
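To make the `arrow = ^5.0` point concrete, here is a minimal sketch of how a caret requirement selects versions, following the rules Cargo documents for caret requirements. The `Version` alias and both function names are made up for this illustration; the sketch only handles full `major.minor.patch` triples and ignores pre-release tags, so `^5.0` is written as `(5, 0, 0)`.

```rust
/// (major, minor, patch); pre-release and build metadata are ignored here.
type Version = (u64, u64, u64);

/// Exclusive upper bound of a caret requirement: the leftmost non-zero
/// component is the one that must not change.
fn caret_upper_bound(req: Version) -> Version {
    match req {
        (0, 0, patch) => (0, 0, patch + 1),
        (0, minor, _) => (0, minor + 1, 0),
        (major, _, _) => (major + 1, 0, 0),
    }
}

/// True if `candidate` satisfies the caret requirement `^req`.
fn caret_matches(req: Version, candidate: Version) -> bool {
    // Tuples compare lexicographically, which matches version ordering
    // when pre-release tags are left out.
    candidate >= req && candidate < caret_upper_bound(req)
}

fn main() {
    let req = (5, 0, 0); // `arrow = ^5.0`
    for candidate in [(5, 0, 0), (5, 2, 1), (6, 0, 0)] {
        println!("^5.0 accepts {:?}: {}", candidate, caret_matches(req, candidate));
    }
}
```

Under these rules an application pinned to `^5.0` never picks up 6.0.0 automatically, so once development moves to the next major version it only receives fixes that are backported to a 5.x release. The same rules also mean that for 0.X crates every minor bump counts as breaking.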
So, I am requesting that we tie the discussion of arrow2 to how it will be
released.

Could a middle ground be somewhere along the lines of splitting the crate
into smaller crates that are versioned independently? I.e. continue to
release `arrow` under the same versioning and cadence, and create 3 new
crates, arrow-core, arrow-compute, and arrow-io (see also [2]), that would
have their own versioning at 0.X until stability is achieved, based on
arrow2's code base. The migration of the `arrow` crate to arrow2's API
would be to re-export from the smaller crates
(e.g. `pub use arrow_core::array`).

[1] https://crates.io/crates/lambda_runtime/0.3.1/dependencies
[2] https://github.com/jorgecarleitao/arrow2/issues/257

Best,
Jorge

On Thu, Aug 5, 2021 at 11:53 PM Adam Lippai wrote:

> Not taking sides, just two technical notes below.
>
> Semver.org clearly defines
> (https://semver.org/#how-do-i-know-when-to-release-100) the versions
> >1.0.0.
> * If it's used in production, it's 1.0.0.
> * If it provides an API others depend on, then it's 1.0.0.
> * If you intend to keep backward compatibility, it's 1.0.0.
> TL;DR: 1.0.0 represents the version from which point we guarantee that
> non-production releases are marked (alpha, beta, rc) and that breaking
> (API) changes, i.e. backwards-incompatible changes, result in a major
> version bump. This we already do, 4x per year.
>
> The second fact is that arrow2 uses the arrow name, but it doesn't have
> Apache governance. It's not released from github.com/apache, there are no
> formal releases, there are no votes. This is not correct or fair usage of
> the brand (on the same level as DataFuse, or db-benchmark calling a custom
> R implementation arrow), even if it's "unofficial". My understanding is
> that arrow2 can be an unofficial implementation with a different name or
> an arrow-rs experiment with the intention to merge the code, but not both.
>
> I think both issues could be solved and I really value and like the arrow2
> work so far. That's the right way. I hope we'll see it in prod either way
> as soon as it's ready.
>
> Best regards,
> Adam Lippai
>
> On Wed, Aug 4, 2021, 08:25 QP Hou wrote:
>
> > Just my two cents.
> >
> > I think we all have the same goal here, which is to accelerate the
> > transitioning of arrow to arrow2 as the official arrow rust
> > implementation.
> >
> > In my opinion, the biggest gain we can get from merging two projects
> > into one repo is to have some kind of a policy to enforce that every
> > new feature/test added to the current arrow implementation also needs
> > to be added to the arrow2 implementation. This way, we can make sure
> > the gap between arrow and arrow2 is closing on every iteration.
> > Without this, I tend to agree with Jorge that merging two repos would
> > add more overhead to his work and slow him down.
> >
> > For those who want to contribute to arrow2 to accelerate the
> > transition, I don't think they would have a problem sending PRs to the
> > arrow2 repo. For th
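The re-export migration proposed in Jorge's message (`pub use arrow_core::array`) could look roughly like the sketch below. It is a single-file stand-in: plain modules play the role of the separately versioned crates, and the `Int32Array` type and `sum` function are hypothetical placeholders, not the actual arrow2 API.

```rust
// Stand-in for a separately versioned `arrow-core` crate (hypothetical contents).
mod arrow_core {
    pub mod array {
        #[derive(Debug, PartialEq)]
        pub struct Int32Array(pub Vec<i32>);

        impl Int32Array {
            pub fn len(&self) -> usize {
                self.0.len()
            }
        }
    }
}

// Stand-in for a separately versioned `arrow-compute` crate built on the core.
mod arrow_compute {
    use super::arrow_core::array::Int32Array;

    pub fn sum(values: &Int32Array) -> i32 {
        values.0.iter().sum()
    }
}

// The `arrow` facade: no logic of its own, only re-exports.
mod arrow {
    pub use super::arrow_core::array;

    pub mod compute {
        pub use super::super::arrow_compute::sum;
    }
}

fn main() {
    // Users of the facade keep writing `arrow::...` paths.
    let values = arrow::array::Int32Array(vec![1, 2, 3]);
    println!("len = {}", values.len());
    println!("sum = {}", arrow::compute::sum(&values));

    // The facade path and the core path name the same type.
    let same: arrow_core::array::Int32Array = arrow::array::Int32Array(vec![]);
    println!("{:?}", same);
}
```

Because a re-export names the same item rather than a copy, code that depends on the facade and code that depends on a smaller crate directly can exchange values without conversion, as long as both resolve to the same version of the underlying crate.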