I don't have a very strong opinion on new repo vs branch, but having a new
repo seems simpler and lower-overhead to me. I think it makes this effort
far more visible to the community and is more likely to get more people
involved.

Looking at this purely from the DataFusion/Ballista point of view, I would
be interested in having a branch of DF that uses arrow2; once that branch
has all tests passing and can run queries with performance at least as good
as the original arrow crate, we cut over.

However, for developers using the arrow APIs directly, I don't see an easy
path. We either try to gradually PR the changes in (which seems really
hard given that there are significant changes to APIs and internal data
structures) or we port some portion of the existing tests over to arrow2
and then make that the official crate once all tests pass.

If I were doing the work, I think I would try to verify the new crate and
cut over rather than gradually PR the changes in, assuming that the
community is behind this approach and sees the value of the new crate.

Andy.

On Thu, May 27, 2021 at 6:58 AM Wes McKinney <wesmck...@gmail.com> wrote:

> I think given the size and scope of the work, there's a stronger
> argument for having an IP clearance for this code (as compared with
> python-datafusion).
>
> On Thu, May 27, 2021 at 5:45 AM Andrew Lamb <al...@influxdata.com> wrote:
> >
> > I am not opposed to a new repo.
> >
> > However, I believe that the largest barrier to the community really
> > getting their heads around / evaluating arrow2 is its sheer size. A -92k
> > +57k diff isn't something I am likely to get my head around in any level
> > of detail until I actively work with it for a while.
> >
> > The best way to get community input, I think, is to start the process of
> > getting arrow2 into arrow-rs via PRs. While splitting it up into multiple
> > PRs is likely not easy and would require much more work, starting to get
> > this work into arrow-rs in smaller chunks would be the ideal outcome in
> > my opinion.
> >
> > Therefore, I don't see any benefit to a new repo -- I think a branch in
> > arrow-rs (or a fork) would work just as well. But again, I am not opposed
> > to a new repo either.
> >
> > Andrew
> >
> > On Wed, May 26, 2021 at 3:47 AM Fernando Herrera <
> > fernando.j.herr...@gmail.com> wrote:
> >
> > > Thanks Jorge for the update and the continuous development on a
> > > safer version of arrow.
> > >
> > > I would like to give my support for option 3 as well. IMHO it will give
> > > arrow2 the exposure it needs to be considered by a wider set of
> > > users. This exposure will open the possibility of receiving more
> > > participation regarding missing features required to integrate
> > > arrow2 into datafusion and ballista.
> > >
> > > It will also give peace of mind to arrow users that arrow2 will
> > > follow the Apache way, meaning that its development will be stable,
> > > supported and community driven. I have dealt with this issue myself,
> > > having suggested the use of arrow2 for new projects, only for it
> > > to be discarded because of the impression that it isn't supported
> > > by the Apache community, even after seeing the advantages the
> > > project presents.
> > >
> > >
> > > Fernando
> > >
> > > On Wed, May 26, 2021 at 6:38 AM Jorge Cardoso Leitão <
> > > jorgecarlei...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to offer an update on this. I continued to work heavily
> > > > on this hypothesis in separate repos [1, 2], as it required a
> > > > ground-up refactor.
> > > >
> > > > Following is the current status; at the end are some options that we
> > > > can think about.
> > > >
> > > > TL;DR: I would like to gauge the community's interest in making
> > > > arrow2 and parquet2 experimental repos in Apache Arrow. IMO they are
> > > > safer, faster, more maintainable and equally compatible with both the
> > > > arrow spec and the parquet spec.
> > > >
> > > > # Specification and interoperability
> > > >
> > > > IPC:
> > > >
> > > > All integration tests that run in 4.1.0 pass against
> > > > apache/arrow@master. Furthermore, it passes the following tests that
> > > > 4.1.0 does not:
> > > > * big endian IPC producer and consumer
> > > > * decimal128
> > > > * interval
> > > > * nested_large_offsets
> > > >
> > > > FFI: All integration tests that 4.1.0 runs pass against pyarrow==4.0.
> > > >
> > > > Arrow-Flight: it is the same code; I am not sure whether the tests in
> > > > 4.1.0 are passing or skipped.
> > > >
> > > > parquet: arrow2 is tested against parquet files generated by pyarrow
> > > > under different configurations:
> > > > * physical and logical types
> > > > * page versions
> > > > * repetition levels
> > > > * dictionary encoding
> > > >
> > > > # Safety
> > > >
> > > > * arrow2 addresses all security vulnerabilities (all 20+ of our
> > > > issues labeled with "security" [3], plus more currently not
> > > > encapsulated in any issue) and unsafety issues. In particular,
> > > > * all tests pass under MIRI checks
> > > > * all unsafe APIs are marked as unsafe
> > > > * parquet2 does not use unsafe
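("All unsafe APIs are marked as unsafe" refers to the standard Rust convention; as a hedged, generic sketch — not arrow2's actual types — it looks like this:)

```rust
// Generic illustration of the convention (not arrow2's real API): any
// accessor whose misuse can cause undefined behaviour is declared `unsafe`,
// so callers must explicitly opt in and uphold the documented invariant.
struct TypedBuffer {
    data: Vec<u8>,
}

impl TypedBuffer {
    /// Safe accessor: bounds-checked, returns None when out of range.
    fn get(&self, i: usize) -> Option<u8> {
        self.data.get(i).copied()
    }

    /// # Safety
    /// The caller must guarantee that `i < self.data.len()`.
    unsafe fn get_unchecked(&self, i: usize) -> u8 {
        *self.data.get_unchecked(i)
    }
}

fn main() {
    let buf = TypedBuffer { data: vec![1, 2, 3] };
    assert_eq!(buf.get(1), Some(2));
    assert_eq!(buf.get(9), None); // out of range, but safe
    // The unchecked path requires an explicit unsafe block:
    assert_eq!(unsafe { buf.get_unchecked(0) }, 1);
    println!("ok");
}
```

This is also what makes the MIRI point above meaningful: with the unsafe surface explicit and small, the interpreter can check the remaining unsafe code paths exhaustively.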
> > > >
> > > > # Maintenance
> > > >
> > > > * arrow + parquet has a total of 56k+36k LOC, excluding headers
> > > > * arrow2 + parquet2 has a total of 50k+7k LOC, excluding headers
> > > > * arrow2 coverage is 76%, arrow is 88%
> > > > * parquet2 is "unsafe"-free
> > > >
> > > > # Features
> > > >
> > > > Non-spec-wise (e.g. compute, utils, parquet), the crate has about
> > > > 90% of all the features in 4.0. What is missing:
> > > > * nested read and write parquet (lists and structs)
> > > > * missing some new features since 4.0.0
> > > >
> > > > OTOH, it has the following additional features:
> > > >
> > > > * API to read CSV in parallel
> > > > * API to read parquet in parallel
> > > > * checked_X, saturating_X, overflowing_X operations (i.e. non-panic
> > > > versions of add, subtract, etc.)
> > > > * arithmetics ops over dates, timestamps and durations
> > > > * display for every logical type
> > > > * more cast operations
> > > > * merge-sort kernel
> > > > * vectorized hashing
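(For readers unfamiliar with the naming: the checked_X / saturating_X / overflowing_X kernels mirror Rust's scalar integer methods applied element-wise. A minimal sketch of the scalar semantics, using plain Rust rather than arrow2's array API:)

```rust
// Plain Rust integer methods illustrate the semantics that the checked_X /
// saturating_X / overflowing_X kernels extend element-wise over arrays.
fn main() {
    let a: i32 = i32::MAX;
    // checked_add: returns None on overflow instead of panicking
    assert_eq!(a.checked_add(1), None);
    // saturating_add: clamps at the type's bounds
    assert_eq!(a.saturating_add(1), i32::MAX);
    // overflowing_add: wraps around and reports that overflow occurred
    assert_eq!(a.overflowing_add(1), (i32::MIN, true));
    println!("ok");
}
```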
> > > >
> > > > # Performance
> > > >
> > > > * 3-15x faster reading and writing parquet, plus APIs to read it in
> > > > parallel (see [4])
> > > > * faster IPC read
> > > > * arithmetics and the like are about the same performance as arrow
> > > > 4.0 compiled with SIMD (available in nightly), and ~1.5x faster than
> > > > arrow 4.0 without SIMD and nightly.
> > > > * Some kernels degrade by about 20% due to bound checks (e.g. boolean
> > > > "take" in arrow 4.1 allows out-of-bound reads and is thus faster).
> > > > * Sort and filter are ~2x faster. See [5, 6]
> > > >
> > > > # Interoperability with DataFusion
> > > >
> > > > I have an experimental PR [7] in DataFusion. TPC-H query 1 yields
> > > > the same result and has about the same performance (datafusion can
> > > > perform out-of-bound reads from arrow...); without group bys, it is
> > > > ~2x faster.
> > > >
> > > > # Process / Community
> > > >
> > > > This is a "big chunk of code" type of situation developed over an
> > > external
> > > > repo. I tried to keep folks informed of the status and what was being
> > > done;
> > > > the mono-repo at the time was really difficult to cope with. With
> this
> > > > said:
> > > >
> > > > * I proposed an experimental repo mechanism so that we can conduct
> > > > this type of activity (now merged [8])
> > > > * I am not merging PRs that introduce new major APIs, so that the
> > > > community can weigh in
> > > >
> > > > # Licensing
> > > >
> > > > All code is licensed under MIT and Apache; contributors are required
> > > > to accept any of these as part of their contributions. I can work to
> > > > handle this part with the incubator / ASF ahead of any potential code
> > > > movement.
> > > >
> > > > -----------------
> > > >
> > > > Main question: what do we do?
> > > >
> > > > Some ideas:
> > > >
> > > > 1. PR all this to the arrow-rs repo
> > > > 2. push this to a branch in arrow-rs
> > > > 3. move this to an experimental repo within ASF and work on it until
> > > > we have feature parity (e.g. read and write nested types to/from
> > > > parquet), and then apply 1 or 2
> > > > 4. do nothing
> > > >
> > > > Concerns with option 1:
> > > > * development will continue to happen outside ASF
> > > > * no easy way to collaborate: issue tracking for this code is outside
> > > > ASF
> > > > * no easy way for the community to weigh in on changes to the API
> > > > prior to merge
> > > > Concerns with option 2:
> > > > * issue tracking across branches is confusing, especially in creating
> > > > change logs
> > > > * PRs to branches are confusing
> > > > * no easy way for the community to weigh in on changes to the API
> > > > prior to merge
> > > > Concerns with option 3:
> > > > * it would be the first experimental repo, thus some risk
> > > > Concerns with option 4:
> > > > * for the time being this would continue to be a project independent
> > > > of Apache Arrow
> > > > * I would release 0.1 to crates.io as "arrow-safe" or something, as
> > > > there is demand for it.
> > > >
> > > > I would be in favor of option 3 for the following reason: I do not
> > > > think it is useful to PR a -92k +57k change without giving the
> > > > community extensive time to evaluate, contribute, and properly
> > > > discuss it. In this context, my idea here was to give some time for
> > > > the ideas to mature within ASF, and only then even consider a switch.
> > > >
> > > > Thanks,
> > > > Jorge
> > > >
> > > > [1] https://github.com/jorgecarleitao/arrow2
> > > > [2] https://github.com/jorgecarleitao/parquet2
> > > > [3] https://github.com/apache/arrow-rs/issues?q=is%3Aopen+is%3Aissue+label%3Asecurity
> > > > [4] https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=1919295045
> > > > [5] https://jorgecarleitao.medium.com/safe-analytics-with-rust-and-arrow-564f05107dd2
> > > > [6] https://docs.google.com/spreadsheets/d/1hLKsqJaw_VLjtJCgQ635R9iHDNYwZscE0OT1omdZuwg/edit#gid=402497043
> > > > [7] https://github.com/apache/arrow-datafusion/pull/68
> > > > [8] https://issues.apache.org/jira/browse/ARROW-12643
> > > >
> > > >
> > > > On Sun, Feb 7, 2021 at 2:42 PM Jorge Cardoso Leitão <
> > > > jorgecarlei...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Over the past 4 months, I have been growing more and more
> > > > > frustrated by the amount of undefined behaviour that I am finding
> > > > > and fixing in the Rust implementation. I would like to open a
> > > > > broader discussion about the problem in light of our current
> > > > > knowledge and what Rust enables, as well as offer a solution to the
> > > > > bigger problem.
> > > > >
> > > > > Just to give you a gist of the seriousness of the issue, the
> > > > > following currently compiles, runs, and is undefined behavior in
> > > > > Rust:
> > > > >
> > > > > let buffer = Buffer::from(&[0i32, 2i32]);
> > > > > let data = ArrayData::new(DataType::Int64, 10, 0, None, 0, vec![buffer], vec![]);
> > > > > let array = Float64Array::from(Arc::new(data));
> > > > > println!("{:?}", array.value(1));
> > > > >
> > > > > I would like to propose a major refactor of the crate around the
> > > > > physical traits, Buffer, MutableBuffer and ArrayData to make our
> > > > > code type-safe at compile time, thereby preventing things like the
> > > > > example above from happening again.
> > > > >
> > > > > So far, I was able to reproduce all core features of the arrow
> > > > > crate (nested types, dynamic typing, FFI, memory alignment,
> > > > > performance) by using `Buffer<T: NativeType>` instead of `Buffer`
> > > > > and removing `ArrayData` and RawPointer altogether.
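(To illustrate the idea — a hedged sketch with hypothetical type names, not the actual arrow2 definitions — parameterizing the buffer by its native type makes the earlier i32-as-f64 reinterpretation a compile-time error rather than UB:)

```rust
// Hypothetical sketch, not arrow2's real API: once the buffer carries its
// element type, building an f64 array from an i32 buffer fails to compile.
struct Buffer<T> {
    data: Vec<T>,
}

impl<T: Copy> Buffer<T> {
    fn from_slice(slice: &[T]) -> Self {
        Buffer { data: slice.to_vec() }
    }
    fn value(&self, i: usize) -> T {
        self.data[i] // bounds-checked index: panics instead of UB
    }
}

struct PrimitiveArray<T> {
    values: Buffer<T>,
}

fn main() {
    let buffer = Buffer::from_slice(&[0i32, 2i32]);
    let array = PrimitiveArray { values: buffer };
    // let bad: PrimitiveArray<f64> = PrimitiveArray { values: buffer };
    // ^ would not compile: expected `Buffer<f64>`, found `Buffer<i32>`
    println!("{}", array.values.value(1)); // prints 2
}
```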
> > > > >
> > > > > Safety-wise, it significantly limits the usage of `unsafe` in
> > > > > higher-level APIs, it has a single transmute (the bit-chunk
> > > > > iterator one), and a guaranteed-safe public API (which is not the
> > > > > case on our master, as shown above).
> > > > >
> > > > > Performance-wise, it yields a 1.3x improvement over the current
> > > > > master (after this fix <https://github.com/apache/arrow/pull/9301>
> > > > > of UB on the take kernel, 1.7x prior to it) for the `take` kernel
> > > > > for primitives. I expect other major performance improvements.
> > > > >
> > > > > API-wise, it simplifies the traits that we have for memory layout
> > > > > as well as the handling of bitmaps, offsets, etc.
> > > > >
> > > > > The proposal is drafted as a README
> > > > > <https://github.com/jorgecarleitao/arrow2/blob/proposal/README.md>
> > > > > on a repo that I created specifically for this from the ground up,
> > > > > and the full set of changes are in a PR
> > > > > <https://github.com/jorgecarleitao/arrow2/pull/1> so that anyone
> > > > > can view and comment on it. I haven't made any PR to master because
> > > > > this is too large to track as a diff against master, and that is
> > > > > beside the point anyway.
> > > > >
> > > > > I haven't ported most of the crate, as I only tried the non-trivial
> > > > > features (memory layout, bitmaps, FFI, dynamic typing, nested
> > > > > types).
> > > > >
> > > > > I would highly appreciate your thoughts about it.
> > > > >
> > > > > Best,
> > > > > Jorge
> > > > >
> > > > >
> > > >
> > >
>
