Hi,

I would like to offer an update on this. I continued to work heavily on
this hypothesis in separate repos [1, 2], as it required a ground-up
refactor.

Below is the current status, followed by some options we can think about.

TL;DR: I would like to gauge the community's interest in making arrow2 and
parquet2 experimental repos in Apache Arrow. IMO they are safer, faster,
more maintainable and equally compatible with both the arrow spec and
parquet spec.

# Specification and interoperability

IPC:

All integration tests that run in 4.1.0 pass against apache/arrow@master.
Furthermore, arrow2 passes the following tests that 4.1.0 does not:
* big endian IPC producer and consumer
* decimal128
* interval
* nested_large_offsets

FFI: All integration tests that 4.1.0 runs pass against pyarrow==4.0.

Arrow-Flight: it is the same code; I am not sure whether the tests in 4.1.0
are passing or skipped.

parquet: arrow2 is tested against parquet files generated by pyarrow under
different configurations:
* physical and logical types
* page versions
* repetition levels
* dictionary encoding

# Safety

* arrow2 addresses all security vulnerabilities (all 20+ of our issues
labeled with "security" [3], plus more not yet captured in any issue) and
unsafety issues. In particular,
* all tests pass under MIRI checks
* all unsafe APIs are marked as unsafe
* parquet2 does not use unsafe
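
To illustrate the "all unsafe APIs are marked as unsafe" point, here is a
minimal sketch of the pattern (hypothetical types, not arrow2's actual API):
the bound-checked accessor is safe and public, while the unchecked variant
carries the `unsafe` keyword so callers must opt in explicitly.

```rust
// Hypothetical illustration of the safety pattern; not arrow2's real API.
pub struct Int32Buffer {
    values: Vec<i32>,
}

impl Int32Buffer {
    pub fn new(values: Vec<i32>) -> Self {
        Self { values }
    }

    /// Safe accessor: bound-checked, returns None when out of bounds.
    pub fn value(&self, i: usize) -> Option<i32> {
        self.values.get(i).copied()
    }

    /// Unchecked accessor: the caller must guarantee `i < len`,
    /// which is why the function is marked `unsafe`.
    pub unsafe fn value_unchecked(&self, i: usize) -> i32 {
        *self.values.get_unchecked(i)
    }
}
```

With this split, an out-of-bounds read can only happen inside an explicit
`unsafe` block, which is exactly what Miri and reviewers can audit.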

# Maintenance

* arrow + parquet has a total of 56k+36k LOC, excluding headers
* arrow2 + parquet2 has a total of 50k+7k LOC, excluding headers
* arrow2 coverage is 76%, arrow is 88%
* parquet2 is "unsafe"-free

# Features

Outside the spec (e.g. compute, utils, parquet), the crate has about 90% of
the features in 4.0. What is missing:
* nested read and write of parquet (lists and structs)
* some features added since 4.0.0

OTOH, it has the following additional features:

* API to read CSV in parallel
* API to read parquet in parallel
* checked_X, saturating_X, overflowing_X operations (i.e. non-panicking
versions of add, subtract, etc.)
* arithmetics ops over dates, timestamps and durations
* display for every logical type
* more cast operations
* merge-sort kernel
* vectorized hashing
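
As a sketch of what the non-panicking arithmetic looks like (hypothetical
free function, not arrow2's actual signature), a checked add turns an
overflowing slot into a `None` (a null, in Arrow terms) instead of panicking:

```rust
// Hypothetical sketch of a non-panicking "checked add" kernel:
// slots that overflow become None instead of aborting the program.
fn checked_add(lhs: &[i32], rhs: &[i32]) -> Vec<Option<i32>> {
    lhs.iter()
        .zip(rhs.iter())
        .map(|(l, r)| l.checked_add(*r))
        .collect()
}
```

The saturating and overflowing variants follow the same shape, delegating to
`saturating_add` and `overflowing_add` respectively.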

# Performance

* 3-15x faster reading and writing parquet, plus APIs to read in parallel
(see [4])
* faster IPC read
* arithmetics and the like match the performance of arrow 4.0 compiled with
SIMD (available in nightly), and are ~1.5x faster without SIMD and nightly.
* Some kernels degrade by about 20% due to bound checks (e.g. the boolean
"take" in arrow 4.1 allows out-of-bound reads and is thus faster).
* Sort and filter are ~2x faster. See [5, 6].
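
To make the bound-check trade-off concrete, here is a minimal sketch
(hypothetical function, not arrow2's actual kernel) of a bound-checked
`take`: an invalid index returns an error instead of reading out of bounds,
which is the price behind the ~20% degradation mentioned above.

```rust
// Hypothetical bound-checked "take": invalid indices error out
// instead of silently reading past the end of the values slice.
fn take(values: &[f64], indices: &[usize]) -> Result<Vec<f64>, String> {
    indices
        .iter()
        .map(|&i| {
            values
                .get(i)
                .copied()
                .ok_or_else(|| format!("index {} out of bounds (len {})", i, values.len()))
        })
        .collect()
}
```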

# Interoperability with DataFusion

I have an experimental PR [7] in DataFusion. TPC-H query 1 yields the same
result and has about the same performance (DataFusion can perform
out-of-bound reads from arrow...); without group bys, it is ~2x faster.

# Process / Community

This is a "big chunk of code" type of situation developed over an external
repo. I tried to keep folks informed of the status and what was being done;
the mono-repo at the time was really difficult to cope with. With this said:

* I proposed an experimental repo mechanism so that we can conduct this
type of activity (now merged [8])
* I am not merging PRs that introduce new major APIs, so that the community
can weigh in

# Licensing

All code is licensed under MIT and Apache; contributors are required to
accept these licenses as part of their contributions. I can work with the
incubator / ASF to handle this part ahead of any potential code movement.

-----------------

Main question: what do we do?

Some ideas:

1. PR all this to the arrow-rs repo
2. push this to a branch in arrow-rs
3. move this to an experimental repo within ASF and work on it until we
have feature parity (e.g. read and write nested types to/from parquet), and
then apply 1 or 2
4. do nothing

Concerns with option 1:
* development will continue to happen outside ASF
* no easy way to collaborate: issue tracking for this code stays outside ASF
* no easy way for the community to weigh in on changes to the API prior
to merge
Concerns with option 2:
* issue tracking across branches is confusing, especially when creating
change logs
* PRs to branches are confusing
* no easy way for the community to weigh in on changes to the API prior
to merge
Concerns with option 3:
* it would be the first experimental repo, thus some risk
Concerns with option 4:
* for the time being this would continue to be a project independent of
Apache Arrow
* I would release 0.1 to crates.io as "arrow-safe" or something, as there is
demand for it.

I would be in favor of option 3 for the following reason: I do not think it
is useful to PR a -92k/+57k change without giving the community extensive
time to evaluate, contribute, and hold a proper discussion. In this
context, my idea is to give the ideas some time to mature within the ASF,
and only then even consider a switch.

Thanks,
Jorge

[1] https://github.com/jorgecarleitao/arrow2
[2] https://github.com/jorgecarleitao/parquet2
[3]
https://github.com/apache/arrow-rs/issues?q=is%3Aopen+is%3Aissue+label%3Asecurity
[4]
https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=1919295045
[5]
https://jorgecarleitao.medium.com/safe-analytics-with-rust-and-arrow-564f05107dd2
[6]
https://docs.google.com/spreadsheets/d/1hLKsqJaw_VLjtJCgQ635R9iHDNYwZscE0OT1omdZuwg/edit#gid=402497043
[7] https://github.com/apache/arrow-datafusion/pull/68
[8] https://issues.apache.org/jira/browse/ARROW-12643


On Sun, Feb 7, 2021 at 2:42 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> Over the past 4 months, I have been growing more and more frustrated by
> the amount of undefined behaviour that I am finding and fixing in the Rust
> implementation. I would like to open the discussion of a broader overview
> about the problem in light of our current knowledge and what Rust enables
> as well as offer a solution to the bigger problem.
>
> Just to give you a gist of the seriousness of the issue, the following
> currently compiles, runs, and is undefined behavior in Rust:
>
> let buffer = Buffer::from(&[0i32, 2i32]);
> let data = ArrayData::new(DataType::Int64, 10, 0, None, 0, vec![buffer], vec![]);
> let array = Float64Array::from(Arc::new(data));
> println!("{:?}", array.value(1));
>
> I would like to propose a major refactor of the crate around physical
> traits, Buffer, MutableBuffer and ArrayData to make our code type-safe at
> compile time, thereby avoiding things like the example above from happening
> again.
>
> So far, I was able to reproduce all core features of the arrow crate
> (nested types, dynamic typing, FFI, memory alignment, performance) by using
> `Buffer<T: NativeType>` instead of `Buffer` and removing `ArrayData` and
> RawPointer altogether.
>
> Safety-wise, it significantly limits the usage of `unsafe` on higher end
> APIs, it has a single transmute (the bit chunk iterator one), and a
> guaranteed safe public API (which is not the case in our master, as shown
> above).
>
> Performance wise, it yields a 1.3x improvement over the current master
> (after this fix <https://github.com/apache/arrow/pull/9301> of UB in the
> take kernel; 1.7x prior to it) for the `take` kernel on primitives. I
> expect other major performance improvements as well.
>
> API wise, it simplifies the traits that we have for memory layout as well
> as the handling of bitmaps, offsets, etc.
>
> The proposal is drafted as a README
> <https://github.com/jorgecarleitao/arrow2/blob/proposal/README.md> on a
> repo that I created specifically for this from the ground up, and the full
> set of changes are in a PR
> <https://github.com/jorgecarleitao/arrow2/pull/1> so that anyone can view
> and comment on it. I haven't made any PR to master because this is too
> large to track as a diff against master, and is beside the point, anyway.
>
> I haven't ported most of the crate as I only tried the non-trivial
> features (memory layout, bitmaps, FFI, dynamic typing, nested types).
>
> I would highly appreciate your thoughts about it.
>
> Best,
> Jorge
>
>
