Hi,

As briefly discussed in a recent email thread, I have been experimenting
with re-writing the Rust parquet implementation. I have not advertised this
much as I was very sceptical that this would work. I am now confident that
it can work, and would thus like to share more details.

parquet2 [1] is a rewrite of the parquet crate with security,
performance, and parallelism as core requirements.

Here are the highlights so far:

- Security: *no use of unsafe*. All invariants about memory and thread
safety are proven by the Rust compiler (an audit of its 3 mandatory + 5
optional compressors is still required). Compare, e.g., ARROW-10920.

- Performance: to the best of my benchmarking capabilities, it is *3-15x
faster* than the parquet crate at both reading and writing to arrow, and
about as fast as pyarrow/C++. These numbers correspond to a single PLAIN
page with 10% nulls, and the gap widens with increasing slot count /
page size (which IMO is a relevant unit of work). See [2] for plots,
numbers and references to the exact commits.

- Features: it reads parquet's optional primitive types, V1 and V2
pages, dictionary- and non-dictionary-encoded pages, repetition and
definition levels, and metadata. It reads 1-level nullable lists. It
writes non-dictionary V1 pages with PLAIN and RLE encoding. No
delta-encoding yet; no statistics yet.

- Integration: it is integration-tested against parquet files generated
by pyarrow==3, and the write path is covered by round-trip tests.

The public API consists of plain functions and generic iterators. An
important design choice is a strict separation between IO-bound
operations (read and seek) and CPU-bound operations (decompress, decode,
deserialize). This gives consumers (read: datafusion, polars, etc.) the
choice of how to parallelize the work among threads, as the sketch below
illustrates.
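To make the split concrete, here is a minimal, self-contained sketch of
the intended consumption pattern. The `CompressedPage` struct and the
loop bodies are stand-ins for parquet2's page iterator and its
decompression/decoding helpers, not its actual API:

    use std::sync::mpsc::channel;
    use std::thread;

    // Stand-in for a compressed page as read from disk.
    struct CompressedPage(Vec<u8>);

    fn main() {
        let (tx, rx) = channel::<CompressedPage>();

        // CPU-bound half: decompression and decoding run on a worker
        // thread; a consumer could equally hand pages to rayon, a
        // thread pool, or async tasks.
        let cpu = thread::spawn(move || {
            for page in rx {
                // a consumer would call the decompress/decode helpers
                // here and deserialize into its own in-memory format
                let _bytes = page.0.len();
            }
        });

        // IO-bound half: only reads and seeks happen on this thread;
        // the dummy pages below stand in for what the page iterator
        // would yield from a file.
        for _ in 0..3 {
            tx.send(CompressedPage(vec![0u8; 1024])).unwrap();
        }
        drop(tx); // close the channel so the worker finishes
        cpu.join().unwrap();
    }

Because only the channel couples the two halves, consumers can replace
the single worker with whatever scheduling strategy fits them best,
without touching the IO side.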

I investigated async support and, AFAIU, we first need to add it to the
thrift crate [3], as it currently has no API based on the
futures::AsyncRead and futures::AsyncSeek traits.

parquet2 is independent of the in-memory model; it just exposes an API
to read the parquet format according to the spec. It delegates to
consumers the decision of how to deserialize pages (I implemented it for
arrow2 and for native Rust), offering a toolkit to help them. IMO this
is important because it should be up to the in-memory representation to
decide how to best convert a decompressed page into memory. The sketch
below shows the idea.
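As an illustration (both function names are hypothetical, and the
`values` / `def_levels` inputs stand in for what the decoding toolkit
hands to a consumer), the same decoded page content can be deserialized
into two different in-memory models:

    // Native Rust model: one allocation holding options.
    fn to_native(values: &[i32], def_levels: &[u8]) -> Vec<Option<i32>> {
        let mut iter = values.iter();
        def_levels
            .iter()
            .map(|&d| if d == 1 { iter.next().copied() } else { None })
            .collect()
    }

    // Arrow-style model: a dense values buffer plus a validity vector
    // (a real implementation would pack validity into a bitmap).
    fn to_arrow_like(values: &[i32], def_levels: &[u8]) -> (Vec<i32>, Vec<bool>) {
        let mut buffer = Vec::with_capacity(def_levels.len());
        let mut validity = Vec::with_capacity(def_levels.len());
        let mut iter = values.iter();
        for &d in def_levels {
            let valid = d == 1;
            validity.push(valid);
            // arrow keeps a slot for nulls too; 0 is an arbitrary filler
            buffer.push(if valid { *iter.next().unwrap() } else { 0 });
        }
        (buffer, validity)
    }

    fn main() {
        let values = [10, 20, 30];        // non-null values, as decoded
        let def_levels = [1, 0, 1, 1, 0]; // 1 = present, 0 = null
        assert_eq!(
            to_native(&values, &def_levels),
            vec![Some(10), None, Some(20), Some(30), None]
        );
        let (buffer, validity) = to_arrow_like(&values, &def_levels);
        assert_eq!(buffer, vec![10, 0, 20, 30, 0]);
        assert_eq!(validity, vec![true, false, true, true, false]);
    }

Neither layout is privileged by parquet2 itself; each consumer pays only
for the representation it actually wants.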

Development is happening on my own repo, but I was hoping to bring it
to the ASF (an experimental repo?) if you think that Apache Arrow could
be a place to host it (Apache Parquet is another option?).

[1] https://github.com/jorgecarleitao/parquet2
[2]
https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=0
[3] https://issues.apache.org/jira/browse/THRIFT-4777

Best,
Jorge
