Hi,

Thanks a lot for your feedback. I agree with all the arguments put
forward, including Andrew's point about the size of the change.

I tried a gradual migration 4 months ago, but it was really difficult
and I gave up. I estimate that the work involved is half the work of
writing parquet2 and arrow2 in the first place. The internal dependency
on ArrayData (the main culprit of the unsafe code) in arrow-rs is so
prevalent that all core components would need to be rewritten from
scratch (IPC, FFI, IO, array/transform/*, compute, SIMD). I personally
do not have the motivation to do that, though.

Jed, the public API changes are small for end users. A typical
migration is shown in [1]. I agree that we can further reduce the
change-set by keeping legacy interfaces available.
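To give a flavour of the kind of change, here is a minimal
before/after sketch (illustrative only; the exact arrow2 names and
constructors below are assumptions, and the authoritative diff is [1]):

// Before, with the existing arrow crate:
use arrow::array::Int32Array;

fn ones() -> Int32Array {
    Int32Array::from(vec![1, 1, 1])
}

// After, with arrow2 -- assuming the Int32Array alias and a from_slice
// constructor exist there; for most callers the change is the import:
//
// use arrow2::array::Int32Array;
//
// fn ones() -> Int32Array {
//     Int32Array::from_slice(&[1, 1, 1])
// }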

Andy, on my machine, the current benchmarks on query 1 yield:

type                                             master (ms)  arrow2+parquet2 PR [2] (ms)
memory (-m)                                            332.9                        239.6
load (initial time in -m with --format parquet)       5286.0                       3043.0
parquet format                                        1316.1                        930.7
tbl format                                            5297.3                       5383.1

i.e. I am observing some improvements. Queries with joins are still
slower. The pruning of parquet row groups and pages based on statistics
is not yet implemented; I am working on it.
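To be clear about what that pruning means: skip any row group whose
min/max statistics show that a predicate cannot match, so the group is
never read from disk. A rough, self-contained sketch with made-up types
(not the parquet2 API):

// Hypothetical stats type, for illustration only.
struct RowGroupStats {
    min: i64,
    max: i64,
}

// Keep only the row groups whose [min, max] range can possibly satisfy
// the predicate `column == value`; the rest are skipped entirely.
fn prune_row_groups(groups: &[RowGroupStats], value: i64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, stats)| stats.min <= value && value <= stats.max)
        .map(|(index, _)| index)
        .collect()
}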

I agree that this should go through IP clearance. I will start this
process. My thinking would be to create two empty repos on apache/*,
open one PR from the main branch of each of my repos against them, and
only merge once IP is cleared. Would that be a reasonable process, Wes?

Names: arrow-experimental-rs2 and arrow-experimental-rs-parquet2, or
something else?

Best,
Jorge

[1]
https://github.com/apache/arrow-datafusion/pull/68/files#diff-2ec0d66fd16c73ff72a23d40186944591e040507c731228ad70b4e168e2a4660
[2] https://github.com/apache/arrow-datafusion/pull/68


On Fri, May 28, 2021 at 5:22 AM Josh Taylor <joshuatayl...@gmail.com> wrote:

> I played around with it. For my use case, I really like the new way of
> writing CSVs; it's much more obvious. I love the `read_stream_metadata`
> function as well.
>
> I'm seeing a very slight speed improvement (~8ms) on my end, but I read
> a bunch of files in a directory and spit out a CSV; the bottleneck is
> parsing lots of files, though it's pretty quick per file.
>
> old:
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224
> bytes took 1ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144
> bytes took 1ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10
> 17127928 bytes took 159ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11
> 17127144 bytes took 160ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12
> 17130352 bytes took 158ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13
> 17128544 bytes took 158ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14
> 17128664 bytes took 158ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15
> 17128328 bytes took 158ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16
> 17129288 bytes took 158ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17
> 17131056 bytes took 158ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18
> 17130344 bytes took 158ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19
> 17128432 bytes took 160ms
>
> new:
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224
> bytes took 1ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144
> bytes took 1ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10
> 17127928 bytes took 157ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11
> 17127144 bytes took 152ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12
> 17130352 bytes took 154ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13
> 17128544 bytes took 153ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14
> 17128664 bytes took 154ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15
> 17128328 bytes took 153ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16
> 17129288 bytes took 152ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17
> 17131056 bytes took 153ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18
> 17130344 bytes took 155ms
> /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19
> 17128432 bytes took 153ms
>
> I'm going to chunk the dirs to speed up the reads and throw it into a par
> iter.
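>
> Roughly something like the following sketch (the per-file work below is
> a placeholder for the real parse-and-write-CSV step):
>
> use rayon::prelude::*;
> use std::fs;
>
> // Sketch: process every file in a directory in parallel; the println
> // stands in for the real work (parse the Arrow stream, write CSV).
> fn process_dir(dir: &str) -> std::io::Result<()> {
>     let paths: Vec<_> = fs::read_dir(dir)?
>         .filter_map(|entry| entry.ok().map(|e| e.path()))
>         .collect();
>     paths.par_iter().for_each(|path| {
>         if let Ok(bytes) = fs::read(path) {
>             println!("{} {} bytes", path.display(), bytes.len());
>         }
>     });
>     Ok(())
> }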
>
> On Fri, 28 May 2021 at 09:09, Josh Taylor <joshuatayl...@gmail.com> wrote:
>
> > Hi!
> >
> > I've been using arrow/arrow-rs for a while now; my use case is to
> > parse Arrow streaming files and convert them into CSV.
> >
> > Rust has been an absolutely fantastic tool for this: the performance
> > is outstanding and I have had no issues using it for my use case.
> >
> > I would be happy to test out the branch and let you know what the
> > performance is like. I was going to improve the current implementation
> > that I have for the CSV writer anyway, as it takes a while for bigger
> > datasets (multi-GB).
> >
> > Josh
> >
> >
> > On Thu, 27 May 2021 at 22:49, Jed Brown <j...@jedbrown.org> wrote:
> >
> >> Andy Grove <andygrov...@gmail.com> writes:
> >> >
> >> > Looking at this purely from the DataFusion/Ballista point of view,
> >> > what I would be interested in would be having a branch of DF that
> >> > uses arrow2, and once that branch has all tests passing and can run
> >> > queries with performance that is at least as good as the original
> >> > arrow crate, then cut over.
> >> >
> >> > However, for developers using the arrow APIs directly, I don't see
> >> > an easy path. We either try to gradually PR the changes in (which
> >> > seems really hard given that there are significant changes to APIs
> >> > and internal data structures) or we port some portion of the
> >> > existing tests over to arrow2 and then make that the official crate
> >> > once all tests pass.
> >>
> >> How feasible would it be to make a legacy module in arrow2 that would
> >> enable (some large subset of) existing arrow users to try arrow2 after
> >> adjusting their use statements? (That is, implement the public-facing
> >> legacy interfaces in terms of arrow2's new, safe interface.) This would
> >> make it easier to test with DataFusion/Ballista and external users of
> >> the current arrow crate, then cut over and let those packages update
> >> incrementally from legacy to modern arrow2.
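> >>
> >> Concretely, something like the following sketch (every name here is
> >> hypothetical; the point is only that the old surface is kept while it
> >> forwards to a safe backing implementation):
> >>
> >> pub mod legacy {
> >>     // Looks like the old builder API, but is backed by safe code; a
> >>     // real shim would hand back an arrow2 array from finish().
> >>     pub struct Int32Builder {
> >>         values: Vec<i32>,
> >>     }
> >>
> >>     impl Int32Builder {
> >>         pub fn new() -> Self {
> >>             Self { values: Vec::new() }
> >>         }
> >>         pub fn append_value(&mut self, v: i32) {
> >>             self.values.push(v);
> >>         }
> >>         pub fn finish(self) -> Vec<i32> {
> >>             self.values
> >>         }
> >>     }
> >> }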
> >>
> >> I think it would be okay to tolerate some performance degradation when
> >> working through these legacy interfaces, so long as there was
> >> confidence that modernizing the callers would recover the performance
> >> (as tests have been showing).
> >>
> >
>
