Hey Wes,

I appreciate your comments and want to be clear that I am not blocking this
addition.

Memory mapping itself is not in conflict with my comments. However, since
Arrow datasets are rarely persisted on disk today, a user can choose
between smaller and larger batches. When that choice is available, I have
a hard time seeing real disadvantages to 1000 batches of a billion records
each versus one batch of a trillion records.
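
To make that concrete, here is a minimal pyarrow sketch (the table
contents are made up; the only point is that batch size is a knob the
producer controls):

    import pyarrow as pa

    # Made-up data; the point is only that batch size is a producer choice.
    table = pa.table({"id": pa.array(range(1_000_000), type=pa.int64())})

    # One logical dataset, split into many smaller record batches.
    batches = table.to_batches(max_chunksize=64 * 1024)
    print(len(batches), batches[0].num_rows)  # 16 65536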

I understand the argument and the need for LargeBinary data in general.
However, in those situations it isn't clear to me what benefit a columnar
representation provides when the data is laid out end to end. At that
point, you're probably much better off storing the items individually and
using some form of indirection/addressing from an Arrow structure to the
independent large objects.
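
As a rough sketch of what I mean (the field names here are made up, not
anything in the spec), the Arrow side can carry small fixed-width
references while the big payloads live elsewhere:

    import pyarrow as pa

    # Hypothetical reference column: point at externally stored objects
    # rather than inlining gigabytes of bytes into a LargeBinary column.
    blob_ref = pa.struct([
        pa.field("uri", pa.string()),
        pa.field("offset", pa.int64()),
        pa.field("length", pa.int64()),
    ])

    refs = pa.array(
        [
            {"uri": "s3://bucket/blob-0", "offset": 0, "length": 5_000_000_000},
            {"uri": "s3://bucket/blob-1", "offset": 0, "length": 7_500_000_000},
        ],
        type=blob_ref,
    )
    table = pa.table({"payload_ref": refs})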

This all comes down to how much Arrow needs to be all things to all people.
I don't dispute that there are use cases for this stuff. I just wonder how
much any of the structural elements of Arrow benefit such use cases (beyond
a nice set of libraries to work with).

> Personally I would rather address the 64-bit offset issue now so that I
> stop hearing the objection from would-be users (I can count a dozen or so
> occasions where I've been accosted in person over this issue at conferences
> and elsewhere). It would be a good idea to recommend a preference for
> 32-bit offsets in our documentation.
>
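
For reference, and assuming the large_* variants land roughly as proposed,
that documentation preference would amount to reaching for the
32-bit-offset types by default and the 64-bit ones only when the ~2 GB
per-array limit actually bites (sketch only):

    import pyarrow as pa

    # Default: 32-bit offsets; total bytes per array must stay under ~2 GB.
    small = pa.array([b"payload"], type=pa.binary())

    # 64-bit offsets: only when the 32-bit limit is a real problem.
    big = pa.array([b"payload"], type=pa.large_binary())

    print(small.type, big.type)  # binary large_binary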

I wish we could understand how much of this is people trying to force-fit
Arrow into preconceived notions versus true needs that are complementary to
the ones the Arrow community benefits from today. (I know this will just
continue to be a wish.) I know that as an engineer I am great at pointing
out potential constraints in technologies I haven't yet used or fully
understood. I wonder how many others have the same failing when looking at
Arrow :D

J
