Some feedback on the Arrow columnar format

Wes McKinney Fri, 29 Jun 2018 05:13:52 -0700

hi folks,

I received the following feedback from the author of FlatCC
(Flatbuffers in C) about the Arrow columnar format spec some months
ago and he gave me permission to post on the mailing list. There may
be some refinements we may wish to make and so I hope this leads to
some fruitful discussions and follow-up JIRAs for further analysis.


Thanks,
Wes

---------------------------

1. The IPC stream header has a size prefix. The FlatBuffer will
not be properly aligned after the size field if this is added
naively. FlatBuffers has an option to add a properly aligned size
prefix but the end of the buffer might not be aligned to 8
bytes. I know it isn’t in FlatCC, currently, at least not
trivially. I think C++ pads the buffer up to the buffers natural
alignment, but that might only be 4 bytes. If so, the length will
be too short. You can modify the size manually after the fact to
fix this.

I think FlatBuffers might need a pad operation to support your
use case, but the doc could at least point to this issue - make
sure to use the size prefix option, and adjust length
subsequently if extra padding is required.

2. The spec is vague about optionally leaving out nullable
bitmaps, but all data types are now nullable. Thus, Utf8 type
appears to always omit the nullable buffer, while other types
might have the bitmap length 0. This creates a problem because
you need to know how many buffers to skip with going through a
record batch. You MUST know implicitly if a nullable buffer is
present or not, and this isn’t documented. One solution is to
define that a nullable buffer MUST be absent if the nullable
length is 0, another option is to require the buffer to always be
present, even if it is null, and even if it is Utf8.

The documentation uses list<char> for strings some places, an
this would imply a nullable bitmap, and other places Utf8 is
used.

3. Struct_ is not very nice. Are you sure that Struct is keyword, as opposed to
lower case struct? It is particularly bad because the name will appear in JSON
representation. This ought to be fixed somehow.

4. I understand the rationale behind using integers instead of unsigned
integers in many places. However, I think it is a mistake to not have unsigned
integers in the Schema.fbs type because this makes it impossible to carry
unsigned integers in their native form within Arrow. I’m fine with stating that
support is optional and should not be the default and all, but I think it
should be possible. It could also be a signed : bool = true flag in the Field
table.

5. I’m not really convinced myself, but the timestamp type does not allow for a
time zone except for metadata. I have a very concrete use case where this would
not work: You need to know if someone visited a web site during business hours
and you collect data over multiple timezones, so you need the timezone of the
source IP. You may perhaps argue that this information could be stored in a
separate column as a time type, or a tz string, and I would not disagree -
which is why I am not convinced. However, this ties two separate columns
closely together where you might want a consolidated view of the time. However,
it is much worse because a time zone does not account for daylight saving
etc. so you need a full Olson tz to be sure, so it could be a mess to try to
support.

Another option is to have timezones associated with a batch which is something
I have considered in a another context. It would not always work - e.g. when
the zone keeps change for every row, but it is interesting in terms of being
expressive and compact, and you might want to cluster data by time zone even if
not appearing so originally.

6. A file header might actually exist in a stream. You might want the index of
the file footer while having the ability to have aligned content of the stream
type. I think files should try harder to be just a more elaborate stream type.

7. For interprocess communication, it may be relevant to look at the netmap API
and possibly vale (thought it might be a bit overengineered). It is not just
for fast networking but also for fast IPC. The basic model is very well suited
for processing either Arrow pages, or multiple IPC messages.


http://info.iet.unipi.it/~luigi/netmap/
http://info.iet.unipi.it/~luigi/vale/
https://www.freebsd.org/cgi/man.cgi?query=netmap&sektion=4

8. I am sort of active in the development of the QUIC protocol, and I think
there is a huge synergy between Arrrow Stream IPC and QUIC. QUIC has multiple
async streams. Each batch record could trivially be sent in a separate QUIC
stream. The stream knows its own length, so you can skip some header info. This
means you can write to individual columns below batch level because you have a
stream per column. You can have a separate stream to record batches. There is
race condition because a batch may arrive before the dependent data - a problem
also known from header compression in QUIC/HTTP, but not something you cannot
manage - it only affects how much you have to block. In conclusion, there is an
obvious use case for Arrow over QUIC. It could be embedded in QUIC/HTTP but I
think a dedicated application protocol would be better. Next question is how
you deal with multiple IPC messages on the same connection - which is perfectly
possible but needs a little coordination.

Some feedback on the Arrow columnar format

Reply via email to