Hi folks, some months ago I received the following feedback about the Arrow columnar format spec from the author of FlatCC (FlatBuffers in C), and he gave me permission to post it on the mailing list. There may be some refinements we wish to make, so I hope this leads to some fruitful discussions and follow-up JIRAs for further analysis.
Thanks,
Wes

---------------------------

1. The IPC stream header has a size prefix. The FlatBuffer will not be properly aligned after the size field if this is added naively. FlatBuffers has an option to add a properly aligned size prefix, but the end of the buffer might not be aligned to 8 bytes. I know it isn't in FlatCC, currently, at least not trivially. I think C++ pads the buffer up to the buffer's natural alignment, but that might only be 4 bytes. If so, the length will be too short. You can modify the size manually after the fact to fix this. I think FlatBuffers might need a pad operation to support your use case, but the doc could at least point to this issue - make sure to use the size prefix option, and adjust the length subsequently if extra padding is required.

2. The spec is vague about optionally leaving out nullable bitmaps, but all data types are now nullable. Thus, the Utf8 type appears to always omit the nullable buffer, while other types might have a bitmap of length 0. This creates a problem because you need to know how many buffers to skip when going through a record batch. You MUST know implicitly whether a nullable buffer is present or not, and this isn't documented. One solution is to define that a nullable buffer MUST be absent if the nullable length is 0; another option is to require the buffer to always be present, even if it is null, and even if it is Utf8. The documentation uses list<char> for strings in some places, and this would imply a nullable bitmap, while in other places Utf8 is used.

3. Struct_ is not very nice. Are you sure that Struct is a keyword, as opposed to lower case struct? It is particularly bad because the name will appear in the JSON representation. This ought to be fixed somehow.

4. I understand the rationale behind using integers instead of unsigned integers in many places.
However, I think it is a mistake to not have unsigned integers in the Schema.fbs type because this makes it impossible to carry unsigned integers in their native form within Arrow. I'm fine with stating that support is optional and should not be the default and all, but I think it should be possible. It could also be a signed : bool = true flag in the Field table.

5. I'm not really convinced myself, but the timestamp type does not allow for a time zone except for metadata. I have a very concrete use case where this would not work: you need to know if someone visited a web site during business hours and you collect data over multiple time zones, so you need the time zone of the source IP. You may perhaps argue that this information could be stored in a separate column as a time type, or a tz string, and I would not disagree - which is why I am not convinced. However, this ties two separate columns closely together where you might want a consolidated view of the time. It is also much worse than that, because a time zone does not account for daylight saving etc., so you need a full Olson tz to be sure, and it could be a mess to try to support. Another option is to have time zones associated with a batch, which is something I have considered in another context. It would not always work - e.g. when the zone keeps changing for every row - but it is interesting in terms of being expressive and compact, and you might want to cluster data by time zone even if not appearing so originally.

6. A file header might actually exist in a stream. You might want the index of the file footer while having the ability to have aligned content of the stream type. I think files should try harder to be just a more elaborate stream type.

7. For interprocess communication, it may be relevant to look at the netmap API and possibly vale (though it might be a bit overengineered). It is not just for fast networking but also for fast IPC.
The basic model is very well suited for processing either Arrow pages, or multiple IPC messages.

http://info.iet.unipi.it/~luigi/netmap/
http://info.iet.unipi.it/~luigi/vale/
https://www.freebsd.org/cgi/man.cgi?query=netmap&sektion=4

8. I am sort of active in the development of the QUIC protocol, and I think there is a huge synergy between Arrow Stream IPC and QUIC. QUIC has multiple async streams. Each record batch could trivially be sent in a separate QUIC stream. The stream knows its own length, so you can skip some header info. This means you can write to individual columns below batch level because you have a stream per column. You can have a separate stream to record batches. There is a race condition because a batch may arrive before the dependent data - a problem also known from header compression in QUIC/HTTP, but not something you cannot manage - it only affects how much you have to block. In conclusion, there is an obvious use case for Arrow over QUIC. It could be embedded in QUIC/HTTP but I think a dedicated application protocol would be better. The next question is how you deal with multiple IPC messages on the same connection - which is perfectly possible but needs a little coordination.
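
To make point 1 concrete, here is a rough sketch of the "adjust the length after the fact" idea. It is not how Arrow, FlatBuffers, or FlatCC implement framing. It assumes a 4-byte little-endian length prefix and an 8-byte alignment target, treats the FlatBuffer as opaque bytes, and all names (frame_message, read_frames, padded_length) are made up for illustration. It only covers padding at the end of the message, not the alignment of the FlatBuffer contents after the prefix.

```python
import struct

ALIGNMENT = 8    # assumed target alignment in bytes
PREFIX_SIZE = 4  # assumed: 32-bit little-endian length prefix


def padded_length(n, alignment=ALIGNMENT):
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def frame_message(payload: bytes) -> bytes:
    """Prefix payload with a length and zero-pad the frame to 8 bytes.

    The stored length includes the padding, so a reader that trusts
    the prefix lands exactly on the start of the next frame.
    """
    total = padded_length(PREFIX_SIZE + len(payload))
    body_len = total - PREFIX_SIZE
    padding = b"\x00" * (body_len - len(payload))
    return struct.pack("<i", body_len) + payload + padding


def read_frames(stream: bytes):
    """Yield each frame body (payload plus padding) from a byte string."""
    offset = 0
    while offset < len(stream):
        (body_len,) = struct.unpack_from("<i", stream, offset)
        start = offset + PREFIX_SIZE
        yield stream[start:start + body_len]
        offset = start + body_len
```

If the prefix recorded only the unpadded payload size, the reader in this sketch would land in the middle of the padding, which is the "length will be too short" problem described in point 1.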