Hi Jacob, I think translation between Arrow and Protobuf could be useful. Given your use-case I'd suggest considering a few things:
1. If the end goal is to work with Parquet, then you might consider building the layer directly on top of the Parquet low-level API instead of involving Arrow. For sparsely populated nested messages, you will avoid unneeded construction and destruction (I think parquet-MR also has some optimizations to avoid cache churn in these situations). Keeping data in row form also adds the option of buffering and sorting, which can greatly improve storage and metadata pruning efficiency. Working directly with the Parquet API would also reduce the metadata you would have to plumb through if you wanted to do schema resolution using protobuf tag numbers (instead of field names).

2. If the data you are working with isn't mutable (i.e. append-only), it might be worth experimenting with code-gen that wraps Arrow Builder classes with an idiomatic, expressive API instead of adding an extra level of indirection.

3. If we do pursue the APIs above, I think they are more naturally written in terms of RecordBatch or RecordBatchReader instead of Tables.

4. Defining the scope of protobuf handling will be important. Protobuf extensions [1] have some sharp edges to incorporate. How "oneof" fields are handled would also be something to consider.

> I'm also a bit curious to see if arrow allows faster deserialization when
> compared to a list of serialized protos on disk

It depends what you mean by this. If you want to deserialize from Parquet back to proto, I would be pretty surprised (but not completely) if going through Arrow is more efficient, especially for deeply nested "sparse" messages. The protobuf reflection implementation has a high overhead for setting individual fields, and as mentioned above, nested sparse messages probably incur some level of tax that is avoided with serialized protobufs.
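To make the tag-number point in item 1 concrete, here is a minimal sketch of resolving columns by protobuf tag number rather than by field name. It uses only the standard library and is not Arrow or Parquet API: `FieldInfo` and `ResolveByTag` are hypothetical names, and it assumes each written column's tag number has been recorded somewhere readable (for example in field-level metadata). It needs C++17 for `std::optional`.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for per-column metadata: the protobuf tag number
// and the field name the column was written (or is being read) under.
struct FieldInfo {
  int32_t tag;
  std::string name;
};

// For each field in the reader schema, return the index of the writer
// column that carries the same tag number. Names are ignored, so a renamed
// field still resolves. A tag the writer never saw yields std::nullopt.
std::vector<std::optional<std::size_t>> ResolveByTag(
    const std::vector<FieldInfo>& writer,
    const std::vector<FieldInfo>& reader) {
  // Index the writer's columns by tag number.
  std::map<int32_t, std::size_t> tag_to_column;
  for (std::size_t i = 0; i < writer.size(); ++i) {
    tag_to_column[writer[i].tag] = i;
  }

  std::vector<std::optional<std::size_t>> resolved;
  resolved.reserve(reader.size());
  for (const FieldInfo& field : reader) {
    auto it = tag_to_column.find(field.tag);
    if (it == tag_to_column.end()) {
      resolved.push_back(std::nullopt);  // field added after the file was written
    } else {
      resolved.push_back(it->second);
    }
  }
  return resolved;
}
```

With a writer schema of `{1, "id"}, {2, "name"}` and a reader schema of `{2, "display_name"}, {3, "phone"}`, the renamed field resolves to writer column 1 and the new field resolves to nothing, which is the behavior you would want for backwards-compatible proto changes. Going through Arrow, this tag information would have to be carried along as extra schema metadata.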
[1] https://developers.google.com/protocol-buffers/docs/proto#extensions

Cheers,
Micah

On Mon, Jan 3, 2022 at 1:27 PM Jacob Huffman <jacobshuff...@gmail.com> wrote:

> Hey all,
>
> Is there much interest in adding the capability to do Arrow <=> Protobuf
> conversion in C++?
>
> I'm working on this for a side project, but I was wondering if there is
> much interest from the broader Arrow community. If so, I might be able to
> find time to contribute it.
>
> To get the point across, here is a strawman API. In reality, we would
> likely need some sort of builder API which allows incrementally adding
> protos and a generator-like API for returning the protos from a table.
>
> """
> // Pair of functions using templates to work with any message type
> template <class T>
> Result<std::shared_ptr<Table>> ProtosToTable(const std::vector<T>& protos);
>
> template <class T>
> Result<std::vector<T>> TableToProtos(const std::shared_ptr<Table> table);
>
> // Pair of functions using google::protobuf::Message and polymorphism to
> // work with any message type
> Result<std::shared_ptr<Table>> ProtosToTable(
>     const std::vector<google::protobuf::Message*>& protos);
>
> // I don't like that this returns a vector of unique pointers. Is there a
> // better way to return a vector of base classes while retaining polymorphic
> // behavior?
> Result<std::vector<std::unique_ptr<google::protobuf::Message>>>
> TableToProtos(const std::shared_ptr<Table> table,
>               const google::protobuf::Descriptor* descriptor);
> """
>
> My particular use case for these functions is that I would like to use
> protobufs for the in-memory data representation as it provides strongly
> typed classes which are very expressive (can have nested/repeated fields)
> and a well established path for schema evolution. However, I would like to
> use parquet as the data storage layer (and arrow as the glue between the
> two) so that I can take advantage of technologies like presto for querying
> the data. I'm hoping that backwards compatible changes to the proto schema
> turn into backwards compatible changes in the parquet files. I'm also a bit
> curious to see if arrow allows faster deserialization when compared to a
> list of serialized protos on disk.
>