hi Kohei,

On Thu, Jan 3, 2019 at 7:14 PM Kohei KaiGai <kai...@heterodb.com> wrote:
>
> Hello,
>
> I'm now trying to understand the Apache Arrow format for my application.
> Is there a format specification document including meta-data layout?
>
> I checked out the description at:
> https://github.com/apache/arrow/tree/master/docs/source/format
> https://github.com/apache/arrow/tree/master/format
>
> The format/IPC.rst says an arrow file has the format below:
>
> <magic number "ARROW1">
> <empty padding bytes [to 8 byte boundary]>
> <STREAMING FORMAT>
> <FOOTER>
> <FOOTER SIZE: int32>
> <magic number "ARROW1">
>
> Then, STREAMING FORMAT begins from SCHEMA-message.
> The message chunk has the format below:
>
> <metadata_size: int32>
> <metadata_flatbuffer: bytes>
> <padding>
> <message body>
>
> I made an arrow file using pyarrow [*1]. It has the following binary.
>
> [kaigai@saba ~]$ cat /tmp/sample.arrow | od -Ax -t x1 | head -16
> 000000 41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
> 000010 00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
> 000020 00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
> 000030 04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00
> 000040 01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
> 000050 08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
> 000060 70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61
> 000070 6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22
> 000080 30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e
> 000090 73 22 3a 20 5b 7b 22 6d 65 74 61 64 61 74 61 22
> 0000a0 3a 20 6e 75 6c 6c 2c 20 22 6e 75 6d 70 79 5f 74
> 0000b0 79 70 65 22 3a 20 22 69 6e 74 36 34 22 2c 20 22
> 0000c0 6e 61 6d 65 22 3a 20 22 69 64 22 2c 20 22 66 69
> 0000d0 65 6c 64 5f 6e 61 6d 65 22 3a 20 22 69 64 22 2c
> 0000e0 20 22 70 61 6e 64 61 73 5f 74 79 70 65 22 3a 20
> 0000f0 22 69 6e 74 36 34 22 7d 2c 20 7b 22 6d 65 74 61
>
> The first 64bit is "ARROW1\0\0\0", and the next 32bit is 0x058c (=1420)
> that is reasonable for SCHEMA-message length.
> The next 32bit is 0x0010 (=16). It may be metadata_size of the FlatBuffer.
> The IPC.rst does not mention about FlatBuffer metadata, so I tried to skip
> next 16bytes, expecting message body begins at 0x000020.

The Schema message has no message body; all of its content is in the
metadata (i.e. a Flatbuffer). Take a look at the C++ implementation:

* File preamble plus padding:
  https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L946
* Write schema
  * from here
    https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L790
  * to here
    https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L939

The flatbuffer size 1420 is written here

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L954

followed by the Schema Flatbuffer message

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L957

followed by any padding bytes.

Thus, the file layout should look like this:

8 bytes: preamble
BEGIN MESSAGE 1 (Schema)
4 bytes: metadata size
1420 bytes: metadata (as described in Message.fbs) plus padding
0 bytes: body (Schema has no body)

I would be happy to clarify the specification document to make this
clearer if you can suggest some improvements.

- Wes

> However, the first 16bit (version) is 0x0001 (=V2), the next byte is 0x03
> (= RecordBatch, not Schema!), and the following 64bit is 0x0a000000000010(!).
> It is obviously I'm understanding incorrectly.
>
> Is there documentation stuff to introduce detailed layout of the arrow format?
>
> Thanks,
>
> [*1] Steps to make a sample arrow file
> $ python3.5
> >>> import pyarrow as pa
> >>> import pandas as pd
> >>> X = pd.read_sql(sql="SELECT * FROM hogehoge LIMIT 1000",
> >>> con="postgresql://localhost/postgres")
> >>> Y = pa.Table.from_pandas(X)
> >>> f = pa.RecordBatchFileWriter('/tmp/sample.arrow', Y.schema)
> >>> f.write_table(Y)
> >>> f.close()
>
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei <kai...@heterodb.com>
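
As a rough check of the layout described in the reply above, here is a minimal Python sketch that reads the file prefix using only the standard `struct` module. The function name `read_file_prefix` is hypothetical and not part of pyarrow; the input is just the first 16 bytes of the `od` dump from the original question. It assumes the layout exactly as listed in the reply (8-byte preamble, then a little-endian int32 metadata size) and nothing about later format versions. The comment about the value 16 relies on the general FlatBuffers convention that a buffer starts with a 4-byte offset to its root table.

```python
import struct

MAGIC = b"ARROW1"
PREAMBLE_LEN = 8  # magic number padded to an 8-byte boundary


def read_file_prefix(buf: bytes):
    """Return (metadata_size, metadata_offset) for the first message.

    Hypothetical helper for illustration; assumes the layout from the
    reply: 8-byte preamble followed by a little-endian int32 size.
    """
    if buf[:len(MAGIC)] != MAGIC:
        raise ValueError("buffer does not start with the ARROW1 magic")
    (metadata_size,) = struct.unpack_from("<i", buf, PREAMBLE_LEN)
    metadata_offset = PREAMBLE_LEN + 4  # metadata starts right after the size
    return metadata_size, metadata_offset


# First 16 bytes of the od dump quoted in the question.
sample = bytes.fromhex("41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00")

size, offset = read_file_prefix(sample)

# The 4 bytes at the start of the metadata are part of the Flatbuffer
# itself (its root table offset), not a second size field.
(root_offset,) = struct.unpack_from("<I", sample, offset)

# The Schema message has a 0-byte body, so under this layout the next
# message would begin right after the padded metadata.
next_message = offset + size

print(size, offset, root_offset, next_message)
```

With these bytes the metadata size is 1420 and the metadata starts at offset 12, so the value 16 seen at offset 12 already belongs to the Flatbuffer and should not be skipped as a separate header.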