Hi,

Theoretically, it's defined there:
- https://arrow.apache.org/docs/ipc.html
- https://arrow.apache.org/docs/metadata.html

hth,
-s

sent from my droid

On Fri, Jan 4, 2019, 02:15 Kohei KaiGai <kai...@heterodb.com> wrote:

> Hello,
>
> I'm trying to understand the Apache Arrow format for my application.
> Is there a format specification document that includes the metadata layout?
>
> I checked the descriptions at:
> https://github.com/apache/arrow/tree/master/docs/source/format
> https://github.com/apache/arrow/tree/master/format
>
> format/IPC.rst says an Arrow file has the layout below:
>
> <magic number "ARROW1">
> <empty padding bytes [to 8 byte boundary]>
> <STREAMING FORMAT>
> <FOOTER>
> <FOOTER SIZE: int32>
> <magic number "ARROW1">
>
> The STREAMING FORMAT then begins with a SCHEMA-message.
> Each message chunk has the layout below:
>
> <metadata_size: int32>
> <metadata_flatbuffer: bytes>
> <padding>
> <message body>
>
> I made an Arrow file using pyarrow [*1]. It contains the following bytes:
>
> [kaigai@saba ~]$ cat /tmp/sample.arrow | od -Ax -t x1 | head -16
> 000000 41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
> 000010 00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
> 000020 00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
> 000030 04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00
> 000040 01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
> 000050 08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
> 000060 70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61
> 000070 6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22
> 000080 30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e
> 000090 73 22 3a 20 5b 7b 22 6d 65 74 61 64 61 74 61 22
> 0000a0 3a 20 6e 75 6c 6c 2c 20 22 6e 75 6d 70 79 5f 74
> 0000b0 79 70 65 22 3a 20 22 69 6e 74 36 34 22 2c 20 22
> 0000c0 6e 61 6d 65 22 3a 20 22 69 64 22 2c 20 22 66 69
> 0000d0 65 6c 64 5f 6e 61 6d 65 22 3a 20 22 69 64 22 2c
> 0000e0 20 22 70 61 6e 64 61 73 5f 74 79 70 65 22 3a 20
> 0000f0 22 69 6e 74 36 34 22 7d 2c 20 7b 22 6d 65 74 61
>
> The first 64 bits are "ARROW1\0\0", and the next 32 bits are 0x058c (=1420),
> which is reasonable for the SCHEMA-message length.
> The next 32 bits are 0x0010 (=16). That may be the metadata_size of the
> FlatBuffer.
> IPC.rst does not describe the FlatBuffer metadata, so I tried skipping the
> next 16 bytes, expecting the message body to begin at 0x000020.
> However, the first 16 bits (version) are 0x0001 (=V2), the next byte is 0x03
> (= RecordBatch, not Schema!), and the following 64 bits are
> 0x0a000000000010(!).
> Obviously I'm misunderstanding something.
>
> Is there any documentation that describes the detailed layout of the Arrow
> format?
>
> Thanks,
>
> [*1] Steps to make a sample arrow file
> $ python3.5
> >>> import pyarrow as pa
> >>> import pandas as pd
> >>> X = pd.read_sql(sql="SELECT * FROM hogehoge LIMIT 1000",
> ...                 con="postgresql://localhost/postgres")
> >>> Y = pa.Table.from_pandas(X)
> >>> f = pa.RecordBatchFileWriter('/tmp/sample.arrow', Y.schema)
> >>> f.write_table(Y)
> >>> f.close()
>
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei <kai...@heterodb.com>
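
For what it's worth, the quoted dump can be decoded by hand once the FlatBuffers table layout is taken into account. The sketch below is only an illustration, not the Arrow reader itself. It assumes the layout shown in the dump (metadata_size directly after the padded magic), the standard FlatBuffers encoding (a uint32 root offset, then a table whose first int32 points back to its vtable), and the field order version, header_type, header, bodyLength as I read it in Message.fbs. The helper names `field_pos` and `decode_first_message` are made up for this example.

```python
import struct

# First 48 bytes of /tmp/sample.arrow, copied from the od output quoted above.
DUMP = bytes.fromhex(
    "41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00 "
    "00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00 "
    "00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00"
)

# Enum names as I read them in Schema.fbs / Message.fbs (assumption).
METADATA_VERSION = ["V1", "V2", "V3", "V4", "V5"]
MESSAGE_HEADER = ["NONE", "Schema", "DictionaryBatch", "RecordBatch"]


def field_pos(buf, table_pos, index):
    """Absolute position of field `index` in a FlatBuffers table, or None."""
    # The table starts with a signed offset that points BACK to its vtable.
    (soffset,) = struct.unpack_from("<i", buf, table_pos)
    vtable_pos = table_pos - soffset
    (vtable_size,) = struct.unpack_from("<H", buf, vtable_pos)
    # vtable = [vtable size, table size, field 0 offset, field 1 offset, ...]
    slot = 4 + 2 * index
    if slot >= vtable_size:
        return None  # field not stored, so its default value applies
    (rel,) = struct.unpack_from("<H", buf, vtable_pos + slot)
    return table_pos + rel if rel else None


def decode_first_message(buf):
    """Decode the fixed part of the first message in an Arrow file."""
    assert buf[:6] == b"ARROW1", "not an Arrow file"
    # Magic is padded to 8 bytes; the int32 metadata_size follows.
    (metadata_size,) = struct.unpack_from("<i", buf, 8)
    fb_start = 12
    # The FlatBuffer begins with the offset of its root table, not a size.
    (root_offset,) = struct.unpack_from("<I", buf, fb_start)
    table_pos = fb_start + root_offset

    version_pos = field_pos(buf, table_pos, 0)
    type_pos = field_pos(buf, table_pos, 1)
    version = 0 if version_pos is None else struct.unpack_from("<h", buf, version_pos)[0]
    header_type = 0 if type_pos is None else buf[type_pos]
    return {
        "metadata_size": metadata_size,
        "root_table_at": table_pos,
        "version": METADATA_VERSION[version],
        "header_type": MESSAGE_HEADER[header_type],
    }


if __name__ == "__main__":
    print(decode_first_message(DUMP))
```

On the bytes above this reports a metadata_size of 1420, a root table at offset 0x1c, version V4 and header type Schema. Read this way, the 0x10 after the length is the FlatBuffers root-table offset rather than a second size field, and the 0x01 and 0x03 bytes at 0x21 and 0x22 are header_type and version, in that order.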