Re: Format specification document?

Wes McKinney Sat, 05 Jan 2019 19:56:49 -0800

hi,

On Sat, Jan 5, 2019 at 8:44 PM Kohei KaiGai <[email protected]> wrote:
>
> Hello McKinney,
>
> After the post of my first message, I could find out a significant
> documentation here:
> https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#example
>
> Contrary to my expectation, the Flatbuffers mechanism uses a quite
> different in-memory layout.
> So, let's review the Apache Arrow file binary according to that
> documentation...
>
> 000000  41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
> 000010  00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
> 000020  00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
>
> The first 8 bytes are the signature "ARROW1\0\0", and the following
> 4 bytes are the length of
> the metadata, regardless of the flatbuffer. Then, we can fetch


So the "metadata" here _is_ a Flatbuffer. Bytes 13 through 1432 are a
Flatbuffer message. There is no other data in the file representing
the schema.
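
To make the framing concrete, here is a minimal sketch (not taken from the Arrow codebase) that checks the first 16 bytes of the dump in this thread with Python's standard struct module. The helper name read_preamble is made up for illustration, and the offsets it returns are 0-based, so "bytes 13 through 1432" corresponds to the half-open range [12, 1432).

```python
import struct

# First 16 bytes of /tmp/sample.arrow, copied from the od dump in this thread.
HEAD = bytes.fromhex("41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00")


def read_preamble(buf):
    """Return (metadata_size, start, end) of the first message's metadata.

    Hypothetical helper: start/end are 0-based file offsets, end exclusive.
    """
    if buf[:6] != b"ARROW1":
        raise ValueError("missing ARROW1 magic")
    # 6-byte magic + 2 padding bytes, then a little-endian int32 size.
    (metadata_size,) = struct.unpack_from("<i", buf, 8)
    start = 12
    return metadata_size, start, start + metadata_size


size, start, end = read_preamble(HEAD)
print(size, start, end)  # 1420 12 1432
```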

> 0x0010(int) at 0x000c.
> It indicates 0x000c + 0x0010 is the root table.

I would not recommend parsing the protocol without the assistance of a
Flatbuffers implementation. It is the expectation that you will use
one of the libraries at https://github.com/google/flatbuffers or an
off-shoot project like flatcc to read the message. We make no
representations about the structure of the message beyond that it is
represented using the defined Flatbuffers schemas.

- Wes

>
> An int value at 0x001c is 0x000a. It means the vtable structure
> begins at 0x001c - 0x000a = 0x0012.
> 0x0012  0a 00  --> vtable length = 10 bytes (5 items)
> 0x0014  0e 00  --> table length = 14 bytes; including the negative
> offset (4 bytes)
> 0x0016  06 00  --> table 0x001c + 0x0006 is metadata version (short)
> 0x0018  05 00  --> table 0x001c + 0x0005 is message header (byte)
> 0x001a  08 00  --> table 0x001c + 0x0008 is header offset (int)
> 0x001c  0a 00 00 00  --> negative offset to the vtable
>
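
Purely to illustrate the arithmetic in the walk-through above (and not as a substitute for a real Flatbuffers library), the same steps can be sketched with Python's struct module. The helper names are made up here, and the slot order (version, header type, header) follows the interpretation given in the quoted message.

```python
import struct

# First 48 bytes of the file, copied from the od dump in this thread.
DUMP = bytes.fromhex(
    "41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00 "
    "00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00 "
    "00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00 "
)


def u16(buf, pos):
    return struct.unpack_from("<H", buf, pos)[0]


def i32(buf, pos):
    return struct.unpack_from("<i", buf, pos)[0]


def u32(buf, pos):
    return struct.unpack_from("<I", buf, pos)[0]


def read_vtable(buf, table):
    """Return (vtable position, list of field slots) for a table."""
    vtable = table - i32(buf, table)      # signed offset stored in the table
    vt_len = u16(buf, vtable)             # vtable size in bytes
    n_fields = (vt_len - 4) // 2          # skip the two uint16 header entries
    slots = [u16(buf, vtable + 4 + 2 * i) for i in range(n_fields)]
    return vtable, slots


root = 0x0C + u32(DUMP, 0x0C)             # root table of the Message
vtable, slots = read_vtable(DUMP, root)
version = u16(DUMP, root + slots[0])      # 3 means V4 (V1 is 0)
header_type = DUMP[root + slots[1]]       # 1 means Schema
header = root + slots[2] + u32(DUMP, root + slots[2])

print(hex(root), hex(vtable), slots)      # 0x1c 0x12 [6, 5, 8]
print(version, header_type, hex(header))  # 3 1 0x34
```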
> So, we know this file uses the Apache Arrow V4 metadata format, and the
> header begins at 0x0024 + 0x0010 = 0x0034.
>
> 000020  00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
> 000030  04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00
>
> Next, 0x0034 is the position of the current table. The int stored there
> (0x000a) indicates that its vtable is at 0x0034 - 0x000a = 0x002a.
>
> 0x002a  0a 00  --> vtable length = 10 bytes (5 items)
> 0x002c  0c 00  --> table length = 12 bytes; including the negative
> offset (4 bytes)
> 0x002e  00 00  --> Schema::endianness is default (0 = little endian)
> 0x0030  04 00  --> Schema::fields[]
> 0x0032  08 00  --> Schema::custom_metadata[]
>
> It says Schema::fields[] begins at 0x0038 + 0x03ec = 0x0424, and also says
> Schema::custom_metadata[] begins at 0x003c + 0x0004 = 0x0040.
>
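
As a rough check of the Schema table offsets above, the following sketch resolves each vtable slot to the position it points at. It is illustrative only: the buffer is zero-padded so that indices equal file offsets, the helper names are invented for this example, and a slot value of 0 is treated as "field absent, use the default".

```python
import struct

# Bytes 0x20..0x3f of the file from the od dump, left-padded with zeros
# so that buffer indices match file offsets.
BUF = bytes(0x20) + bytes.fromhex(
    "00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00 "
    "04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00 "
)


def u16(buf, pos):
    return struct.unpack_from("<H", buf, pos)[0]


def i32(buf, pos):
    return struct.unpack_from("<i", buf, pos)[0]


def u32(buf, pos):
    return struct.unpack_from("<I", buf, pos)[0]


def field_target(buf, table, slot):
    """Follow the offset stored in a table field; None if the field is absent."""
    if slot == 0:
        return None
    pos = table + slot
    return pos + u32(buf, pos)


table = 0x34                                   # Schema table, found via the header offset
vtable = table - i32(BUF, table)               # 0x2a
n_fields = (u16(BUF, vtable) - 4) // 2
slots = [u16(BUF, vtable + 4 + 2 * i) for i in range(n_fields)]

endianness = field_target(BUF, table, slots[0])        # absent, so default
fields = field_target(BUF, table, slots[1])            # 0x38 + 0x3ec
custom_metadata = field_target(BUF, table, slots[2])   # 0x3c + 0x04

print(slots, endianness, hex(fields), hex(custom_metadata))
# [0, 4, 8] None 0x424 0x40
```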
> From 0x0040:
> 000040  01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
> 000050  08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
> 000060  70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61
> 000070  6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22
> 000080  30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e
> 000090  73 22 3a 20 5b 7b 22 6d 65 74 61 64 61 74 61 22
> 0000a0  3a 20 6e 75 6c 6c 2c 20 22 6e 75 6d 70 79 5f 74
> 0000b0  79 70 65 22 3a 20 22 69 6e 74 36 34 22 2c 20 22
> 0000c0  6e 61 6d 65 22 3a 20 22 69 64 22 2c 20 22 66 69
> 0000d0  65 6c 64 5f 6e 61 6d 65 22 3a 20 22 69 64 22 2c
>
> The binary from 0x0060 is a cstring ("pandas\0"), and the binary from
> 0x006c is also a cstring of JSON.
>
> The int at 0x0040 holds the number of vector elements (1).
> So, this metadata contains one key-value pair.
> The next int (0x000c at 0x0044) points to the sub-table at 0x0050. Its
> vtable is below:
> 0x0048  08 00  --> vtable length = 8 bytes (4 items)
> 0x004a  0c 00  --> table length  = 12 bytes; including the negative
> offset (4 bytes)
> 0x004c  04 00  --> cstring offset (key) is at 0x0050 + 0x0004
> 0x004e  08 00  --> cstring offset (value) is at 0x0050 + 0x0008
>
> Key is at 0x0054 + 0x0008 = 0x005c. There is an int value there: 0x0006.
> It means the cstring length is 6 bytes, and the cstring body ("pandas\0")
> begins at the next byte (0x0060).
> Value is at 0x0058 + 0x0010 = 0x0068. There is an int value there: 0x03b4
> (= 948 bytes), and the cstring body ({"pandas_version" ... ) begins at the
> next byte (0x006c).
>
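
The key-value lookup above can be sketched the same way. This is only an illustration of the offsets in the quoted message: the buffer holds bytes 0x40 through 0x8f of the dump (zero-padded so indices equal file offsets), so the 948-byte value is necessarily truncated, and the helper names are made up.

```python
import struct

# Bytes 0x40..0x8f of the file from the od dump, left-padded with zeros
# so that buffer indices match file offsets.
BUF = bytes(0x40) + bytes.fromhex(
    "01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00 "
    "08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00 "
    "70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61 "
    "6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22 "
    "30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e "
)


def u16(buf, pos):
    return struct.unpack_from("<H", buf, pos)[0]


def i32(buf, pos):
    return struct.unpack_from("<i", buf, pos)[0]


def u32(buf, pos):
    return struct.unpack_from("<I", buf, pos)[0]


def follow(buf, pos):
    """Resolve an unsigned offset that is relative to its own position."""
    return pos + u32(buf, pos)


def read_string(buf, pos):
    """Return (declared length, decoded bytes available in this buffer)."""
    length = u32(buf, pos)
    body = buf[pos + 4 : pos + 4 + length]   # shorter if the buffer is truncated
    return length, body.decode("utf-8", errors="replace")


vector = 0x40
count = u32(BUF, vector)                     # number of key-value pairs
entry = follow(BUF, vector + 4)              # first (only) KeyValue table
vtable = entry - i32(BUF, entry)
key_slot = u16(BUF, vtable + 4)
value_slot = u16(BUF, vtable + 6)

key_len, key = read_string(BUF, follow(BUF, entry + key_slot))
value_len, value_head = read_string(BUF, follow(BUF, entry + value_slot))

print(count, hex(entry), hex(vtable))        # 1 0x50 0x48
print(key_len, key)                          # 6 pandas
print(value_len, value_head)                 # 948 {"pandas_version": "0.22.0", "column
```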
>
> I didn't follow the entire data file; however, the format is much clearer
> to me now.
> Best regards,
>
> On Sun, Jan 6, 2019 at 8:50, Wes McKinney <[email protected]> wrote:
> >
> > hi Kohei,
> >
> > On Thu, Jan 3, 2019 at 7:14 PM Kohei KaiGai <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > > I'm now trying to understand the Apache Arrow format for my application.
> > > Is there a format specification document including meta-data layout?
> > >
> > > I checked out the description at:
> > > https://github.com/apache/arrow/tree/master/docs/source/format
> > > https://github.com/apache/arrow/tree/master/format
> > >
> > > The format/IPC.rst says an arrow file has the format below:
> > >
> > > <magic number "ARROW1">
> > > <empty padding bytes [to 8 byte boundary]>
> > > <STREAMING FORMAT>
> > > <FOOTER>
> > > <FOOTER SIZE: int32>
> > > <magic number "ARROW1">
> > >
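
For the tail of the layout listed above, a small sketch of how the footer could be located from the end of the file, assuming only what the listing states (footer, then an int32 footer size, then the trailing "ARROW1" magic) and a little-endian size. The function name footer_span and the synthetic bytes are invented for illustration; the footer itself is a Flatbuffer and is not decoded here.

```python
import struct

MAGIC = b"ARROW1"


def footer_span(data):
    """Return (start, end) offsets of the footer, end exclusive.

    Hypothetical helper: relies only on the trailing
    <FOOTER><FOOTER SIZE: int32><magic> layout described above.
    """
    if data[-len(MAGIC):] != MAGIC:
        raise ValueError("missing trailing ARROW1 magic")
    size_pos = len(data) - len(MAGIC) - 4
    (footer_size,) = struct.unpack_from("<i", data, size_pos)
    return size_pos - footer_size, size_pos


# Synthetic stand-in for a file: the stream and footer bytes are dummies.
fake = (
    MAGIC + b"\x00\x00"          # leading magic padded to 8 bytes
    + b"S" * 16                  # placeholder for the streaming format
    + b"F" * 24                  # placeholder for the footer
    + struct.pack("<i", 24)      # footer size
    + MAGIC                      # trailing magic
)
print(footer_span(fake))  # (24, 48)
```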
> > > Then, STREAMING FORMAT begins from SCHEMA-message.
> > > The message chunk has the format below:
> > >
> > > <metadata_size: int32>
> > > <metadata_flatbuffer: bytes>
> > > <padding>
> > > <message body>
> > >
> > > I made an arrow file using pyarrow [*1]. It has the following binary.
> > >
> > > [kaigai@saba ~]$ cat /tmp/sample.arrow | od -Ax -t x1 | head -16
> > > 000000  41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
> > > 000010  00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
> > > 000020  00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
> > > 000030  04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00
> > > 000040  01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
> > > 000050  08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
> > > 000060  70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61
> > > 000070  6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22
> > > 000080  30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e
> > > 000090  73 22 3a 20 5b 7b 22 6d 65 74 61 64 61 74 61 22
> > > 0000a0  3a 20 6e 75 6c 6c 2c 20 22 6e 75 6d 70 79 5f 74
> > > 0000b0  79 70 65 22 3a 20 22 69 6e 74 36 34 22 2c 20 22
> > > 0000c0  6e 61 6d 65 22 3a 20 22 69 64 22 2c 20 22 66 69
> > > 0000d0  65 6c 64 5f 6e 61 6d 65 22 3a 20 22 69 64 22 2c
> > > 0000e0  20 22 70 61 6e 64 61 73 5f 74 79 70 65 22 3a 20
> > > 0000f0  22 69 6e 74 36 34 22 7d 2c 20 7b 22 6d 65 74 61
> > >
> > > The first 64 bits are "ARROW1\0\0", and the next 32 bits are 0x058c
> > > (=1420), which is reasonable for the SCHEMA-message length.
> > > The next 32 bits are 0x0010 (=16). It may be the metadata_size of the
> > > FlatBuffer.
> > > IPC.rst does not mention FlatBuffer metadata, so I tried to skip the
> > > next 16 bytes, expecting the message body to begin at 0x000020.
> >
> > The Schema message has no message body -- it is all in the metadata
> > (i.e. as a Flatbuffer). Take a look at the C++ implementation
> >
> > * File preamble plus padding:
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L946
> > * Write schema
> >    * from here 
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L790
> >    * to here 
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L939
> >
> > The flatbuffer size 1420 is written here
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L954
> >
> > followed by the Schema Flatbuffer message
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L957
> >
> > followed by any padding bytes
> >
> > Thus, the file layout should look like this
> >
> > 8 bytes: preamble
> > BEGIN MESSAGE 1 (Schema)
> > 4 bytes: metadata size
> > 1420 bytes: metadata (as described in Message.fbs) plus padding
> > 0 bytes: body (Schema has no body)
> >
> > I would be happy to clarify the specification document to make this
> > more clear if you can suggest some improvements.
> >
> > - Wes
> >
> > > However, the first 16 bits (version) are 0x0001 (=V2), the next byte is
> > > 0x03 (= RecordBatch, not Schema!), and the following 64 bits are
> > > 0x0a000000000010(!).
> > > Obviously, I am misunderstanding something.
> > >
> > > Is there any documentation that describes the detailed layout of the
> > > Arrow format?
> > >
> > > Thanks,
> > >
> > > [*1] Steps to make a sample arrow file
> > > $ python3.5
> > > >>> import pyarrow as pa
> > > >>> import pandas as pd
> > > >>> X = pd.read_sql(sql="SELECT * FROM hogehoge LIMIT 1000",
> > > ...                 con="postgresql://localhost/postgres")
> > > >>> Y = pa.Table.from_pandas(X)
> > > >>> f = pa.RecordBatchFileWriter('/tmp/sample.arrow', Y.schema)
> > > >>> f.write_table(Y)
> > > >>> f.close()
> > >
> > > --
> > > HeteroDB, Inc / The PG-Strom Project
> > > KaiGai Kohei <[email protected]>
>
>
>
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei <[email protected]>
