Hi Kohei,

On Thu, Jan 3, 2019 at 7:14 PM Kohei KaiGai <kai...@heterodb.com> wrote:
>
> Hello,
>
> I'm now trying to understand the Apache Arrow format for my application.
> Is there a format specification document including meta-data layout?
>
> I checked out the description at:
> https://github.com/apache/arrow/tree/master/docs/source/format
> https://github.com/apache/arrow/tree/master/format
>
> The format/IPC.rst says an arrow file has the format below:
>
> <magic number "ARROW1">
> <empty padding bytes [to 8 byte boundary]>
> <STREAMING FORMAT>
> <FOOTER>
> <FOOTER SIZE: int32>
> <magic number "ARROW1">
>
> Then, STREAMING FORMAT begins from SCHEMA-message.
> The message chunk has the format below:
>
> <metadata_size: int32>
> <metadata_flatbuffer: bytes>
> <padding>
> <message body>
>
> I made an arrow file using pyarrow [*1]. It has the following binary.
>
> [kaigai@saba ~]$ cat /tmp/sample.arrow | od -Ax -t x1 | head -16
> 000000  41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
> 000010  00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
> 000020  00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
> 000030  04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00
> 000040  01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
> 000050  08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
> 000060  70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61
> 000070  6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22
> 000080  30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e
> 000090  73 22 3a 20 5b 7b 22 6d 65 74 61 64 61 74 61 22
> 0000a0  3a 20 6e 75 6c 6c 2c 20 22 6e 75 6d 70 79 5f 74
> 0000b0  79 70 65 22 3a 20 22 69 6e 74 36 34 22 2c 20 22
> 0000c0  6e 61 6d 65 22 3a 20 22 69 64 22 2c 20 22 66 69
> 0000d0  65 6c 64 5f 6e 61 6d 65 22 3a 20 22 69 64 22 2c
> 0000e0  20 22 70 61 6e 64 61 73 5f 74 79 70 65 22 3a 20
> 0000f0  22 69 6e 74 36 34 22 7d 2c 20 7b 22 6d 65 74 61
>
> The first 64 bits are "ARROW1\0\0", and the next 32 bits are 0x058c (=1420),
> which is reasonable for the SCHEMA-message length.
> The next 32 bits are 0x0010 (=16). It may be the metadata_size of the FlatBuffer.
> The IPC.rst does not mention the FlatBuffer metadata, so I tried to skip
> the next 16 bytes, expecting the message body to begin at 0x000020.

The Schema message has no message body -- it is all in the metadata
(i.e. as a Flatbuffer). Take a look at the C++ implementation:

* File preamble plus padding:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L946
* Write schema
   * from here 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L790
   * to here 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L939

The flatbuffer size 1420 is written here

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L954

followed by the Schema Flatbuffer message

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L957

followed by any padding bytes

Thus, the file layout should look like this

8 bytes: preamble
BEGIN MESSAGE 1 (Schema)
4 bytes: metadata size
1420 bytes: metadata (as described in Message.fbs) plus padding
0 bytes: body (Schema has no body)
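
As a rough illustration of that layout, here is a minimal Python sketch
that walks the first 16 bytes of your hexdump using only the struct
module. The function name first_message_offsets is hypothetical, and the
sketch assumes little-endian integers and that the 1420-byte size already
includes the padding, as in the layout above.

```python
import struct

# First 16 bytes of /tmp/sample.arrow, copied from the od output in this thread.
HEAD = bytes.fromhex("4152524f57310000" "8c050000" "10000000")


def first_message_offsets(buf):
    """Return (metadata_size, metadata_start, next_message_start).

    Hypothetical helper for illustration; not part of pyarrow.
    """
    # 8-byte preamble: the magic string padded to an 8-byte boundary.
    if buf[:6] != b"ARROW1":
        raise ValueError("not an Arrow file")
    # 4-byte little-endian int32: size of the Flatbuffer metadata plus padding.
    (metadata_size,) = struct.unpack_from("<i", buf, 8)
    metadata_start = 8 + 4
    # A Schema message has no body, so the next message follows immediately.
    next_message_start = metadata_start + metadata_size
    return metadata_size, metadata_start, next_message_start


print(first_message_offsets(HEAD))  # (1420, 12, 1432)
```

Under these assumptions the 0x10 at offset 12 is not a second size field.
It is the first word of the Flatbuffer itself (the offset to its root
table), and the next message should start at byte 1432.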

I would be happy to clarify the specification document if you can
suggest some improvements.

- Wes

> However, the first 16 bits (version) are 0x0001 (=V2), the next byte is 0x03
> (= RecordBatch, not Schema!), and the following 64 bits are 0x0a000000000010(!).
> Obviously, I'm misunderstanding something.
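
The bytes around 0x20 are easier to read if they are decoded as a
Flatbuffer table rather than as a fixed struct. The sketch below does
this by hand for the bytes 0x0c..0x27 of the hexdump. It is only an
illustration: decode_message_header is a hypothetical helper, and the
slot order (version, header_type, header) and the enum names are my
assumptions based on the declaration order in Message.fbs and Schema.fbs.

```python
import struct

# Bytes 0x0c..0x27 of the hexdump: the start of the Flatbuffer metadata.
FB = bytes.fromhex(
    "10000000"              # 0x0c: offset from here to the root table
    "0000"                  # 0x10: padding
    "0a000e00060005000800"  # 0x12: vtable (size, table size, 3 field slots)
    "0a000000"              # 0x1c: root table; starts with offset back to vtable
    "00010300"              # 0x20: padding, header_type, version
    "10000000"              # 0x24: offset to the header (Schema) table
)

# Assumed enum values, following the order in Schema.fbs / Message.fbs.
VERSIONS = ["V1", "V2", "V3", "V4"]
HEADER_TYPES = ["NONE", "Schema", "DictionaryBatch", "RecordBatch"]


def decode_message_header(fb):
    """Return (version, header_type) of the root Message table."""
    (root,) = struct.unpack_from("<I", fb, 0)
    # The table begins with a signed offset that is subtracted to find the vtable.
    (soffset,) = struct.unpack_from("<i", fb, root)
    vtable = root - soffset
    vtable_size, _table_size = struct.unpack_from("<HH", fb, vtable)
    n_slots = (vtable_size - 4) // 2
    slots = struct.unpack_from("<%dH" % n_slots, fb, vtable + 4)
    # Assumed slots: 0 = version (int16), 1 = header_type (uint8).
    # A slot value of 0 would mean "field absent"; that case is not handled here.
    (version,) = struct.unpack_from("<h", fb, root + slots[0])
    (header_type,) = struct.unpack_from("<B", fb, root + slots[1])
    return version, header_type


version, header_type = decode_message_header(FB)
print(VERSIONS[version], HEADER_TYPES[header_type])  # V4 Schema
```

Read this way, the 0x01 at 0x21 is the header type (Schema) and the
0x0003 at 0x22 is the version (V4), so the file does start with a Schema
message.
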
>
> Is there documentation that describes the detailed layout of the Arrow format?
>
> Thanks,
>
> [*1] Steps to make a sample arrow file
> $ python3.5
> >>> import pyarrow as pa
> >>> import pandas as pd
> >>> X = pd.read_sql(sql="SELECT * FROM hogehoge LIMIT 1000",
> ...                 con="postgresql://localhost/postgres")
> >>> Y = pa.Table.from_pandas(X)
> >>> f = pa.RecordBatchFileWriter('/tmp/sample.arrow', Y.schema)
> >>> f.write_table(Y)
> >>> f.close()
>
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei <kai...@heterodb.com>
