Hi,

In theory, it's defined here:

- https://arrow.apache.org/docs/ipc.html
- https://arrow.apache.org/docs/metadata.html

hth,
-s


On Fri, Jan 4, 2019, 02:15 Kohei KaiGai <kai...@heterodb.com> wrote:

> Hello,
>
> I'm now trying to understand the Apache Arrow format for my application.
> Is there a format specification document including meta-data layout?
>
> I checked out the description at:
> https://github.com/apache/arrow/tree/master/docs/source/format
> https://github.com/apache/arrow/tree/master/format
>
> The format/IPC.rst says an arrow file has the format below:
>
> <magic number "ARROW1">
> <empty padding bytes [to 8 byte boundary]>
> <STREAMING FORMAT>
> <FOOTER>
> <FOOTER SIZE: int32>
> <magic number "ARROW1">
>
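
A rough sketch of how that outer framing could be checked by hand, assuming the layout quoted above is accurate and that the int32 is little-endian. `locate_footer` is a made-up helper name for illustration, not part of pyarrow:

```python
import struct

MAGIC = b"ARROW1"


def locate_footer(buf: bytes):
    """Return (stream_start, footer_start, footer_size) for an Arrow file image.

    Hypothetical helper; it only checks the outer framing and does not
    decode the footer itself.
    """
    if buf[:6] != MAGIC or buf[-6:] != MAGIC:
        raise ValueError("missing ARROW1 magic at start or end of file")
    # The int32 footer size sits immediately before the trailing magic.
    size_pos = len(buf) - len(MAGIC) - 4
    (footer_size,) = struct.unpack_from("<i", buf, size_pos)
    footer_start = size_pos - footer_size
    # The leading 6-byte magic is padded to an 8-byte boundary.
    stream_start = 8
    if footer_size < 0 or footer_start < stream_start:
        raise ValueError("implausible footer size: %d" % footer_size)
    return stream_start, footer_start, footer_size
```

Feeding it the whole content of /tmp/sample.arrow (read in binary mode) should give the byte range of the streaming part and of the footer.
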
> The STREAMING FORMAT then begins with a SCHEMA message.
> The message chunk has the format below:
>
> <metadata_size: int32>
> <metadata_flatbuffer: bytes>
> <padding>
> <message body>
>
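
As a small worked example of that message framing, here is a sketch that computes where the next message would start. It assumes (as I read the IPC document) that metadata_size already includes the padding, and that the body length comes from the decoded metadata. `next_message_offset` is a hypothetical name, not a pyarrow function:

```python
import struct


def next_message_offset(buf: bytes, pos: int, body_length: int) -> int:
    """Offset of the message that follows the one starting at `pos`.

    Assumption: metadata_size (little-endian int32) counts the FlatBuffer
    plus its padding, so no extra rounding is needed here.
    """
    (metadata_size,) = struct.unpack_from("<i", buf, pos)
    body_start = pos + 4 + metadata_size
    return body_start + body_length


# First 12 bytes of the dump quoted below: magic, padding, metadata_size.
head = bytes.fromhex("41 52 52 4f 57 31 00 00 8c 05 00 00")
# A Schema message carries no body, hence body_length = 0.
print(hex(next_message_offset(head, 8, 0)))
```

With the numbers from your dump this comes out as 8 + 4 + 1420 = 1432 (0x598), which is 8-byte aligned.
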
> I made an arrow file using pyarrow [*1]. Its binary content begins as follows.
>
> [kaigai@saba ~]$ cat /tmp/sample.arrow | od -Ax -t x1 | head -16
> 000000  41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
> 000010  00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
> 000020  00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
> 000030  04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00
> 000040  01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
> 000050  08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
> 000060  70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61
> 000070  6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22
> 000080  30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e
> 000090  73 22 3a 20 5b 7b 22 6d 65 74 61 64 61 74 61 22
> 0000a0  3a 20 6e 75 6c 6c 2c 20 22 6e 75 6d 70 79 5f 74
> 0000b0  79 70 65 22 3a 20 22 69 6e 74 36 34 22 2c 20 22
> 0000c0  6e 61 6d 65 22 3a 20 22 69 64 22 2c 20 22 66 69
> 0000d0  65 6c 64 5f 6e 61 6d 65 22 3a 20 22 69 64 22 2c
> 0000e0  20 22 70 61 6e 64 61 73 5f 74 79 70 65 22 3a 20
> 0000f0  22 69 6e 74 36 34 22 7d 2c 20 7b 22 6d 65 74 61
>
> The first 64 bits are "ARROW1\0\0", and the next 32 bits are 0x058c (=1420),
> which is reasonable for the SCHEMA-message length.
> The next 32 bits are 0x0010 (=16). It may be the metadata_size of the FlatBuffer.
> IPC.rst does not describe the FlatBuffer metadata, so I tried to skip the
> next 16 bytes, expecting the message body to begin at 0x000020.
> However, the first 16 bits (version) are 0x0001 (=V2), the next byte is 0x03
> (= RecordBatch, not Schema!), and the following 64 bits are
> 0x0a000000000010(!).
> Obviously, I am misunderstanding something.
>
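
For what it's worth, the 0x10 at offset 0x0c looks like the FlatBuffers root-table offset rather than a size: a FlatBuffer starts with a uint32 offset to its root table, and the table in turn points back to a vtable listing where each field lives. Below is a sketch that walks the first 48 bytes of your dump that way. The field order (version, header_type, header, bodyLength) and the enum numbering are my reading of Message.fbs and Schema.fbs, so treat them as assumptions:

```python
import struct

# First 48 bytes of /tmp/sample.arrow, copied from the od output above.
dump = bytes.fromhex(
    "41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00 "
    "00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00 "
    "00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00 "
)

(metadata_size,) = struct.unpack_from("<i", dump, 8)   # 1420
fb_start = 12                                          # FlatBuffer begins here

# A FlatBuffer opens with a uint32 offset to the root table.
(root_off,) = struct.unpack_from("<I", dump, fb_start)
table = fb_start + root_off                            # 0x1c

# The table opens with a signed offset pointing back to its vtable.
(vt_soffset,) = struct.unpack_from("<i", dump, table)
vtable = table - vt_soffset                            # 0x12

# vtable: uint16 vtable size, uint16 table size, then one uint16 per field.
vt_size, tbl_size = struct.unpack_from("<HH", dump, vtable)
n_fields = (vt_size - 4) // 2
field_offs = struct.unpack_from("<%dH" % n_fields, dump, vtable + 4)

# Assumed field order: version, header_type, header (bodyLength is absent here).
(version,) = struct.unpack_from("<h", dump, table + field_offs[0])
header_type = dump[table + field_offs[1]]
(header_rel,) = struct.unpack_from("<I", dump, table + field_offs[2])
header_pos = table + field_offs[2] + header_rel

print("version enum value:", version)        # 3; V4 if the enum starts at V1 = 0
print("header_type:", header_type)           # 1; Schema if NONE = 0, Schema = 1
print("header table at:", hex(header_pos))   # 0x34
```

Read this way, the bytes at 0x20 are not a message body: 0x21 holds the header type (1) and 0x22 holds the version (3), so the first message does look like a Schema message after all.
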
> Is there any documentation that describes the detailed layout of the Arrow
> format?
>
> Thanks,
>
> [*1] Steps to make a sample arrow file
> $ python3.5
> >>> import pyarrow as pa
> >>> import pandas as pd
> >>> X = pd.read_sql(sql="SELECT * FROM hogehoge LIMIT 1000",
> con="postgresql://localhost/postgres")
> >>> Y = pa.Table.from_pandas(X)
> >>> f = pa.RecordBatchFileWriter('/tmp/sample.arrow', Y.schema)
> >>> f.write_table(Y)
> >>> f.close()
>
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei <kai...@heterodb.com>
>
