unsubscribe
On Sat, Jan 5, 2019 at 6:44 PM Kohei KaiGai <kai...@heterodb.com> wrote:

> Hello McKinney,
>
> After posting my first message, I found an important piece of
> documentation here:
>
> https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#example
>
> Contrary to my expectation, the in-memory image of a FlatBuffer has a
> quite different structure.
> So, let's review the Apache Arrow file binary according to that
> documentation...
>
> 000000 41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
> 000010 00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
> 000020 00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
>
> The first 8 bytes are the signature "ARROW1\0\0", and the following
> 4 bytes are the length of the metadata, which is separate from the
> FlatBuffer itself. Then we can fetch 0x0010 (int) at 0x000c.
> It indicates that the root table is at 0x000c + 0x0010 = 0x001c.
>
> The int value at 0x001c is 0x000a. It means the vtable structure
> begins at 0x001c - 0x000a = 0x0012.
> 0x0012 0a 00 --> vtable length = 10 bytes (5 items)
> 0x0014 0e 00 --> table length = 14 bytes; including the negative offset (4 bytes)
> 0x0016 06 00 --> table 0x001c + 0x0006 is metadata version (short)
> 0x0018 05 00 --> table 0x001c + 0x0005 is message header type (byte)
> 0x001a 08 00 --> table 0x001c + 0x0008 is header offset (int)
> 0x001c 0a 00 00 00 --> negative offset to the vtable
>
> So, we know this file contains the Apache Arrow V4 format, and the
> header begins at 0x0024 + 0x0010 = 0x0034.
>
> 000020 00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
> 000030 04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00
>
> Next, 0x0034 is the position of the current table. It indicates that
> the vtable is at 0x0034 - 0x000a = 0x002a.
>
> 0x002a 0a 00 --> vtable length = 10 bytes (5 items)
> 0x002c 0c 00 --> table length = 12 bytes; including the negative offset (4 bytes)
> 0x002e 00 00 --> Schema::endianness is default (0 = little endian)
> 0x0030 04 00 --> Schema::fields[]
> 0x0032 08 00 --> Schema::custom_metadata[]
>
> It says Schema::fields[] begins at 0x0038 + 0x03ec = 0x0424, and also says
> Schema::custom_metadata[] begins at 0x003c + 0x0004 = 0x0040.
>
> From 0x0040:
> 000040 01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
> 000050 08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
> 000060 70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61
> 000070 6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22
> 000080 30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e
> 000090 73 22 3a 20 5b 7b 22 6d 65 74 61 64 61 74 61 22
> 0000a0 3a 20 6e 75 6c 6c 2c 20 22 6e 75 6d 70 79 5f 74
> 0000b0 79 70 65 22 3a 20 22 69 6e 74 36 34 22 2c 20 22
> 0000c0 6e 61 6d 65 22 3a 20 22 69 64 22 2c 20 22 66 69
> 0000d0 65 6c 64 5f 6e 61 6d 65 22 3a 20 22 69 64 22 2c
>
> The binary from 0x0060 is a cstring ("pandas\0"), and the binary from
> 0x006c is also a cstring, containing JSON.
>
> The location 0x0040 holds the number of vector elements.
> So, this metadata contains one key-value pair.
> The next int word indicates the sub-table at 0x0050. Its vtable is below:
> 0x0048 08 00 --> vtable length = 8 bytes (4 items)
> 0x004a 0c 00 --> table length = 12 bytes; including the negative offset (4 bytes)
> 0x004c 04 00 --> cstring offset (key) is at 0x0050 + 0x0004
> 0x004e 08 00 --> cstring offset (value) is at 0x0050 + 0x0008
>
> The key is at 0x0054 + 0x0008 = 0x005c. The int value there is 0x0006,
> which means the cstring length is 6 bytes, and the cstring body
> ("pandas\0") begins right after it, at 0x0060.
> The value is at 0x0058 + 0x0010 = 0x0068. The int value there is 0x03b4
> (= 948 bytes), and the cstring body ("{"pandas_version ... ) begins
> right after it, at 0x006c.
>
>
> I didn't follow the entire data file; however, this makes things much
> clearer to me.
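
The walk above can be replayed mechanically. Below is a minimal Python sketch, not taken from the Arrow code base, that decodes the first 0x70 bytes of the dump with the standard `struct` module. The helper names (`field_pos`, `follow`, `decode_prefix`) are made up for illustration, and the field order used (Message: version, header type, header; Schema: endianness, fields, custom_metadata; KeyValue: key, value) is the one read off the vtables in this thread.

```python
import struct

# First 0x70 bytes of /tmp/sample.arrow, copied from the hex dump in this thread.
DUMP = bytes.fromhex(
    "41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00 "
    "00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00 "
    "00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00 "
    "04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00 "
    "01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00 "
    "08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00 "
    "70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61"
)


def u16(buf, pos):
    return struct.unpack_from("<H", buf, pos)[0]


def u32(buf, pos):
    return struct.unpack_from("<I", buf, pos)[0]


def i32(buf, pos):
    return struct.unpack_from("<i", buf, pos)[0]


def field_pos(buf, table, index):
    """Absolute position of field number `index` of `table`, or None if absent."""
    vtable = table - i32(buf, table)      # signed offset stored at the table start
    slot = 4 + 2 * index                  # skip vtable length and table length
    if slot >= u16(buf, vtable):          # vtable too short: field not present
        return None
    rel = u16(buf, vtable + slot)
    return table + rel if rel else None   # 0 means "use the default value"


def follow(buf, pos):
    """Resolve the unsigned 32-bit relative offset stored at `pos`."""
    return pos + u32(buf, pos)


def decode_prefix(buf):
    assert buf[:6] == b"ARROW1"
    out = {"metadata_size": i32(buf, 8)}

    # Message table (assumed field order: version, header type, header).
    root = follow(buf, 12)
    out["root_table"] = root
    out["version"] = u16(buf, field_pos(buf, root, 0))
    out["header_type"] = buf[field_pos(buf, root, 1)]

    # Schema table (assumed field order: endianness, fields, custom_metadata).
    schema = follow(buf, field_pos(buf, root, 2))
    out["schema_table"] = schema
    out["endianness_present"] = field_pos(buf, schema, 0) is not None
    out["fields_vector"] = follow(buf, field_pos(buf, schema, 1))
    meta_vec = follow(buf, field_pos(buf, schema, 2))
    out["custom_metadata_vector"] = meta_vec
    out["custom_metadata_count"] = u32(buf, meta_vec)

    # First KeyValue table (assumed field order: key, value).
    kv = follow(buf, meta_vec + 4)
    key = follow(buf, field_pos(buf, kv, 0))
    value = follow(buf, field_pos(buf, kv, 1))
    key_len = u32(buf, key)
    out["key"] = buf[key + 4:key + 4 + key_len].decode("utf-8")
    out["value_length"] = u32(buf, value)
    return out


if __name__ == "__main__":
    for name, val in decode_prefix(DUMP).items():
        print(name, hex(val) if isinstance(val, int) and not isinstance(val, bool) else val)
```

The sketch only touches the bytes quoted in this message, so it stops at the length of the JSON value and does not follow Schema::fields[] at 0x0424.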
> Best regards,
>
> On Sun, Jan 6, 2019 at 8:50, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > hi Kohei,
> >
> > On Thu, Jan 3, 2019 at 7:14 PM Kohei KaiGai <kai...@heterodb.com> wrote:
> > >
> > > Hello,
> > >
> > > I'm now trying to understand the Apache Arrow format for my application.
> > > Is there a format specification document that includes the metadata layout?
> > >
> > > I checked out the description at:
> > > https://github.com/apache/arrow/tree/master/docs/source/format
> > > https://github.com/apache/arrow/tree/master/format
> > >
> > > The format/IPC.rst says an arrow file has the format below:
> > >
> > > <magic number "ARROW1">
> > > <empty padding bytes [to 8 byte boundary]>
> > > <STREAMING FORMAT>
> > > <FOOTER>
> > > <FOOTER SIZE: int32>
> > > <magic number "ARROW1">
> > >
> > > Then, STREAMING FORMAT begins with a SCHEMA-message.
> > > The message chunk has the format below:
> > >
> > > <metadata_size: int32>
> > > <metadata_flatbuffer: bytes>
> > > <padding>
> > > <message body>
> > >
> > > I made an arrow file using pyarrow [*1]. It has the following binary.
> > >
> > > [kaigai@saba ~]$ cat /tmp/sample.arrow | od -Ax -t x1 | head -16
> > > 000000 41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
> > > 000010 00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
> > > 000020 00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
> > > 000030 04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00
> > > 000040 01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
> > > 000050 08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
> > > 000060 70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61
> > > 000070 6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22
> > > 000080 30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e
> > > 000090 73 22 3a 20 5b 7b 22 6d 65 74 61 64 61 74 61 22
> > > 0000a0 3a 20 6e 75 6c 6c 2c 20 22 6e 75 6d 70 79 5f 74
> > > 0000b0 79 70 65 22 3a 20 22 69 6e 74 36 34 22 2c 20 22
> > > 0000c0 6e 61 6d 65 22 3a 20 22 69 64 22 2c 20 22 66 69
> > > 0000d0 65 6c 64 5f 6e 61 6d 65 22 3a 20 22 69 64 22 2c
> > > 0000e0 20 22 70 61 6e 64 61 73 5f 74 79 70 65 22 3a 20
> > > 0000f0 22 69 6e 74 36 34 22 7d 2c 20 7b 22 6d 65 74 61
> > >
> > > The first 64 bits are "ARROW1\0\0", and the next 32 bits are 0x058c (=1420),
> > > which is reasonable for the SCHEMA-message length.
> > > The next 32 bits are 0x0010 (=16). It may be the metadata_size of the FlatBuffer.
> > > IPC.rst does not mention the FlatBuffer metadata, so I tried to skip the
> > > next 16 bytes, expecting the message body to begin at 0x000020.
> >
> > The Schema message has no message body -- it is all in the metadata
> > (i.e. as a Flatbuffer).
> > Take a look at the C++ implementation
> >
> > * File preamble plus padding:
> >   https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L946
> > * Write schema
> >   * from here
> >     https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L790
> >   * to here
> >     https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L939
> >
> > The flatbuffer size 1420 is written here
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L954
> >
> > followed by the Schema Flatbuffer message
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L957
> >
> > followed by any padding bytes
> >
> > Thus, the file layout should look like this
> >
> > 8 bytes: preamble
> > BEGIN MESSAGE 1 (Schema)
> > 4 bytes: metadata size
> > 1420 bytes: metadata (as described in Message.fbs) plus padding
> > 0 bytes: body (Schema has no body)
> >
> > I would be happy to clarify the specification document to make this
> > more clear if you can suggest some improvements.
> >
> > - Wes
> >
> > > However, the first 16 bits (version) are 0x0001 (=V2), the next byte is 0x03
> > > (= RecordBatch, not Schema!), and the following 64 bits are 0x0a000000000010(!).
> > > Obviously, I'm misunderstanding something.
> > >
> > > Is there any documentation that describes the detailed layout of the Arrow format?
> > >
> > > Thanks,
> > >
> > > [*1] Steps to make a sample arrow file
> > > $ python3.5
> > > >>> import pyarrow as pa
> > > >>> import pandas as pd
> > > >>> X = pd.read_sql(sql="SELECT * FROM hogehoge LIMIT 1000", con="postgresql://localhost/postgres")
> > > >>> Y = pa.Table.from_pandas(X)
> > > >>> f = pa.RecordBatchFileWriter('/tmp/sample.arrow', Y.schema)
> > > >>> f.write_table(Y)
> > > >>> f.close()
> > >
> > > --
> > > HeteroDB, Inc / The PG-Strom Project
> > > KaiGai Kohei <kai...@heterodb.com>
> >
>
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei <kai...@heterodb.com>
>
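
The file layout Wes lists earlier in the thread (8-byte preamble, 4-byte metadata size, then the metadata FlatBuffer) can be checked with a few lines of Python. This is a rough sketch under that layout only. The function name `read_first_message_metadata` is hypothetical, and the demo input is a synthetic byte string, not a real Arrow file. Later Arrow releases put a 0xFFFFFFFF continuation marker in front of the size, which this sketch rejects rather than handles.

```python
import io
import struct

MAGIC = b"ARROW1"


def read_first_message_metadata(stream):
    """Return the metadata bytes of the first message in an Arrow file.

    Assumes the layout described in this thread:
      8 bytes  preamble ("ARROW1" padded to an 8-byte boundary)
      4 bytes  metadata size (little-endian int32)
      N bytes  metadata (FlatBuffer plus padding)
    """
    preamble = stream.read(8)
    if len(preamble) != 8 or preamble[:6] != MAGIC:
        raise ValueError("not an Arrow file: bad magic")

    raw_size = stream.read(4)
    if len(raw_size) != 4:
        raise ValueError("truncated file: no metadata size")
    (size,) = struct.unpack("<i", raw_size)
    if size < 0:
        # 0xFFFFFFFF reads as -1; newer files use it as a continuation marker.
        raise ValueError("unsupported layout: negative metadata size")

    metadata = stream.read(size)
    if len(metadata) != size:
        raise ValueError("truncated file: metadata shorter than declared")
    return metadata


# Synthetic stand-in for a file: preamble, size = 4, then 4 bytes of "metadata".
fake_file = MAGIC + b"\x00\x00" + struct.pack("<i", 4) + b"meta"
demo_metadata = read_first_message_metadata(io.BytesIO(fake_file))
print(len(demo_metadata), "metadata bytes")
```

For the sample file in this thread one would pass `open('/tmp/sample.arrow', 'rb')` instead of the `BytesIO` object; going by the dump, the declared size there is 0x058c (1420) bytes.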