This is an automated email from the ASF dual-hosted git repository.
wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new 862c7df ARROW-3473: [Format] Clarify that 64-bit lengths and null
counts are permitted, but not recommended
862c7df is described below
commit 862c7df5925769e7f1e84d6d7b990d5d358056bb
Author: Wes McKinney <[email protected]>
AuthorDate: Wed Oct 10 07:58:48 2018 -0400
ARROW-3473: [Format] Clarify that 64-bit lengths and null counts are
permitted, but not recommended
The Arrow metadata was changed from the initial format specification to
permit 64-bit array lengths, support for which is provided by the C++ library.
This clarifies that 64-bit lengths are permissible, but for best compatibility
(e.g. with Java), it is recommended to keep lengths within the signed 32-bit
range in practice.
See also #2733
Author: Wes McKinney <[email protected]>
Closes #2734 from wesm/ARROW-3473 and squashes the following commits:
11b4b08a9 <Wes McKinney> Add note about representing large data sets using
chunks
067ec9932 <Wes McKinney> Clarify that 64-bit lengths and null counts are
permitted, but not recommended for best inter-language compatibility
---
format/Layout.md | 32 +++++++++++++++++---------------
1 file changed, 17 insertions(+), 15 deletions(-)
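Illustrative note, not part of the commit: the recommendation to represent
larger data sets as multiple array chunks can be sketched as below. This is a
minimal sketch in plain Python that only computes chunk lengths; the names
`INT32_MAX` and `chunk_lengths` are hypothetical and do not come from any
Arrow library.

```python
# Hypothetical helper: split a logical length into chunk lengths that each
# fit in a signed 32-bit integer, per the multi-language recommendation.
INT32_MAX = 2**31 - 1  # largest length a signed 32-bit integer can hold


def chunk_lengths(total_length, max_chunk=INT32_MAX):
    """Return a list of chunk lengths, each <= max_chunk, summing to total_length."""
    if total_length < 0:
        raise ValueError("total_length must be non-negative")
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    full_chunks, remainder = divmod(total_length, max_chunk)
    lengths = [max_chunk] * full_chunks
    if remainder:
        # The final chunk holds whatever does not fill a whole chunk.
        lengths.append(remainder)
    return lengths
```

For example, a logical length of 5,000,000,000 elements would be split into
three chunks, none longer than 2<sup>31</sup> - 1 elements.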
diff --git a/format/Layout.md b/format/Layout.md
index fec2322..80af1d3 100644
--- a/format/Layout.md
+++ b/format/Layout.md
@@ -135,21 +135,19 @@ Unless otherwise noted, padded bytes do not need to have
a specific value.
## Array lengths
-Any array has a known and fixed length, stored as a 32-bit signed integer, so a
-maximum of 2<sup>31</sup> - 1 elements. We choose a signed int32 for a couple reasons:
-
-* Enhance compatibility with Java and client languages which may have varying
- quality of support for unsigned integers.
-* To encourage developers to compose smaller arrays (each of which contains
- contiguous memory in its leaf nodes) to create larger array structures
- possibly exceeding 2<sup>31</sup> - 1 elements, as opposed to allocating very large
- contiguous memory blocks.
+Array lengths are represented in the Arrow metadata as a 64-bit signed
+integer. An implementation of Arrow is considered valid even if it only
+supports lengths up to the maximum 32-bit signed integer, though. If using
+Arrow in a multi-language environment, we recommend limiting lengths to
+2<sup>31</sup> - 1 elements or less. Larger data sets can be represented using
+multiple array chunks.
## Null count
The number of null value slots is a property of the physical array and
-considered part of the data structure. The null count is stored as a 32-bit
-signed integer, as it may be as large as the array length.
+considered part of the data structure. The null count is represented in the
+Arrow metadata as a 64-bit signed integer, as it may be as large as the array
+length.
## Null bitmaps
@@ -614,10 +612,14 @@ the types array indicates that a slot contains a
different type at the index.
## Dictionary encoding
-When a field is dictionary encoded, the values are represented by an array of Int32 representing the index of the value in the dictionary.
-The Dictionary is received as one or more DictionaryBatches with the id referenced by a dictionary attribute defined in the metadata ([Message.fbs][7]) in the Field table.
-The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatches.
-When a Schema references a Dictionary id, it must send at least one DictionaryBatch for this id.
+When a field is dictionary encoded, the values are represented by an array of
+Int32 representing the index of the value in the dictionary. The Dictionary is
+received as one or more DictionaryBatches with the id referenced by a
+dictionary attribute defined in the metadata ([Message.fbs][7]) in the Field
+table. The dictionary has the same layout as the type of the field would
+dictate. Each entry in the dictionary can be accessed by its index in the
+DictionaryBatches. When a Schema references a Dictionary id, it must send at
+least one DictionaryBatch for this id.
As an example, you could have the following data:
```