[arrow] branch master updated: ARROW-3473: [Format] Clarify that 64-bit lengths and null counts are permitted, but not recommended

wesm Wed, 10 Oct 2018 04:59:16 -0700

This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git



The following commit(s) were added to refs/heads/master by this push:
     new 862c7df  ARROW-3473: [Format] Clarify that 64-bit lengths and null counts are permitted, but not recommended
862c7df is described below

commit 862c7df5925769e7f1e84d6d7b990d5d358056bb
Author: Wes McKinney <[email protected]>
AuthorDate: Wed Oct 10 07:58:48 2018 -0400

    ARROW-3473: [Format] Clarify that 64-bit lengths and null counts are permitted, but not recommended
    
    The Arrow metadata was changed from the initial format specification to permit 64-bit array lengths, support for which is provided by the C++ library. This clarifies that 64-bit lengths are permissible, but for best compatibility (e.g. with Java), it is recommended to use 32-bits or less in practice.
    
    see also #2733
    
    Author: Wes McKinney <[email protected]>
    
    Closes #2734 from wesm/ARROW-3473 and squashes the following commits:
    
    11b4b08a9 <Wes McKinney> Add note about representing large data sets using chunks
    067ec9932 <Wes McKinney> Clarify that 64-bit lengths and null counts are permitted, but not recommended for best inter-language compatibility
---
 format/Layout.md | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/format/Layout.md b/format/Layout.md
index fec2322..80af1d3 100644
--- a/format/Layout.md
+++ b/format/Layout.md
@@ -135,21 +135,19 @@ Unless otherwise noted, padded bytes do not need to have a specific value.
 
 ## Array lengths
 
-Any array has a known and fixed length, stored as a 32-bit signed integer, so a
-maximum of 2<sup>31</sup> - 1 elements. We choose a signed int32 for a couple reasons:
-
-* Enhance compatibility with Java and client languages which may have varying
-  quality of support for unsigned integers.
-* To encourage developers to compose smaller arrays (each of which contains
-  contiguous memory in its leaf nodes) to create larger array structures
-  possibly exceeding 2<sup>31</sup> - 1 elements, as opposed to allocating very large
-  contiguous memory blocks.
+Array lengths are represented in the Arrow metadata as a 64-bit signed
+integer. An implementation of Arrow is considered valid even if it only
+supports lengths up to the maximum 32-bit signed integer, though. If using
+Arrow in a multi-language environment, we recommend limiting lengths to
+2<sup>31</sup> - 1 elements or less. Larger data sets can be represented using
+multiple array chunks.
 
 ## Null count
 
 The number of null value slots is a property of the physical array and
-considered part of the data structure. The null count is stored as a 32-bit
-signed integer, as it may be as large as the array length.
+considered part of the data structure. The null count is represented in the
+Arrow metadata as a 64-bit signed integer, as it may be as large as the array
+length.
 
 ## Null bitmaps
 
@@ -614,10 +612,14 @@ the types array indicates that a slot contains a different type at the index.
 
 ## Dictionary encoding
 
-When a field is dictionary encoded, the values are represented by an array of Int32 representing the index of the value in the dictionary.
-The Dictionary is received as one or more DictionaryBatches with the id referenced by a dictionary attribute defined in the metadata ([Message.fbs][7]) in the Field table.
-The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatches.
-When a Schema references a Dictionary id, it must send at least one DictionaryBatch for this id.
+When a field is dictionary encoded, the values are represented by an array of
+Int32 representing the index of the value in the dictionary.  The Dictionary is
+received as one or more DictionaryBatches with the id referenced by a
+dictionary attribute defined in the metadata ([Message.fbs][7]) in the Field
+table.  The dictionary has the same layout as the type of the field would
+dictate. Each entry in the dictionary can be accessed by its index in the
+DictionaryBatches.  When a Schema references a Dictionary id, it must send at
+least one DictionaryBatch for this id.
 
 As an example, you could have the following data:
 ```
