Thanks for the quick and thorough reply.  Snipping out some segments
for follow-up (everything else sounds reasonable):

>>
>> 2.  The document specifies null bitmaps should have a length padded to
>> a multiple of 8 bytes to avoid word alignment concerns.  I assume the
>> concern is having a subsequent value buffer that isn't properly
>> aligned.  Are there other concerns that make the padding worthwhile?
>> If not, it might pay to define the term "contiguous memory buffer"
>> (already mentioned in the document) as a "contiguous memory region"
>> that starts at an address that is a multiple of 8 and not require the
>> padding (both values and bitmaps are stored in their own contiguous
>> memory buffers).  This would ensure that odd-length primitive byte
>> arrays won't mess up alignment either.


> I believe this choice was primarily about simplifying the code (similar to
> why we have n+1 offsets instead of just n in the list/varchar
> representations (even though the first offset is always 0)). In both
> situations, you don't have to worry about writing special code (and a
> condition) for the boundary condition inside tight loops (e.g. the last
> few bytes need to be handled differently since they aren't word width).

Sounds reasonable.  It might be worth illustrating this with a
concrete example.  One scenario where this scheme seems useful is
creating a new bitmap by evaluating a predicate (e.g. all elements >X).
In that case, would it make sense to make the padding a multiple of
16 bytes, so we can consistently use SIMD instructions for the
logical "and" operation (e.g. combining the predicate bitmap with the
null bitmap)?

Related to the size of unions:
> The original discussion was whether we really need more than a single
> byte. Part of this has to do with whether the system you're implementing
> allows multiple of the same type in the union. My personal preference is
> that unions should only exist where node type deviates (more like JSON
> than Avro) which is why we have previously used one byte in the java
> implementation (we actually only need 20 some options maximum if you
> only focus on base types). I understand that others think about the
> world differently and like the concept of a union of structs and thus
> relented to move to 2 bytes. Any more than that seems like overkill.
> (Are you really seeing a union of >64k structs frequently? In other
> words, I vote for waiting until there is substantial common demand
> before expanding.)

I think the spec is slightly inconsistent.  It says there are 6 bytes
of overhead per entry but then says: "with the smallest byte width
capable of representing the number of types in the union."  I'm
perfectly happy to say the width is always 1, always 2, or capped at
2.  I agree 32K/64K+ types is a very unlikely scenario.  We just need
to clear up the ambiguity.
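
To make the difference concrete, here is a rough sketch of a reader
under the two interpretations (the names are made up, not from any
Arrow implementation):

    #include <stdint.h>

    /* Fixed width (always 2 bytes): a type id is one indexed load. */
    static int16_t type_id_fixed(const int16_t *type_ids, int64_t row)
    {
        return type_ids[row];
    }

    /* "Smallest byte width capable of representing the number of
     * types": every consumer has to branch on the width before it can
     * read a single id. */
    static int32_t type_id_smallest(const void *type_ids, int byte_width,
                                    int64_t row)
    {
        switch (byte_width) {
        case 1:  return ((const int8_t  *)type_ids)[row];
        case 2:  return ((const int16_t *)type_ids)[row];
        default: return ((const int32_t *)type_ids)[row];
        }
    }

A fixed width keeps the hot path to one load; the "smallest width"
reading makes every reader carry the width dispatch around.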

-Micah

On Thu, Apr 7, 2016 at 11:07 PM, Jacques Nadeau <jacq...@apache.org> wrote:
> Inline...
>
>>
>> 1.  For completeness it might be useful to add a statement that the
>> byte order (endianness) is platform native.
>
>
> Actually, Arrow is little-endian. It is an oversight that we haven't
> documented it as such. One of the key capabilities is to push it across the
> wire between separate systems without serialization (not just IPC). As such,
> we have to pick an endianness. If there is a huge need for a second
> big-endian encoding, we'll need to extend the spec to support that as a
> property.
>
>>
>>
>> 2.  The document specifies null bitmaps should have a length padded to
>> a multiple of 8 bytes to avoid word alignment concerns.  I assume the
>> concern is having a subsequent value buffer that isn't properly
>> aligned.  Are there other concerns that make the padding worthwhile?
>> If not, it might pay to define the term "contiguous memory buffer"
>> (already mentioned in the document) as a "contiguous memory region"
>> that starts at an address that is a multiple of 8 and not require the
>> padding (both values and bitmaps are stored in their own contiguous
>> memory buffers).  This would ensure that odd-length primitive byte
>> arrays won't mess up alignment either.
>
>
> I believe this choice was primarily about simplifying the code (similar to
> why we have n+1 offsets instead of just n in the list/varchar
> representations (even though the first offset is always 0)). In both
> situations, you don't have to worry about writing special code (and a
> condition) for the boundary condition inside tight loops (e.g. the last
> few bytes need to be handled differently since they aren't word width).
>
>>
>>
>> 3.  Unions contain type arrays defined as: "An array of unsigned
>> integers, enumerated from 0 corresponding to each type, with the
>> smallest byte width capable of representing the number of types in the
>> union."
>>
>> Making these integers signed eases compatibility with Java (something
>> we are already doing for other uses of integers). According to
>> Jacques, Arrow's Java implementation already works this way.
>
>
> Upon rechecking the code, I was mistaken. We actually use a signed byte and
> need to update to a two-byte width to match the spec. Other notes: Java does
> have a 2-byte unsigned integer (char) so I'm not sure it matters that much
> (especially since we won't actually be doing math with this).
>
>>
>>
>> Additionally, in the rare case that it takes more than 2 bytes to
>> represent the number of types in the union, we might want to use the
>> next power-of-2 byte width (i.e. 4 bytes).  This will make decoding
>> the types easier at the cost of additional memory.
>
>
> The original discussion was whether we really need more than a single byte.
> Part of this has to do with whether the system you're implementing allows
> multiple of the same type in the union. My personal preference is that
> unions should only exist where node type deviates (more like JSON than Avro)
> which is why we have previously used one byte in the java implementation (we
> actually only need 20 some options maximum if you only focus on base types).
> I understand that others think about the world differently and like the
> concept of a union of structs and thus relented to move to 2 bytes. Any more
> than that seems like overkill. (Are you really seeing a union of >64k
> structs frequently? In other words, I vote for waiting until there is
> substantial common demand before expanding.)
>
>>
>>
>> 4.  The current maximum number of elements in a column is 2^31-1.
>> https://issues.apache.org/jira/browse/SPARK-10399 lists a (contrived)
>> use-case of storing images.  This could make
>> https://issues.apache.org/jira/browse/ARROW-39 (fixed-size
>> row-batches) awkward, since determining the correct number of rows for
>> each batch to avoid overflow would be difficult.
>
>
> Looking at ARROW-39, I asked for more information because I'm unclear as to
> what exactly Wes is thinking. Record batches are not required to be a fixed
> size and an implementation should be able to choose a pattern that fits its
> needs. (Root vectors within a batch do have to be the same size, and that
> does cause allocation pain, but that is really the nature of
> pre-allocation. In fact, it is one of the strong reasons to try to keep
> the record batches fairly small in size. Otherwise memory management and
> estimation become unbearable.)
>
>>
>> There are a few
>> options I can think of to solve this:
>>    a.  Do nothing (just accept it isn't supported).
>>    b.  Change the length of arrays to be a 64-bit signed int and
>> offsets for List types to be 64-bit ints.
>>    c.  Define a new type "LargeList" that has offsets of 64-bit signed
>> integers, an additional list of corresponding lengths (signed 32-bit
>> integers) for each offset, and a vector of nested value arrays (all
>> arrays except for the last one are equal in size and their size is a
>> power of 2 to facilitate efficient random access).  Each array will
>> only contain complete lists (offset and offset+length will always
>> reference addresses in the same buffer).  This implies there will be
>> some fragmentation/unused space.  I think this is unavoidable.  Lists
>> spanning different arrays would require a memcpy/memory allocation, or
>> it would require returning multiple offsets/lengths and making the
>> client reassemble the List.
>>
>> In the short term, option "a" seems reasonable.  In the longer term I
>> prefer option "c".  Option "b" would require allocating very large
>> chunks of contiguous memory, and it doesn't seem like it would be
>> compatible with Java (ByteBuffer.allocateDirect takes an int; I am
>> not sure about the Unsafe APIs).
>>
>
> Yeah, definitely (a) for now. As engineers, I think we can all come up with
> situations that are constrained by the sizing we have in the current
> layout. That being said, records that store more than 2GB in a single leaf
> node are really going to have all sorts of problems, including
> fragmentation issues and the inability to move that data easily across
> RPC, that will probably break most systems that are doing things like
> trying to target batches of records that are ~256k.  In terms of whether
> (c) makes sense longer term, it is hard to say. My thought would be to
> wait until someone builds a logical version of this on top of the Arrow
> physical representations and then learn from what they did to try to
> incorporate something into an updated spec.
>
>
> Thanks for all the great ideas/questions/etc.
>
> Jacques
>
>
