Shall we create a JIRA to codify the alignment settings someplace in
the metadata? We may need an umbrella JIRA to include downstream
issues tracking code changes to implement and test the default
alignment.

On Sat, Apr 9, 2016 at 11:34 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> Hi Kai,
> I think we are in agreement.  For cross machine transfers, alignment
> doesn't make a difference but width does.  We could optimize
> transfers, by only sending non-padded buffers and then have clients
> re-pad the data (but before we optimize we should likely have a
> working implementation :)
>
> The use-case I was thinking of for transferring alignment as part of
> the metadata was for the shared-memory IPC.  I just double checked the
> current IPC metadata
> (https://github.com/apache/arrow/blob/master/format/Message.fbs), and
> all the necessary information is already in there to check for
> alignment and width.  So passing along the alignment/width
> requirements shouldn't be necessary.  We should be able to check
> offset and lengths when reading the metadata.
>
> All of this might be getting into a little bit of premature
> optimization though.  I will wait a little bit longer for comments
> from others and then update the spec.
>
> Thanks,
> Micah
>
> On Sat, Apr 9, 2016 at 4:55 PM, Zheng, Kai <kai.zh...@intel.com> wrote:
>> Hi Micah,
>>
>> Thanks for your thorough thoughts. The general consideration makes sense to 
>> me building Arrow with SIMD support in mind meanwhile not complicating codes 
>> too much, to use a fixed value 64 byte by default. We can always improve and 
>> optimize this accordingly when have concrete solid algorithms and workloads 
>> to benchmark with and collect real performance data, as you said.
>>
>> One thing I'm not sure about is, whether alignment requirement should be 
>> included in IPC metadata, because in my understanding, no buffer address is 
>> needed to be passed across machines, so it's up to destination machine to 
>> decide how to reallocate the data buffer for receiving the transferred data 
>> with whatever alignment address. An alignment address that's good to the 
>> source machine but may be not good to the destination.
>>
>> So in summary, it's good to mention this alignment consideration in the 
>> spec, also saying the fixed 64 byte alignment address is used by default; 
>> and hard-code the fixed value in source codes (for example, when allocating 
>> buffers for primitive arrays, chunk array buffers, and null bitmap buffers).
>>
>> Please help clarify if I'm not getting you right. Thanks.
>>
>> Regards,
>> Kai
>>
>> -----Original Message-----
>> From: Micah Kornfield [mailto:emkornfi...@gmail.com]
>> Sent: Saturday, April 09, 2016 12:56 PM
>> To: dev@arrow.apache.org
>> Subject: Re: Some questions/proposals for the spec (Layout.md)
>>
>> Hi Kai,
>> Are you proposing making alignment and width part of the RPC metadata?
>>   I think this is a good longer term idea, but for simplicity's sake, I 
>> think starting with one fixed value is a good idea.
>>
>> I agree that in the general case guaranteeing alignment is difficult when we 
>> have variable width data (e.g. strings) or sliced data 
>> (https://issues.apache.org/jira/browse/ARROW-33).  However, I think a fairly 
>> common use-case for Arrow will be dealing with fixed width non-nested types 
>> (e.g. float, doubles, int32_t) where alignment can be
>> guaranteed.   In these cases being able to make use of the optimal CPU
>> instruction set is important.
>>
>> In this regard, one concern with 8 bytes as the default width is that it 
>> will cause suboptimal use of current CPUs.  For instance, the Intel 
>> Optimization Guide
>> (http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf)
>> states "An access to data unaligned on 64-byte boundary leads to two memory 
>> accesses and requires several μops to be executed (instead of one)." and "A 
>> 64-byte or greater data structure or array should be aligned so that its 
>> base address is a multiple of 64."
>>
>> It would be interesting to know the exact performance difference for 
>> compiler generated code knowing about different degrees of alignment/width 
>> as well as the performance difference using assembly/intrinsics.  In the 
>> absence of the performance data, I think defaulting to 64 byte alignment 
>> (when the programming language allows for it) based on recommendation from 
>> the guide makes sense.  In addition given the existence of 512-bit SIMD, 
>> using 64 byte padding for width also makes sense.
>>
>> Do you have concerns if we make 64 bytes the default instead of 8?
>>
>> Thanks,
>> Micah
>>
>> On Fri, Apr 8, 2016 at 1:44 PM, Zheng, Kai <kai.zh...@intel.com> wrote:
>>> I'm from Intel but not any hardware folks, just would provide my thoughts. 
>>> Yes the width and alignment requirement can be very different according to 
>>> what version of SIMD is used. And also, sometimes it's hard to keep the 
>>> alignment to access specific fields or parts in the even aligned memory 
>>> region. It's complex, I thought it's good to mention this aspect of 
>>> consideration in the spec but come to the data structures or format, it can 
>>> leave to platform specific optimizations regarding to concrete computing 
>>> operators and algorithms to use alignment awareness buffer allocators 
>>> considering this potential performance impact. A default value of 8 as 
>>> mentioned may be used but other values can also be passed.
>>>
>>> Regards,
>>> Kai
>>>
>>> -----Original Message-----
>>> From: Wes McKinney [mailto:w...@cloudera.com]
>>> Sent: Friday, April 08, 2016 11:40 PM
>>> To: dev@arrow.apache.org
>>> Subject: Re: Some questions/proposals for the spec (Layout.md)
>>>
>>> On the SIMD question, it seems AVX is going to 512 bits, so one could even 
>>> argue for 64-byte alignment as a matter of future-proofing.  AVX2 / 256-bit 
>>> seems fairly widely available nowadays, but it would be great if Todd or 
>>> any of the hardware folks (e.g. from Intel) on the list could weigh in with 
>>> guidance.
>>>
>>> https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
>>>
>>> On Fri, Apr 8, 2016 at 8:33 AM, Wes McKinney <w...@cloudera.com> wrote:
>>>> On Fri, Apr 8, 2016 at 8:07 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>>>>>>
>>>>>>
>>>>>> > I believe this choice was primarily about simplifying the code
>>>>>> > (similar
>>>>>> to why we have a n+1
>>>>>> > offsets instead of just n in the list/varchar representations
>>>>>> > (even
>>>>>> though n=0 is always 0)). In both
>>>>>> > situations, you don't have to worry about writing special code
>>>>>> > (and a
>>>>>> condition) for the boundary
>>>>>> > condition inside tight loops (e.g. the last few bytes need to be
>>>>>> > handled
>>>>>> differently since they
>>>>>> > aren't word width).
>>>>>>
>>>>>> Sounds reasonable.  It might be worth illustrating this with a
>>>>>> concrete example.  One scenario that this scheme seems useful for
>>>>>> is a creating a new bitmap based on evaluating a predicate (i.e.
>>>>>> all elements >X).  In this case would it make sense to make it a
>>>>>> multiple of 16, so we can consistently use SIMD instructions for
>>>>>> the logical "and" operation?
>>>>>>
>>>>>
>>>>> Hmm... interesting thought. I'd have to look but I also recall some
>>>>> of the newer stuff supporting even wider widths. What do others think?
>>>>>
>>>>>
>>>>>> I think the spec is slightly inconsistent.  It says there is 6
>>>>>> bytes of overhead per entry but then follows: "with the smallest
>>>>>> byte width capable of representing the number of types in the union."
>>>>>> I'm perfectly happy to say it is always 1, always 2, or always
>>>>>> capped at 2.  I agree 32K/64K+ types is a very unlikely scenario.
>>>>>> We just need to clear up the ambiguity.
>>>>>>
>>>>>
>>>>> Agreed. Do you want to propose an approach & patch to clarify?
>>>>
>>>> I can also take responsibility for the ambiguity here. My preference
>>>> is to use int16_t for the types array (memory suitably aligned), but
>>>> as 1 byte will be sufficient nearly all of the time, it's a slight
>>>> trade-off in memory use vs. code complexity, e.g.
>>>>
>>>> if (children_.size() < 128) {
>>>>   // types is only 1 byte
>>>> } else {
>>>>   // types is 2 bytes
>>>> }
>>>>
>>>> Realistically there won't be that many affected code paths, so I'm
>>>> comfortable with either choice (2-bytes always, or 1 or 2 bytes
>>>> depending on the size of the union).

Reply via email to