https://issues.apache.org/jira/browse/ARROW-104 covers the change to
the spec.  I should hopefully have a pull request out sometime this
weekend.  Once it was merged I was going to open up JIRAs for C++ and
Java code.

On Fri, Apr 22, 2016 at 12:18 PM, Wes McKinney <w...@cloudera.com> wrote:
> Shall we create a JIRA to codify the alignment settings someplace in
> the metadata? We may need an umbrella JIRA to include downstream
> issues tracking code changes to implement and test the default
> alignment.
>
> On Sat, Apr 9, 2016 at 11:34 PM, Micah Kornfield <emkornfi...@gmail.com> 
> wrote:
>> Hi Kai,
>> I think we are in agreement.  For cross machine transfers, alignment
>> doesn't make a difference but width does.  We could optimize
>> transfers, by only sending non-padded buffers and then have clients
>> re-pad the data (but before we optimize we should likely have a
>> working implementation :)
>>
>> The use-case I was thinking of for transferring alignment as part of
>> the metadata was for the shared-memory IPC.  I just double checked the
>> current IPC metadata
>> (https://github.com/apache/arrow/blob/master/format/Message.fbs), and
>> all the necessary information is already in there to check for
>> alignment and width.  So passing along the alignment/width
>> requirements shouldn't be necessary.  We should be able to check
>> offset and lengths when reading the metadata.
>>
>> All of this might be getting into a little bit of premature
>> optimization though.  I will wait a little bit longer for comments
>> from others and then update the spec.
>>
>> Thanks,
>> Micah
>>
>> On Sat, Apr 9, 2016 at 4:55 PM, Zheng, Kai <kai.zh...@intel.com> wrote:
>>> Hi Micah,
>>>
>>> Thanks for your thorough thoughts. The general consideration makes sense to 
>>> me building Arrow with SIMD support in mind meanwhile not complicating 
>>> codes too much, to use a fixed value 64 byte by default. We can always 
>>> improve and optimize this accordingly when have concrete solid algorithms 
>>> and workloads to benchmark with and collect real performance data, as you 
>>> said.
>>>
>>> One thing I'm not sure about is, whether alignment requirement should be 
>>> included in IPC metadata, because in my understanding, no buffer address is 
>>> needed to be passed across machines, so it's up to destination machine to 
>>> decide how to reallocate the data buffer for receiving the transferred data 
>>> with whatever alignment address. An alignment address that's good to the 
>>> source machine but may be not good to the destination.
>>>
>>> So in summary, it's good to mention this alignment consideration in the 
>>> spec, also saying the fixed 64 byte alignment address is used by default; 
>>> and hard-code the fixed value in source codes (for example, when allocating 
>>> buffers for primitive arrays, chunk array buffers, and null bitmap buffers).
>>>
>>> Please help clarify if I'm not getting you right. Thanks.
>>>
>>> Regards,
>>> Kai
>>>
>>> -----Original Message-----
>>> From: Micah Kornfield [mailto:emkornfi...@gmail.com]
>>> Sent: Saturday, April 09, 2016 12:56 PM
>>> To: dev@arrow.apache.org
>>> Subject: Re: Some questions/proposals for the spec (Layout.md)
>>>
>>> Hi Kai,
>>> Are you proposing making alignment and width part of the RPC metadata?
>>>   I think this is a good longer term idea, but for simplicity's sake, I 
>>> think starting with one fixed value is a good idea.
>>>
>>> I agree that in the general case guaranteeing alignment is difficult when 
>>> we have variable width data (e.g. strings) or sliced data 
>>> (https://issues.apache.org/jira/browse/ARROW-33).  However, I think a 
>>> fairly common use-case for Arrow will be dealing with fixed width 
>>> non-nested types (e.g. float, doubles, int32_t) where alignment can be
>>> guaranteed.   In these cases being able to make use of the optimal CPU
>>> instruction set is important.
>>>
>>> In this regard, one concern with 8 bytes as the default width is that it 
>>> will cause suboptimal use of current CPUs.  For instance, the Intel 
>>> Optimization Guide
>>> (http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf)
>>> states "An access to data unaligned on 64-byte boundary leads to two memory 
>>> accesses and requires several μops to be executed (instead of one)." and "A 
>>> 64-byte or greater data structure or array should be aligned so that its 
>>> base address is a multiple of 64."
>>>
>>> It would be interesting to know the exact performance difference for 
>>> compiler generated code knowing about different degrees of alignment/width 
>>> as well as the performance difference using assembly/intrinsics.  In the 
>>> absence of the performance data, I think defaulting to 64 byte alignment 
>>> (when the programming language allows for it) based on recommendation from 
>>> the guide makes sense.  In addition given the existence of 512-bit SIMD, 
>>> using 64 byte padding for width also makes sense.
>>>
>>> Do you have concerns if we make 64 bytes the default instead of 8?
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Fri, Apr 8, 2016 at 1:44 PM, Zheng, Kai <kai.zh...@intel.com> wrote:
>>>> I'm from Intel but not any hardware folks, just would provide my thoughts. 
>>>> Yes the width and alignment requirement can be very different according to 
>>>> what version of SIMD is used. And also, sometimes it's hard to keep the 
>>>> alignment to access specific fields or parts in the even aligned memory 
>>>> region. It's complex, I thought it's good to mention this aspect of 
>>>> consideration in the spec but come to the data structures or format, it 
>>>> can leave to platform specific optimizations regarding to concrete 
>>>> computing operators and algorithms to use alignment awareness buffer 
>>>> allocators considering this potential performance impact. A default value 
>>>> of 8 as mentioned may be used but other values can also be passed.
>>>>
>>>> Regards,
>>>> Kai
>>>>
>>>> -----Original Message-----
>>>> From: Wes McKinney [mailto:w...@cloudera.com]
>>>> Sent: Friday, April 08, 2016 11:40 PM
>>>> To: dev@arrow.apache.org
>>>> Subject: Re: Some questions/proposals for the spec (Layout.md)
>>>>
>>>> On the SIMD question, it seems AVX is going to 512 bits, so one could even 
>>>> argue for 64-byte alignment as a matter of future-proofing.  AVX2 / 
>>>> 256-bit seems fairly widely available nowadays, but it would be great if 
>>>> Todd or any of the hardware folks (e.g. from Intel) on the list could 
>>>> weigh in with guidance.
>>>>
>>>> https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
>>>>
>>>> On Fri, Apr 8, 2016 at 8:33 AM, Wes McKinney <w...@cloudera.com> wrote:
>>>>> On Fri, Apr 8, 2016 at 8:07 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>>>>>>>
>>>>>>>
>>>>>>> > I believe this choice was primarily about simplifying the code
>>>>>>> > (similar
>>>>>>> to why we have a n+1
>>>>>>> > offsets instead of just n in the list/varchar representations
>>>>>>> > (even
>>>>>>> though n=0 is always 0)). In both
>>>>>>> > situations, you don't have to worry about writing special code
>>>>>>> > (and a
>>>>>>> condition) for the boundary
>>>>>>> > condition inside tight loops (e.g. the last few bytes need to be
>>>>>>> > handled
>>>>>>> differently since they
>>>>>>> > aren't word width).
>>>>>>>
>>>>>>> Sounds reasonable.  It might be worth illustrating this with a
>>>>>>> concrete example.  One scenario that this scheme seems useful for
>>>>>>> is a creating a new bitmap based on evaluating a predicate (i.e.
>>>>>>> all elements >X).  In this case would it make sense to make it a
>>>>>>> multiple of 16, so we can consistently use SIMD instructions for
>>>>>>> the logical "and" operation?
>>>>>>>
>>>>>>
>>>>>> Hmm... interesting thought. I'd have to look but I also recall some
>>>>>> of the newer stuff supporting even wider widths. What do others think?
>>>>>>
>>>>>>
>>>>>>> I think the spec is slightly inconsistent.  It says there is 6
>>>>>>> bytes of overhead per entry but then follows: "with the smallest
>>>>>>> byte width capable of representing the number of types in the union."
>>>>>>> I'm perfectly happy to say it is always 1, always 2, or always
>>>>>>> capped at 2.  I agree 32K/64K+ types is a very unlikely scenario.
>>>>>>> We just need to clear up the ambiguity.
>>>>>>>
>>>>>>
>>>>>> Agreed. Do you want to propose an approach & patch to clarify?
>>>>>
>>>>> I can also take responsibility for the ambiguity here. My preference
>>>>> is to use int16_t for the types array (memory suitably aligned), but
>>>>> as 1 byte will be sufficient nearly all of the time, it's a slight
>>>>> trade-off in memory use vs. code complexity, e.g.
>>>>>
>>>>> if (children_.size() < 128) {
>>>>>   // types is only 1 byte
>>>>> } else {
>>>>>   // types is 2 bytes
>>>>> }
>>>>>
>>>>> Realistically there won't be that many affected code paths, so I'm
>>>>> comfortable with either choice (2-bytes always, or 1 or 2 bytes
>>>>> depending on the size of the union).

Reply via email to