https://issues.apache.org/jira/browse/ARROW-104 covers the change to the spec. I should hopefully have a pull request out sometime this weekend. Once it was merged I was going to open up JIRAs for C++ and Java code.
On Fri, Apr 22, 2016 at 12:18 PM, Wes McKinney <w...@cloudera.com> wrote: > Shall we create a JIRA to codify the alignment settings someplace in > the metadata? We may need an umbrella JIRA to include downstream > issues tracking code changes to implement and test the default > alignment. > > On Sat, Apr 9, 2016 at 11:34 PM, Micah Kornfield <emkornfi...@gmail.com> > wrote: >> Hi Kai, >> I think we are in agreement. For cross machine transfers, alignment >> doesn't make a difference but width does. We could optimize >> transfers, by only sending non-padded buffers and then have clients >> re-pad the data (but before we optimize we should likely have a >> working implementation :) >> >> The use-case I was thinking of for transferring alignment as part of >> the metadata was for the shared-memory IPC. I just double checked the >> current IPC metadata >> (https://github.com/apache/arrow/blob/master/format/Message.fbs), and >> all the necessary information is already in there to check for >> alignment and width. So passing along the alignment/width >> requirements shouldn't be necessary. We should be able to check >> offset and lengths when reading the metadata. >> >> All of this might be getting into a little bit of premature >> optimization though. I will wait a little bit longer for comments >> from others and then update the spec. >> >> Thanks, >> Micah >> >> On Sat, Apr 9, 2016 at 4:55 PM, Zheng, Kai <kai.zh...@intel.com> wrote: >>> Hi Micah, >>> >>> Thanks for your thorough thoughts. The general consideration makes sense to >>> me building Arrow with SIMD support in mind meanwhile not complicating >>> codes too much, to use a fixed value 64 byte by default. We can always >>> improve and optimize this accordingly when have concrete solid algorithms >>> and workloads to benchmark with and collect real performance data, as you >>> said. >>> >>> One thing I'm not sure about is, whether alignment requirement should be >>> included in IPC metadata, because in my understanding, no buffer address is >>> needed to be passed across machines, so it's up to destination machine to >>> decide how to reallocate the data buffer for receiving the transferred data >>> with whatever alignment address. An alignment address that's good to the >>> source machine but may be not good to the destination. >>> >>> So in summary, it's good to mention this alignment consideration in the >>> spec, also saying the fixed 64 byte alignment address is used by default; >>> and hard-code the fixed value in source codes (for example, when allocating >>> buffers for primitive arrays, chunk array buffers, and null bitmap buffers). >>> >>> Please help clarify if I'm not getting you right. Thanks. >>> >>> Regards, >>> Kai >>> >>> -----Original Message----- >>> From: Micah Kornfield [mailto:emkornfi...@gmail.com] >>> Sent: Saturday, April 09, 2016 12:56 PM >>> To: dev@arrow.apache.org >>> Subject: Re: Some questions/proposals for the spec (Layout.md) >>> >>> Hi Kai, >>> Are you proposing making alignment and width part of the RPC metadata? >>> I think this is a good longer term idea, but for simplicity's sake, I >>> think starting with one fixed value is a good idea. >>> >>> I agree that in the general case guaranteeing alignment is difficult when >>> we have variable width data (e.g. strings) or sliced data >>> (https://issues.apache.org/jira/browse/ARROW-33). However, I think a >>> fairly common use-case for Arrow will be dealing with fixed width >>> non-nested types (e.g. float, doubles, int32_t) where alignment can be >>> guaranteed. In these cases being able to make use of the optimal CPU >>> instruction set is important. >>> >>> In this regard, one concern with 8 bytes as the default width is that it >>> will cause suboptimal use of current CPUs. For instance, the Intel >>> Optimization Guide >>> (http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf) >>> states "An access to data unaligned on 64-byte boundary leads to two memory >>> accesses and requires several μops to be executed (instead of one)." and "A >>> 64-byte or greater data structure or array should be aligned so that its >>> base address is a multiple of 64." >>> >>> It would be interesting to know the exact performance difference for >>> compiler generated code knowing about different degrees of alignment/width >>> as well as the performance difference using assembly/intrinsics. In the >>> absence of the performance data, I think defaulting to 64 byte alignment >>> (when the programming language allows for it) based on recommendation from >>> the guide makes sense. In addition given the existence of 512-bit SIMD, >>> using 64 byte padding for width also makes sense. >>> >>> Do you have concerns if we make 64 bytes the default instead of 8? >>> >>> Thanks, >>> Micah >>> >>> On Fri, Apr 8, 2016 at 1:44 PM, Zheng, Kai <kai.zh...@intel.com> wrote: >>>> I'm from Intel but not any hardware folks, just would provide my thoughts. >>>> Yes the width and alignment requirement can be very different according to >>>> what version of SIMD is used. And also, sometimes it's hard to keep the >>>> alignment to access specific fields or parts in the even aligned memory >>>> region. It's complex, I thought it's good to mention this aspect of >>>> consideration in the spec but come to the data structures or format, it >>>> can leave to platform specific optimizations regarding to concrete >>>> computing operators and algorithms to use alignment awareness buffer >>>> allocators considering this potential performance impact. A default value >>>> of 8 as mentioned may be used but other values can also be passed. >>>> >>>> Regards, >>>> Kai >>>> >>>> -----Original Message----- >>>> From: Wes McKinney [mailto:w...@cloudera.com] >>>> Sent: Friday, April 08, 2016 11:40 PM >>>> To: dev@arrow.apache.org >>>> Subject: Re: Some questions/proposals for the spec (Layout.md) >>>> >>>> On the SIMD question, it seems AVX is going to 512 bits, so one could even >>>> argue for 64-byte alignment as a matter of future-proofing. AVX2 / >>>> 256-bit seems fairly widely available nowadays, but it would be great if >>>> Todd or any of the hardware folks (e.g. from Intel) on the list could >>>> weigh in with guidance. >>>> >>>> https://en.wikipedia.org/wiki/Advanced_Vector_Extensions >>>> >>>> On Fri, Apr 8, 2016 at 8:33 AM, Wes McKinney <w...@cloudera.com> wrote: >>>>> On Fri, Apr 8, 2016 at 8:07 AM, Jacques Nadeau <jacq...@apache.org> wrote: >>>>>>> >>>>>>> >>>>>>> > I believe this choice was primarily about simplifying the code >>>>>>> > (similar >>>>>>> to why we have a n+1 >>>>>>> > offsets instead of just n in the list/varchar representations >>>>>>> > (even >>>>>>> though n=0 is always 0)). In both >>>>>>> > situations, you don't have to worry about writing special code >>>>>>> > (and a >>>>>>> condition) for the boundary >>>>>>> > condition inside tight loops (e.g. the last few bytes need to be >>>>>>> > handled >>>>>>> differently since they >>>>>>> > aren't word width). >>>>>>> >>>>>>> Sounds reasonable. It might be worth illustrating this with a >>>>>>> concrete example. One scenario that this scheme seems useful for >>>>>>> is a creating a new bitmap based on evaluating a predicate (i.e. >>>>>>> all elements >X). In this case would it make sense to make it a >>>>>>> multiple of 16, so we can consistently use SIMD instructions for >>>>>>> the logical "and" operation? >>>>>>> >>>>>> >>>>>> Hmm... interesting thought. I'd have to look but I also recall some >>>>>> of the newer stuff supporting even wider widths. What do others think? >>>>>> >>>>>> >>>>>>> I think the spec is slightly inconsistent. It says there is 6 >>>>>>> bytes of overhead per entry but then follows: "with the smallest >>>>>>> byte width capable of representing the number of types in the union." >>>>>>> I'm perfectly happy to say it is always 1, always 2, or always >>>>>>> capped at 2. I agree 32K/64K+ types is a very unlikely scenario. >>>>>>> We just need to clear up the ambiguity. >>>>>>> >>>>>> >>>>>> Agreed. Do you want to propose an approach & patch to clarify? >>>>> >>>>> I can also take responsibility for the ambiguity here. My preference >>>>> is to use int16_t for the types array (memory suitably aligned), but >>>>> as 1 byte will be sufficient nearly all of the time, it's a slight >>>>> trade-off in memory use vs. code complexity, e.g. >>>>> >>>>> if (children_.size() < 128) { >>>>> // types is only 1 byte >>>>> } else { >>>>> // types is 2 bytes >>>>> } >>>>> >>>>> Realistically there won't be that many affected code paths, so I'm >>>>> comfortable with either choice (2-bytes always, or 1 or 2 bytes >>>>> depending on the size of the union).