[ https://issues.apache.org/jira/browse/ARROW-18141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624891#comment-17624891 ]
Jochen Ott commented on ARROW-18141: ------------------------------------ Thanks for the quick fix. Unfortunately, I currently don't have the time to try this on full arrow (I've set up all dependencies and such for arrow 5.0.0, so I'd need to backport this ...). However, I'm pretty confident this fixes the issue, as I tries the following code on godbolt.org (note that I had to slightly modify SafeLoadAs to compile on 6.3.0): {code:C++} #include <inttypes.h> #include <algorithm> #include <type_traits> #include <cstring> template <typename T> inline std::enable_if_t<std::is_trivially_copyable<T>::value, T> SafeLoadAs( const uint8_t* unaligned) { typename std::remove_const<T>::type ret; std::memcpy(&ret, unaligned, sizeof(T)); return ret; } std::pair<int64_t, int64_t> minmax(const int64_t * values, int64_t n) { int64_t minval = 0, maxval = 0; for(int64_t i=0; i<n; ++i){ // "old": int64_t value = values[i]; // "new": // int64_t value = SafeLoadAs<int64_t>((const uint8_t *)(values + i)); minval = std::min(minval, value); maxval = std::max(maxval, value); } return std::make_pair(minval, maxval); } {code} Using compiler options "-O3 -march=sandybridge", gcc 6.3.0 produces an instruction "vmovdqa" for the "old" code. For the "new" variant, it produces "vmovdqu", though, which seems like the best we could hope for. > [C++] Alignment not enforced; undefined behavior > ------------------------------------------------ > > Key: ARROW-18141 > URL: https://issues.apache.org/jira/browse/ARROW-18141 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python > Reporter: Jochen Ott > Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Attachments: test1.py > > Time Spent: 40m > Remaining Estimate: 0h > > It is possible to create arrays using unaligned memory addresses (e.g. for > int64). This seems to be in line with the arrow specification which as far as > I understand does not require alignment [1]. > However, the C++ standard requires alignment, e.g. 8 byte alignment for > int64. It is undefined behavior (UB) to create an unaligned pointer / > accessing data via an unaligned pointer. > Typically, this is not an issue in practice on x86, since gcc and other > compilers mostly emit instructions that can deal with unaligned data. > However, for gcc 6.3.0 (and probably up to including gcc versions 7.X), this > code: > [https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/statistics.cc#L355] > creates an aligned move instruction (movdqa) for the expression > {{{}values[i]{}}}. This, in turn, triggers a SIGSEGV in case {{values}} is > called via an unaligned buffer. Later compiler versions (in particular gcc > 9.X used to build the wheels published on pypi) will emit instructions that > can deal with unaligned data (movdqu instead of movdqa). > The python script "test1.py" reproduces this issue on python-level; note that > it will only trigger a SIGSEGV if compiling arrow with a compiler that emits > movdqa for the code linked above, e.g. by using gcc 6.3.0 to compile arrow. > In the wild, unaligned buffers are rare, but can appear, e.g. as a result of > deserializing pandas dataframes / numpy arrays using pickle protocol 5 that > allows out-of-band byte buffers that are re-used as arrow array buffers. > I think the line to first enter the UB regime is this reinterpret_cast: > https://github.com/apache/arrow/blob/33f2c0ec8e281fc4fe8c03b07ed2d32e343d9b0e/cpp/src/parquet/column_writer.cc#L1592 > [1][https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding] > merely "recommends" that buffers are aligned, but does not require it. -- This message was sent by Atlassian Jira (v8.20.10#820010)