Re: [QUESTION] How is mmap implemented for 8bit padded files?
On 03/08/2022 at 18:29, Jorge Cardoso Leitão wrote:
> Hi Antoine,
>
> Thanks a lot for your answer.
>
> So, if I understand correctly (I may not have), we do not impose
> restrictions on the alignment of the data when we get the pointer, only
> when we read from it. Doesn't this require checking for alignment at
> runtime?

Only if you do things that are alignment-sensitive.

That said, while unaligned data is formally allowed AFAIK, it probably occurs rarely, so potential issues (if any) have probably not surfaced.

Best regards

Antoine.
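One way to picture the runtime check discussed above, on the Rust side, is a helper that borrows the bytes zero-copy when they happen to be aligned and copies them otherwise. This is only a minimal sketch: the function name `f32_values` is hypothetical, it is not taken from any Arrow implementation, and it assumes the bytes are in the host's native byte order.

```rust
use std::borrow::Cow;

/// Views `bytes` as f32 values (hypothetical helper, not an Arrow API).
/// Borrows zero-copy when the slice is aligned for f32 and its length is a
/// multiple of 4; otherwise decodes the values into a newly allocated Vec.
/// Assumes native byte order in both branches.
fn f32_values(bytes: &[u8]) -> Cow<'_, [f32]> {
    // SAFETY: every bit pattern is a valid f32, so reinterpreting the
    // aligned middle part of a byte slice as f32 is sound.
    let (prefix, aligned, suffix) = unsafe { bytes.align_to::<f32>() };
    if prefix.is_empty() && suffix.is_empty() {
        // Aligned case: no copy, the result points into `bytes`.
        Cow::Borrowed(aligned)
    } else {
        // Unaligned case: copy byte-wise, never dereferencing a
        // misaligned f32 pointer. Trailing partial values are ignored.
        Cow::Owned(
            bytes
                .chunks_exact(4)
                .map(|c| f32::from_ne_bytes([c[0], c[1], c[2], c[3]]))
                .collect(),
        )
    }
}
```

The check costs one pointer test per buffer rather than per value, which matches the point that only alignment-sensitive operations need to care.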
Re: [QUESTION] How is mmap implemented for 8bit padded files?
Hi Antoine,

Thanks a lot for your answer.

So, if I understand correctly (I may not have), we do not impose restrictions on the alignment of the data when we get the pointer, only when we read from it. Doesn't this require checking for alignment at runtime?

Best,
Jorge

On Tue, Aug 2, 2022 at 6:59 PM Antoine Pitrou wrote:
> Hi Jorge,
>
> So there are two aspects to the answer:
>
> - ideally, the C++ implementation also works on non-aligned data (though
>   this is poorly tested, if at all)
>
> - when mmap'ing a file, you should get a page-aligned address
>
> As for int128 and int256, these usually don't exist at the hardware
> level anyway, so implementing those reads as a combination of 64-bit
> reads shouldn't hurt performance-wise.
>
> More generally, I don't know about Rust, but in C++ unaligned access
> would be made UB-safe by using the memcpy trick, which is correctly
> optimized by production compilers:
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/ubsan.h#L55-L69
>
> Regards
>
> Antoine.
Re: [QUESTION] How is mmap implemented for 8bit padded files?
Hi Jorge,

So there are two aspects to the answer:

- ideally, the C++ implementation also works on non-aligned data (though this is poorly tested, if at all)

- when mmap'ing a file, you should get a page-aligned address

As for int128 and int256, these usually don't exist at the hardware level anyway, so implementing those reads as a combination of 64-bit reads shouldn't hurt performance-wise.

More generally, I don't know about Rust, but in C++ unaligned access would be made UB-safe by using the memcpy trick, which is correctly optimized by production compilers:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/ubsan.h#L55-L69

Regards

Antoine.

On 01/08/2022 at 18:55, Jorge Cardoso Leitão wrote:
> Hi,
>
> I am trying to follow the C++ implementation with respect to mmap'ing
> IPC files and reading them zero-copy, in the context of reproducing it
> in Rust.
>
> My understanding from reading the source code is that we essentially:
> * identify the memory regions (offset and length) of each of the
>   buffers, via IPC's flatbuffer "Node".
> * cast the uint8 pointer to the corresponding type based on the
>   datatype (e.g. f32 for float32)
>
> I am struggling to understand how we ensure that the pointer is aligned
> [2,3] to the type (e.g. f32) so that the uint8 pointer can be safely
> cast to it.
>
> In other words, I would expect mmap to work when:
> * the file's padding is 64 bits
> * the target type is <= 64 bits
>
> However,
> * we have types with more than 64 bits (int128 and int256)
> * a file can be 8-bit aligned
>
> The background is that Rust requires pointers to be aligned to the type
> for safe casting (it is UB to read through unaligned pointers), and the
> above naturally poses a challenge when reading i128, i256 and 8-bit
> padded files.
>
> Best,
> Jorge
>
> [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc
> [2] https://en.wikipedia.org/wiki/Data_structure_alignment
> [3] https://stackoverflow.com/a/4322950/931303
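For readers on the Rust side, a rough counterpart of the memcpy trick mentioned above is to copy the bytes into a local array and decode from there, so that no misaligned typed pointer is ever dereferenced. The sketch below is not the code behind the ubsan.h link; the names `read_f32_le` and `read_i128_le` are hypothetical, and little-endian data is assumed. The second function illustrates the idea of building a 128-bit value from two 64-bit reads.

```rust
/// Reads a little-endian f32 starting at byte `offset`, with no alignment
/// requirement (hypothetical helper). Returns None if out of bounds.
fn read_f32_le(bytes: &[u8], offset: usize) -> Option<f32> {
    let end = offset.checked_add(4)?;
    let mut chunk = [0u8; 4];
    // Byte-wise copy into an aligned local: the analogue of memcpy.
    chunk.copy_from_slice(bytes.get(offset..end)?);
    Some(f32::from_le_bytes(chunk))
}

/// Reads a little-endian i128 as the combination of two 64-bit halves
/// (hypothetical helper). Returns None if out of bounds.
fn read_i128_le(bytes: &[u8], offset: usize) -> Option<i128> {
    let end = offset.checked_add(16)?;
    let chunk = bytes.get(offset..end)?;
    let mut lo = [0u8; 8];
    let mut hi = [0u8; 8];
    lo.copy_from_slice(&chunk[..8]);
    hi.copy_from_slice(&chunk[8..]);
    // Low half holds the least significant bits in little-endian layout.
    let lo = u64::from_le_bytes(lo) as u128;
    let hi = u64::from_le_bytes(hi) as u128;
    Some(((hi << 64) | lo) as i128)
}
```

Whether a given compiler lowers these copies to plain loads depends on the target and optimization level, so this is best read as an illustration of the approach rather than a performance claim.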
[QUESTION] How is mmap implemented for 8bit padded files?
Hi,

I am trying to follow the C++ implementation with respect to mmap'ing IPC files and reading them zero-copy, in the context of reproducing it in Rust.

My understanding from reading the source code is that we essentially:
* identify the memory regions (offset and length) of each of the buffers, via IPC's flatbuffer "Node".
* cast the uint8 pointer to the corresponding type based on the datatype (e.g. f32 for float32)

I am struggling to understand how we ensure that the pointer is aligned [2,3] to the type (e.g. f32) so that the uint8 pointer can be safely cast to it.

In other words, I would expect mmap to work when:
* the file's padding is 64 bits
* the target type is <= 64 bits

However,
* we have types with more than 64 bits (int128 and int256)
* a file can be 8-bit aligned

The background is that Rust requires pointers to be aligned to the type for safe casting (it is UB to read through unaligned pointers), and the above naturally poses a challenge when reading i128, i256 and 8-bit padded files.

Best,
Jorge

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc
[2] https://en.wikipedia.org/wiki/Data_structure_alignment
[3] https://stackoverflow.com/a/4322950/931303
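The alignment concern in this question comes down to simple address arithmetic, which the following minimal sketch spells out. The helper `is_aligned` is hypothetical, and the example values in the usage note (a base address of 4096 standing in for a page-aligned mapping, and 8- or 16-byte alignments) are assumptions chosen for illustration.

```rust
/// Returns true when the address `base + offset` is a multiple of `align`.
/// `align` must be a power of two, as all Rust type alignments are.
/// Hypothetical helper for illustration only.
fn is_aligned(base: usize, offset: usize, align: usize) -> bool {
    debug_assert!(align.is_power_of_two());
    (base.wrapping_add(offset) & (align - 1)) == 0
}
```

With a base of 4096, a buffer starting at offset 8 satisfies `is_aligned(4096, 8, 8)` but not `is_aligned(4096, 8, 16)`, and a buffer at offset 1 fails even a 4-byte requirement. This is why 64-bit padding is enough for types up to 64 bits but gives no guarantee for wider types or for 8-bit padded files.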