Re: [QUESTION] How is mmap implemented for 8bit padded files?

2022-08-03 Thread Antoine Pitrou



On 03/08/2022 at 18:29, Jorge Cardoso Leitão wrote:

> Hi Antoine,
>
> Thanks a lot for your answer.
>
> So, if I understand correctly (I may not), we do not impose restrictions on
> the alignment of the data when we get the pointer; only when we read from it.
> Doesn't this require checking for alignment at runtime?


Only if you do things that are alignment-sensitive.

That said, while unaligned data is formally allowed AFAIK, it probably occurs
rarely, so potential issues (if any) have probably not surfaced.
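
To illustrate what such a runtime check could look like, here is a minimal C++ sketch. The helper name `is_aligned` is hypothetical and is not taken from the Arrow code base; it also assumes that converting a pointer to `std::uintptr_t` yields the numeric address, which holds on mainstream platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of Arrow): returns true if the address held by
// `ptr` is a multiple of `alignment` bytes. `alignment` must be a power of two.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}
```

A reader would only need to call something like `is_aligned(data, alignof(float))` once per buffer, and only before an alignment-sensitive operation such as a typed pointer cast.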


Best regards

Antoine.











Re: [QUESTION] How is mmap implemented for 8bit padded files?

2022-08-03 Thread Jorge Cardoso Leitão
Hi Antoine,

Thanks a lot for your answer.

So, if I understand correctly (I may not), we do not impose restrictions on
the alignment of the data when we get the pointer; only when we read from it.
Doesn't this require checking for alignment at runtime?

Best,
Jorge





Re: [QUESTION] How is mmap implemented for 8bit padded files?

2022-08-02 Thread Antoine Pitrou



Hi Jorge,

So there are two aspects to the answer:

- ideally, the C++ implementation also works on non-aligned data (though
this is poorly tested, if at all)


- when mmap'ing a file, you should get a page-aligned address
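
One consequence of the second point: if the mapping base is page-aligned, the alignment of a buffer inside the mapping depends only on its offset in the file. A minimal C++ sketch of that arithmetic follows. The function name `offset_aligned_for` and the default page size of 4096 bytes are assumptions made for illustration, not values taken from Arrow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: given a page-aligned mapping base, tells whether a
// buffer starting `offset` bytes into the mapping is suitably aligned for T.
// The 4096-byte default page size is an assumption; query the OS in real code.
template <typename T>
bool offset_aligned_for(std::uint64_t offset, std::uint64_t page_size = 4096) {
  // The reasoning only holds if the page size is a multiple of alignof(T).
  assert(page_size % alignof(T) == 0);
  return offset % alignof(T) == 0;
}
```

For example, `offset_aligned_for<float>(64)` is true, while an offset of 1 would only be suitable for single-byte types.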

As for int128 and int256, these usually don't exist at the hardware 
level anyway, so implementing those reads as a combination of 64-bit 
reads shouldn't hurt performance-wise.
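
As a rough sketch of that idea in C++: the struct `Int128Parts` and the function `load_int128` below are hypothetical names, and the layout (low 64 bits first, matching a little-endian host) is an assumption of this example rather than something stated in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical representation of a 128-bit integer as two 64-bit halves.
// Assumes the low half is stored first and host byte order matches the data.
struct Int128Parts {
  std::uint64_t low;
  std::int64_t high;
};

// Reads 16 bytes from `src` as two independent 64-bit loads. Using memcpy
// keeps each load well-defined even if `src` is not 8-byte aligned.
inline Int128Parts load_int128(const std::uint8_t* src) {
  assert(src != nullptr);
  Int128Parts out;
  std::memcpy(&out.low, src, sizeof(out.low));
  std::memcpy(&out.high, src + sizeof(out.low), sizeof(out.high));
  return out;
}
```

Since the value is assembled from 64-bit pieces, nothing here needs the source address to be 16-byte aligned.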


More generally, I don't know about Rust but in C++ unaligned access 
would be made UB-safe by using the memcpy trick, which is correctly 
optimized by production compilers:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/ubsan.h#L55-L69
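
For readers who do not want to follow the link, the general shape of the memcpy trick is sketched below. This is an illustrative version written for this thread, not a copy of the linked Arrow code; the name `load_unaligned` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: reads a T from a byte pointer that may not satisfy
// alignof(T). Copying the bytes into a properly aligned local avoids the
// undefined behaviour of dereferencing a misaligned T*.
template <typename T>
T load_unaligned(const std::uint8_t* src) {
  static_assert(std::is_trivially_copyable<T>::value,
                "memcpy-based loads require a trivially copyable type");
  assert(src != nullptr);
  T value;
  std::memcpy(&value, src, sizeof(T));
  return value;
}
```

A call such as `load_unaligned<float>(data + 1)` is then valid for any byte offset, as long as `sizeof(float)` bytes are readable at that address.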

Regards

Antoine.





[QUESTION] How is mmap implemented for 8bit padded files?

2022-08-01 Thread Jorge Cardoso Leitão
Hi,

I am trying to follow the C++ implementation with respect to mmap IPC files
and reading them zero-copy, in the context of reproducing it in Rust.

My understanding from reading the source code is that we essentially:
* identify the memory regions (offset and length) of each of the buffers,
via IPC's flatbuffer "Node".
* cast the uint8 pointer to the corresponding type based on the datatype
(e.g. f32 for float32)

I am struggling to understand how we ensure that the pointer is aligned
[2,3] to the type (e.g. f32) so that the uint8 pointer can be safely cast
to it.

In other words, I would expect mmap to work when:
* the files' bit padding is 64 bits
* the target type is <= 64 bits
However,
* we have types with more than 64 bits (int128 and int256)
* a file can be 8-bit aligned

The background is that Rust requires pointers to be aligned to the type for
safe casting (it is UB to read through an unaligned pointer), and the above
naturally poses a challenge when reading i128, i256 and 8-bit padded files.

Best,
Jorge

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc
[2] https://en.wikipedia.org/wiki/Data_structure_alignment
[3] https://stackoverflow.com/a/4322950/931303