[ 
https://issues.apache.org/jira/browse/ARROW-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimitris Lekkas updated ARROW-5069:
-----------------------------------
    Description: 
I consider the option of memory-mapping columns to shared memory to be
valuable. Such an option would be triggered when specific metadata is
supplied. Since many Arrow-backed data frames are used for machine learning,
we could benefit from treating differently the data (most likely the data
buffers of columns) that will be fed into GPUs/FPGAs. To enable such a
change we would need to address the following issues:

First, we need each column to hold an integer value representing its
associated file descriptor. The application developer could retrieve the
file name from the file descriptor (e.g. on Linux by resolving
/proc/self/fd/<fd>; fstat alone returns file metadata such as the inode, not
a path) and inform another application to reference that file, or inform an
FPGA to DMA that memory area.
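
As a rough illustration of the descriptor-to-name step, here is a minimal
Python sketch. It assumes Linux with /proc mounted; the helper name
path_from_fd is hypothetical and is not part of Arrow.

```python
import os


def path_from_fd(fd: int) -> str:
    """Return the path behind an open file descriptor.

    Assumption: Linux with /proc mounted. os.fstat(fd) would give the
    inode, size and device, but not a file name, so the /proc symlink
    is resolved instead.
    """
    return os.readlink(f"/proc/self/fd/{fd}")
```

Another process could then open the returned path, provided the file still
exists under that name (an unlinked file shows up with a " (deleted)"
suffix).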

We also need to support variable buffer alignment (restricted to powers of
2, of course) when calling arrow::AllocateBuffer(). In the current
implementation the alignment is fixed at 64 bytes, and changing that value
requires a recompilation [1].

To justify the above suggestion, major FPGA vendors (e.g. Xilinx) benefit
heavily from page-aligned buffers since their device memory is managed in
4KB pages [2]. In particular, Xilinx warns users who attempt to memcpy a
non-page-aligned buffer from CPU memory to the FPGA's memory [3].
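
To make the variable-alignment request concrete, the following Python sketch
over-allocates a buffer and returns a view that starts at an address aligned
to a caller-chosen power of two. The names allocate_aligned and address_of
are hypothetical; this only illustrates the idea and is not how
arrow::AllocateBuffer() is implemented.

```python
import ctypes


def address_of(view) -> int:
    """Return the memory address of the first byte of a writable buffer."""
    return ctypes.addressof(ctypes.c_char.from_buffer(view))


def allocate_aligned(nbytes: int, alignment: int = 64) -> memoryview:
    """Return a writable view of nbytes whose start address is aligned.

    alignment must be a power of two; the default of 64 mirrors the
    fixed value mentioned above.
    """
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    # Over-allocate so that an aligned start address always fits inside.
    raw = bytearray(nbytes + alignment)
    base = address_of(raw)
    # Bytes to skip to reach the next multiple of alignment.
    offset = (-base) & (alignment - 1)
    # The memoryview keeps the underlying bytearray alive.
    return memoryview(raw)[offset:offset + nbytes]
```

With this, allocate_aligned(n, 4096) would give a page-aligned buffer for
the 4KB case, while the default keeps the current 64-byte behaviour.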

Wouldn't it be nice if we could call from_pandas() and then have our columns
memory-mapped to shared memory, so that FPGAs could DMA that memory and
accelerate the workload? If there is already a workaround that achieves
this, I would like more information on it.
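
For reference, the general idea of a column living in named shared memory
can be sketched with the Python standard library alone. This is not an Arrow
API: publish_column and read_column are hypothetical helpers, the column is
assumed to hold int64 values, and the data is copied once into the shared
segment rather than being mapped in place.

```python
import array
from multiprocessing import shared_memory

ITEMSIZE = array.array("q").itemsize  # size in bytes of one int64 value


def publish_column(values):
    """Copy int64 values into a new shared-memory segment.

    Returns the segment and the number of values; the caller must
    close() and unlink() the segment when done.
    """
    data = array.array("q", values)
    nbytes = len(data) * ITEMSIZE
    shm = shared_memory.SharedMemory(create=True, size=max(nbytes, 1))
    shm.buf[:nbytes] = data.tobytes()
    return shm, len(data)


def read_column(name, length):
    """Attach to a segment by name and return its int64 values as a list."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        view = shm.buf[:length * ITEMSIZE].cast("q")
        try:
            return list(view)
        finally:
            # Derived views must be released before the segment is closed.
            view.release()
    finally:
        shm.close()
```

A consumer only needs the segment name and the value count, which plays the
role of the per-column descriptor discussed above.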

I am open to discussing any suggestions, improvements or concerns.

 

[1]: [https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L40]

[2]: [https://forums.xilinx.com/t5/SDAccel/memory-alignment-when-allocating-emmory-in-SDAccel/td-p/887593]

[3]: [https://forums.aws.amazon.com/thread.jspa?messageID=884615&tstart=0]

  was:
I consider the option of memory-mapping columns to shared memory to be
valuable. Such an option would be triggered when specific metadata is
supplied. Since many Arrow-backed data frames are used for machine learning,
we could benefit from treating differently the data (most likely the data
buffers of columns) that will be fed into GPUs/FPGAs. To enable such a
change we would need to address the following issues:

First, we need each column to hold an integer value representing its
associated file descriptor. The application developer could retrieve the
file name from the file descriptor (e.g. on Linux by resolving
/proc/self/fd/<fd>; fstat alone returns file metadata such as the inode, not
a path) and inform another application to reference that file, or inform an
FPGA to DMA that memory area.

We also need to support variable buffer alignment (restricted to powers of
2, of course) when calling arrow::AllocateBuffer(). In the current
implementation the alignment is fixed at 64 bytes, and changing that value
requires a recompilation [1].

To justify the above suggestion, major FPGA vendors (e.g. Xilinx) benefit
heavily from page-aligned buffers since their device memory is managed in
4KB pages [2]. In particular, Xilinx warns users who attempt to memcpy a
non-page-aligned buffer from CPU memory to the FPGA's memory [3].

I am open to discussing any suggestions, improvements or concerns.

 

[1]: [https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L40]

[2]: [https://forums.xilinx.com/t5/SDAccel/memory-alignment-when-allocating-emmory-in-SDAccel/td-p/887593]

[3]: [https://forums.aws.amazon.com/thread.jspa?messageID=884615&tstart=0]


> Implement direct support for shared memory arrow columns
> --------------------------------------------------------
>
>                 Key: ARROW-5069
>                 URL: https://issues.apache.org/jira/browse/ARROW-5069
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>         Environment: Linux
>            Reporter: Dimitris Lekkas
>            Priority: Major
>              Labels: perfomance, proposal
>             Fix For: 0.14.0
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
