[ 
https://issues.apache.org/jira/browse/ARROW-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291207#comment-16291207
 ] 

Wes McKinney commented on ARROW-1925:
-------------------------------------

In many cases we avoid copies when converting from Arrow memory (which may 
have been materialized from Parquet) to NumPy, for example for numerical data 
without nulls. When the data contains nulls or is non-numeric, copies are 
generally required: NumPy is not aware of Arrow's memory layout for null 
values or strings, for example, so we have to represent the data in some 
NumPy-compatible way.

> Wrapping PyArrow Table with Numpy without copy
> ----------------------------------------------
>
>                 Key: ARROW-1925
>                 URL: https://issues.apache.org/jira/browse/ARROW-1925
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>    Affects Versions: 0.7.1
>            Reporter: Young-Jun Ko
>            Priority: Minor
>
> The scenario is the following:
> I have a parquet file, which has a column containing a float array of 
> constant size.
> So it can be thought of as a matrix.
> When I read the parquet file, the way I currently access it is to convert it 
> to pandas and extract the values, giving me a list of np.array, and then do 
> np.vstack to get the matrix.
> This involves a copy that would be nice to avoid.
> When a parquet file (or more generally a parquet dataset) is read, would the 
> values of the array column be contiguous in memory, so that a view on the 
> data could be created without having to copy? That would be neat.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
