[jira] [Updated] (ARROW-432) [Python] Avoid unnecessary memory copy in to_pandas conversion by using low-level pandas internals APIs

Rok Mihevc (Jira) Tue, 10 Jan 2023 23:11:11 -0800


     [ 
https://issues.apache.org/jira/browse/ARROW-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rok Mihevc updated ARROW-432:
-----------------------------
    External issue URL: https://github.com/apache/arrow/issues/16079

> [Python] Avoid unnecessary memory copy in to_pandas conversion by using 
> low-level pandas internals APIs
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-432
>                 URL: https://issues.apache.org/jira/browse/ARROW-432
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>            Priority: Major
>             Fix For: 0.2.0
>
>
> I'll take this one on. 
> While we're efficiently constructing individual NumPy arrays for pandas, even 
> in the zero-copy case pandas.DataFrame will perform an extra memory copy and 
> consolidation step internally at the end. 
> This is particular to the pandas 0.x/1.x memory layout, and will change in 
> the future with pandas 2.0, but that's quite a ways off from wide use. 
> We can avoid this overhead for now by:
> * computing the exact internal "block" structure of the DataFrame. Since we 
> know the null counts of the Arrow data, we can determine up front whether 
> type casts to accommodate nulls are necessary
> * pre-allocating empty column-major blocks
> * writing the converted data out into the block slices
> * constructing the DataFrame from the blocks with zero copy
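
The first step in the list above (working out the block structure from the column types and null counts) can be sketched in plain Python. This is only an illustration and not the code that went into Arrow: the names `result_dtype` and `plan_blocks`, the tuple-based column description, and the returned dictionary layout are all hypothetical. The promotion rules assumed here are that integer columns containing nulls become float64 and boolean columns containing nulls become object.

```python
def result_dtype(dtype, null_count):
    """Return the dtype name a column needs once its nulls are accounted for.

    Assumed promotion rules (stated in the lead-in, not taken from the issue):
    integers with nulls -> float64, booleans with nulls -> object.
    """
    if null_count == 0:
        return dtype
    if dtype.startswith(("int", "uint")):
        return "float64"
    if dtype == "bool":
        return "object"
    return dtype


def plan_blocks(columns, num_rows):
    """Group columns into one block per result dtype.

    `columns` is a list of (name, dtype, null_count) tuples (hypothetical
    input format). Each block records the column positions it holds and the
    2-D shape to pre-allocate: (columns in block, rows), so that every
    column occupies one contiguous row of the block.
    """
    blocks = {}
    for position, (_name, dtype, null_count) in enumerate(columns):
        blocks.setdefault(result_dtype(dtype, null_count), []).append(position)
    return {
        dtype: {"placement": positions, "shape": (len(positions), num_rows)}
        for dtype, positions in blocks.items()
    }


# Example: four columns, ten rows; column "b" has nulls and joins the float block.
columns = [
    ("a", "int64", 0),
    ("b", "int64", 3),
    ("c", "float64", 1),
    ("d", "bool", 0),
]
plan = plan_blocks(columns, num_rows=10)
```

With a plan like this in hand, each block could be allocated once at its final shape and every column written directly into its slot, so no later cast or consolidation pass is needed.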



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
