[ https://issues.apache.org/jira/browse/ARROW-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360051#comment-17360051 ]

David Li commented on ARROW-13014:
----------------------------------

In that case yes, I think it's a duplicate. Observe:
{code:python}
>>> import pyarrow as pa
>>> import pandas as pd
>>> pa.__version__
'4.0.1'
>>> pd.__version__
'1.2.4'
>>> pd.DataFrame({"str": ["a" * 16384] * (2**17 - 1)}).reset_index().to_feather('/tmp/foo.feather')
# Finishes successfully
>>> pd.DataFrame({"str": ["a" * 16384] * (2**17 + 1)}).reset_index().to_feather('/tmp/foo.feather')
# Hangs while memory usage shoots up
{code}
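
The two row counts above appear to be chosen to straddle 2**31 bytes of string data in a single column. Arrow's default string type uses signed 32-bit offsets, so one contiguous string array can address at most 2**31 - 1 bytes. That this limit is what triggers the hang is an assumption on my part, not something the session above proves. A minimal sketch of the arithmetic (the helper names are illustrative only, not pyarrow API):

```python
# Arrow's default string type stores offsets as signed 32-bit integers,
# so a single contiguous string array can address at most 2**31 - 1 bytes.
INT32_MAX = 2**31 - 1
STRING_LENGTH = 16384  # bytes per value; "a" * 16384 is pure ASCII


def total_string_bytes(num_rows, string_length=STRING_LENGTH):
    """Total bytes of string data in a column of identical-length values."""
    return num_rows * string_length


def fits_in_one_array(num_rows):
    """True if the column's string data fits under the 32-bit offset limit."""
    return total_string_bytes(num_rows) <= INT32_MAX


# The two row counts used in the reproduction above.
for num_rows in (2**17 - 1, 2**17 + 1):
    print(num_rows, total_string_bytes(num_rows), fits_in_one_array(num_rows))
```

The first case stays 16384 bytes under 2**31 and the second goes 16384 bytes over it. The two object columns in the original report (2882448709 and 5813798687 bytes as measured by the reporter) are also above 2**31, though the way pandas measures object columns may not match the size of the resulting Arrow buffers.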

> Pandas to_feather no longer works - runs out of memory
> ------------------------------------------------------
>
>                 Key: ARROW-13014
>                 URL: https://issues.apache.org/jira/browse/ARROW-13014
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 4.0.0, 4.0.1
>         Environment: Linux
>            Reporter: Roland Swingler
>            Priority: Major
>
> Since upgrading to 4.0.1, writing feather files with the pandas to_feather 
> method uses far, far more memory.
> For reference, I have a dataframe that is around 10 GB in size, 25 million 
> rows. Writing a feather file took around 3-4 GB of memory in pyarrow versions 
> up to 3.0.0. As of 4.0.1 I don't know how much memory it will take to 
> successfully write - I tried running on a 120 GB AWS machine, and that wasn't 
> sufficient.
> I can't provide the dataframe, but I can give an outline of the types / sizes 
> of the columns:
> size (bytes),type
> 206663144,int64
> 206663144,int64
> 206663144,float64
> 206663144,float64
> 2882448709,object
> 5813798687,object
> 206663144,float64
> 206663144,int64
> 206663144,int64
> 206663144,int64
> 206663144,int64
> 206663144,float64



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
