[ https://issues.apache.org/jira/browse/ARROW-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360051#comment-17360051 ]
David Li commented on ARROW-13014:
----------------------------------

In that case yes, I think it's a duplicate. Observe:

{code:python}
>>> import pyarrow as pa
>>> import pandas as pd
>>> pa.__version__
'4.0.1'
>>> pd.__version__
'1.2.4'
>>> pd.DataFrame({"str": ["a" * 16384] * (2**17 - 1)}).reset_index().to_feather('/tmp/foo.feather')  # Finishes successfully
>>> pd.DataFrame({"str": ["a" * 16384] * (2**17 + 1)}).reset_index().to_feather('/tmp/foo.feather')  # Hangs while memory usage shoots up
{code}

> Pandas to_feather no longer works - runs out of memory
> ------------------------------------------------------
>
>                 Key: ARROW-13014
>                 URL: https://issues.apache.org/jira/browse/ARROW-13014
>             Project: Apache Arrow
>          Issue Type: Bug
>       Components: Python
> Affects Versions: 4.0.0, 4.0.1
>     Environment: Linux
>        Reporter: Roland Swingler
>        Priority: Major
>
> Since upgrading to 4.0.1, writing feather files with the pandas to_feather
> method uses far, far more memory.
> For reference, I have a dataframe that is around 10 GB in size, 25 million
> rows. Writing a feather file took around 3-4 GB of memory in pyarrow versions
> up to 3.0.0. As of 4.0.1 I don't know how much memory it will take to
> successfully write - I tried running on a 120 GB AWS machine, and that wasn't
> sufficient.
> I can't provide the dataframe, but I can give an outline of the types / sizes
> of the columns:
> size (bytes),type
> 206663144,int64
> 206663144,int64
> 206663144,float64
> 206663144,float64
> 2882448709,object
> 5813798687,object
> 206663144,float64
> 206663144,int64
> 206663144,int64
> 206663144,int64
> 206663144,int64
> 206663144,float64

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
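As a side note on why the repro above flips at exactly 2**17 rows (my arithmetic, not from the thread): at 16384 bytes per string, 2**17 strings total exactly 2**31 bytes, which is the 2 GiB ceiling of a string column with 32-bit offsets. Above that point the column can no longer live in a single chunk, which appears to be the condition that triggers the pathological write. A quick sanity check:

```python
# Back-of-the-envelope check of the 2 GiB threshold in the repro above.
# These constants come from the repro itself; the 2**31 ceiling is the
# maximum byte offset representable by Arrow's 32-bit string offsets.
CHARS_PER_STRING = 16384        # length of "a" * 16384
OFFSET_LIMIT_BYTES = 2**31      # 2 GiB

# Number of such strings that fit in one 32-bit-offset chunk.
rows_at_limit = OFFSET_LIMIT_BYTES // CHARS_PER_STRING
print(rows_at_limit)                                          # 131072 == 2**17

# One row under the limit fits; one row over it does not.
print((2**17 - 1) * CHARS_PER_STRING < OFFSET_LIMIT_BYTES)    # True
print((2**17 + 1) * CHARS_PER_STRING > OFFSET_LIMIT_BYTES)    # True
```

This matches the reporter's column sizes as well: the two `object` columns (about 2.9 GB and 5.8 GB of string data) are the ones that exceed the single-chunk limit.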