[ https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992058#comment-16992058 ]
Bogdan Klichuk commented on ARROW-7305:
---------------------------------------

Seems like it's the transformation of the pandas DataFrame to pyarrow.Table. If you convert manually
{code:java}
table = pyarrow.Table.from_pandas(data)
{code}
you'll see it's the step causing the memory spike; writing this table to parquet looks light.

> [Python] High memory usage writing pyarrow.Table with large strings to parquet
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-7305
>                 URL: https://issues.apache.org/jira/browse/ARROW-7305
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: Mac OSX
>            Reporter: Bogdan Klichuk
>            Priority: Major
>              Labels: parquet
>
> My use case involves datasets with large strings (1-100 MB each).
> Let's take for example a single row.
> 43mb.csv is a 1-row CSV with 10 columns. One column is a 43 MB string.
> When I read this CSV with pandas and then dump it to parquet, my script consumes 10x of the 43 MB.
> With an increasing number of such rows the memory footprint overhead diminishes, but I want to focus on this specific case.
> Here's the footprint after running under memory_profiler:
> {code:java}
> Line #    Mem usage    Increment   Line Contents
> ================================================
>      4     48.9 MiB     48.9 MiB   @profile
>      5                             def test():
>      6    143.7 MiB     94.7 MiB       data = pd.read_csv('43mb.csv')
>      7    498.6 MiB    354.9 MiB       data.to_parquet('out.parquet')
> {code}
> Is this typical for parquet in case of big strings?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
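A minimal sketch of the diagnostic described in the comment above: it splits the pandas-to-Arrow conversion from the Parquet write so memory_profiler reports each step's increment separately. The file name '43mb.csv' and output path are the ones used in the issue description; everything else is an assumption, not code from the reporter.
{code:java}
# Hypothetical reproduction script (not from the issue): measure the
# pandas -> pyarrow.Table conversion and the Parquet write as separate steps.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import profile

@profile
def test():
    data = pd.read_csv('43mb.csv')
    # The comment above observes the memory spike in this conversion step.
    table = pa.Table.from_pandas(data)
    # Writing the resulting Table to Parquet is comparatively light.
    pq.write_table(table, 'out.parquet')

if __name__ == '__main__':
    test()
{code}
Running this with {{python -m memory_profiler script.py}} should show the per-line increments for the conversion and the write separately, rather than the single {{to_parquet}} line shown in the profile above.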