Bogdan Klichuk created ARROW-7305:
-------------------------------------

             Summary: High memory usage writing pyarrow.Table to parquet
                 Key: ARROW-7305
                 URL: https://issues.apache.org/jira/browse/ARROW-7305
             Project: Apache Arrow
          Issue Type: Task
          Components: Python
    Affects Versions: 0.15.1
         Environment: Mac OSX
            Reporter: Bogdan Klichuk
My dataset is specific in that it contains very large strings (1-100 MB each). Take a single row as an example: 43mb.csv is a 1-row CSV with 10 columns, one of which holds a 43 MB string. When I read this CSV with pandas and then dump it to parquet, my script consumes roughly 10x the 43 MB. The relative overhead diminishes as the number of such rows grows, but I want to focus on this specific case. Here is the footprint reported by memory_profiler:

{code:java}
Line #    Mem usage    Increment   Line Contents
================================================
     4     48.9 MiB     48.9 MiB   @profile
     5                             def test():
     6    143.7 MiB     94.7 MiB       data = pd.read_csv('43mb.csv')
     7    498.6 MiB    354.9 MiB       data.to_parquet('out.parquet')
{code}

Is this typical for parquet in the case of big strings?
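For reference, a minimal self-contained sketch (not from the original report) that reproduces the scenario without the CSV file and splits the pandas-to-Arrow conversion from the parquet write, so memory_profiler can attribute the growth to each step separately. The column names, string size, and output path are made up for illustration:

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import profile

@profile
def test():
    # Build a 1-row frame with one ~43 MB string column plus 9 small columns
    # (hypothetical stand-in for reading 43mb.csv with pd.read_csv).
    big = 'x' * (43 * 1024 * 1024)
    data = pd.DataFrame({'big': [big], **{f'c{i}': [i] for i in range(9)}})
    # Convert pandas -> Arrow explicitly instead of calling data.to_parquet(),
    # so the conversion and the parquet write show up on separate lines.
    table = pa.Table.from_pandas(data)
    pq.write_table(table, 'out.parquet')

if __name__ == '__main__':
    test()
{code}

Running this under {{python -m memory_profiler script.py}} should show per-line increments, making it easier to tell whether the extra memory is allocated during {{Table.from_pandas}} or during {{pq.write_table}}.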