[ https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000893#comment-17000893 ]
Bogdan Klichuk commented on ARROW-7305:
---------------------------------------

I have tried this in an Ubuntu Docker container, and the results for 0.14.1 vs 0.15.1 are pretty interesting.

0.14.1:

{code:java}
Line #    Mem usage    Increment   Line Contents
================================================
     4     50.5 MiB     50.5 MiB   @profile
     5                             def do():
     6     99.9 MiB     49.4 MiB       df = pd.read_csv('50mb.csv')
     7    112.1 MiB     12.1 MiB       df.to_parquet('test.parquet')
{code}

0.15.1:

{code:java}
Line #    Mem usage    Increment   Line Contents
================================================
     4     50.5 MiB     50.5 MiB   @profile
     5                             def do():
     6    100.0 MiB     49.4 MiB       df = pd.read_csv('50mb.csv')
     7    401.4 MiB    301.4 MiB       df.to_parquet('test.parquet')
{code}

Besides confirming that 0.14.1 does indeed behave better on non-Mac systems as well, this also shows that 0.15.1 requires much more memory to write the file.

> [Python] High memory usage writing pyarrow.Table with large strings to parquet
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-7305
>                 URL: https://issues.apache.org/jira/browse/ARROW-7305
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: Mac OSX
>            Reporter: Bogdan Klichuk
>            Priority: Major
>              Labels: parquet
>         Attachments: 50mb.csv.gz
>
>
> My case is specific: the stored datasets contain large strings (1-100 MB each).
> Let's take a single row as an example. 43mb.csv is a 1-row CSV with 10 columns; one column is a 43 MB string.
> When I read this CSV with pandas and then dump it to parquet, my script consumes 10x the 43 MB.
> As the number of such rows grows, the relative memory overhead diminishes, but I want to focus on this specific case.
> Here's the footprint after running under memory profiler:
> {code:java}
> Line #    Mem usage    Increment   Line Contents
> ================================================
>      4     48.9 MiB     48.9 MiB   @profile
>      5                             def test():
>      6    143.7 MiB     94.7 MiB       data = pd.read_csv('43mb.csv')
>      7    498.6 MiB    354.9 MiB       data.to_parquet('out.parquet')
> {code}
> Is this typical for parquet in the case of big strings?
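For anyone wanting to reproduce the traces above, here is a minimal sketch of the profiling harness. The script itself is not attached to the ticket, so the explicit {{engine='pyarrow'}} argument (to rule out fastparquet) and the {{\_\_main\_\_}} guard are assumptions, not the reporter's exact code; the file names come from the profiler output shown in the comment.

{code:python}
# repro.py - minimal sketch of the profiling harness (assumed, not the
# reporter's exact script).
# Requires: pip install pandas pyarrow memory_profiler
# and the attached CSV extracted into the working directory.
import pandas as pd
from memory_profiler import profile


@profile
def do():
    # Read a CSV whose cells include very large strings.
    df = pd.read_csv('50mb.csv')
    # Force the pyarrow engine so the measurement targets pyarrow, not
    # fastparquet (assumption: the reporter relied on the default engine).
    df.to_parquet('test.parquet', engine='pyarrow')


if __name__ == '__main__':
    do()
{code}

Running {{python repro.py}} prints the line-by-line memory table shown above; pinning {{pyarrow==0.14.1}} versus {{pyarrow==0.15.1}} before each run should reproduce the two traces.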