Hi Paul,
I think you must be doing something wrong in your own code; NeoCSVReader's
#do: is implemented as streaming over the records one by one, never holding
more than one in memory.
This is what I tried:
'paul.csv' asFileReference writeStreamDo: [ :file |
  ZnBufferedWriteStream on: file do: [ :out |
    (NeoCSVWriter on: out) in: [ :writer |
      writer writeHeader: { #Number. #Color. #Integer. #Boolean }.
      1 to: 1e7 do: [ :each |
        writer nextPut: { each. #(Red Green Blue) atRandom. 1e6 atRandom.
          #(true false) atRandom } ] ] ] ].
This results in a file of over 300 MB:
$ ls -lah paul.csv
-rw-r--r--@ 1 sven staff 327M Nov 14 20:45 paul.csv
$ wc paul.csv
10000001 10000001 342781577 paul.csv
This is a selective read and collect (it loads only about 10K of the 10M
records into memory):
Array streamContents: [ :out |
  'paul.csv' asFileReference readStreamDo: [ :in |
    (NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
      reader skipHeader; addIntegerField; addSymbolField; addIntegerField;
        addFieldConverter: [ :x | x = #true ].
      reader do: [ :each |
        each third < 1000 ifTrue: [ out nextPut: each ] ] ] ] ].
This worked fine on my MacBook Air, with no memory problems. It takes a while
to parse that much data, of course.
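If you do want to process records in explicit batches, I am not aware of a
dedicated batching API in NeoCSVReader, but you can build one with #next and
#atEnd. A sketch (the 1000-record batch size and the #processBatch: selector
are placeholders for your own code):

```smalltalk
"Batched reading sketch: accumulate up to 1000 records, hand them off,
and repeat until the reader is exhausted. #processBatch: is hypothetical;
substitute whatever per-batch processing you need."
'paul.csv' asFileReference readStreamDo: [ :in |
  (NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
    | batch |
    reader skipHeader.
    batch := OrderedCollection new: 1000.
    [ reader atEnd ] whileFalse: [
      batch add: reader next.
      batch size = 1000 ifTrue: [
        self processBatch: batch.
        batch := OrderedCollection new: 1000 ] ].
    batch ifNotEmpty: [ self processBatch: batch ] ] ]
```

Each batch can then be processed and discarded, so memory use stays bounded
by the batch size rather than by the file size.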
Sven
> On 14 Nov 2014, at 19:08, Paul DeBruicker <[email protected]> wrote:
>
> Hi -
>
> I'm processing 9 GB of CSV files (the biggest file is 220 MB or so). I'm
> not sure if it's because of the size of the files or the code I've written to
> keep track of the domain objects I'm interested in, but I'm getting out of
> memory errors & crashes in Pharo 3 on Mac with the latest VM. I haven't
> checked other vms.
>
> I'm going to profile my own code and attempt to split the files manually for
> now to see what else it could be.
>
>
> Right now I'm doing something similar to
>
> | file reader |
> file := '/path/to/file/myfile.csv' asFileReference readStream.
> reader := NeoCSVReader on: file.
>
> reader
> recordClass: MyClass;
> skipHeader;
> addField: #myField:;
> ....
>
>
> reader do:[:eachRecord | self seeIfRecordIsInterestingAndIfSoKeepIt:
> eachRecord].
> file close.
>
>
>
> Is there a facility in NeoCSVReader to read a file in batches (e.g. 1000
> lines at a time), or an easy way to do that?
>
>
>
>
> Thanks
>
> Paul