Hi Paul,
I think you must be doing something wrong in your own code; NeoCSVReader's
#do: is implemented as streaming over the records one by one, never holding
more than one in memory.
This is what I tried:
'paul.csv' asFileReference writeStreamDo: [ :file |
  ZnBufferedWriteStream on: file do: [ :out |
    (NeoCSVWriter on: out) in: [ :writer |
      writer writeHeader: { #Number. #Color. #Integer. #Boolean }.
      1 to: 1e7 do: [ :each |
        writer nextPut: { each. #(Red Green Blue) atRandom. 1e6 atRandom.
          #(true false) atRandom } ] ] ] ].
This results in a file of over 300 MB:
$ ls -lah paul.csv
-rw-r--r--@ 1 sven staff 327M Nov 14 20:45 paul.csv
$ wc paul.csv
10000001 10000001 342781577 paul.csv
This is a selective read and collect (it loads only about 10K of the 10M
records into memory):
Array streamContents: [ :out |
  'paul.csv' asFileReference readStreamDo: [ :in |
    (NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
      reader skipHeader; addIntegerField; addSymbolField; addIntegerField;
        addFieldConverter: [ :x | x = #true ].
      reader do: [ :each |
        each third < 1000 ifTrue: [ out nextPut: each ] ] ] ] ].
This worked fine on my MacBook Air, with no memory problems. It takes a while
to parse that much data, of course.
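If you do want to process records in explicit batches, I am not aware of a
dedicated batching API in NeoCSVReader, but you can build one with #next and
#atEnd. A sketch (the 1000-record batch size and the #processBatch: selector
are placeholders for your own code):

```smalltalk
"Batched reading sketch: accumulate up to 1000 records, hand them off,
and repeat until the reader is exhausted. #processBatch: is hypothetical;
substitute whatever per-batch processing you need."
'paul.csv' asFileReference readStreamDo: [ :in |
  (NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
    | batch |
    reader skipHeader.
    batch := OrderedCollection new: 1000.
    [ reader atEnd ] whileFalse: [
      batch add: reader next.
      batch size = 1000 ifTrue: [
        self processBatch: batch.
        batch := OrderedCollection new: 1000 ] ].
    batch ifNotEmpty: [ self processBatch: batch ] ] ]
```

Each batch can then be processed and discarded, so memory use stays bounded
by the batch size rather than by the file size.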
Sven
> On 14 Nov 2014, at 19:08, Paul DeBruicker <[email protected]> wrote:
>
> Hi -
>
> I'm processing 9 GB of CSV files (the biggest file is 220 MB or so). I'm
> not sure if it's because of the size of the files or the code I've written to
> keep track of the domain objects I'm interested in, but I'm getting out of
> memory errors & crashes in Pharo 3 on Mac with the latest VM. I haven't
> checked other vms.
>
> I'm going to profile my own code and attempt to split the files manually for
> now to see what else it could be.
>
>
> Right now I'm doing something similar to
>
> | file reader |
> file := '/path/to/file/myfile.csv' asFileReference readStream.
> reader := NeoCSVReader on: file.
>
> reader
> recordClass: MyClass;
> skipHeader;
> addField: #myField:;
> ....
>
>
> reader do:[:eachRecord | self seeIfRecordIsInterestingAndIfSoKeepIt:
> eachRecord].
> file close.
>
>
>
> Is there a facility in NeoCSVReader to read a file in batches (e.g. 1000
> lines at a time), or an easy way to do that?
>
>
>
>
> Thanks
>
> Paul