Yes. With the image & VM I'm having trouble with, I get an array with 9,942
elements in it. So it works as you'd expect.
While processing the CSV file the image stays at about 60MB in RAM.
Sven Van Caekenberghe-2 wrote
> Can you successfully run my example code ?
>
>> On 14 Nov 2014, at 22:03, Paul DeBruicker <pdebruic@> wrote:
>>
>> Hi Sven,
>>
>> Thanks for taking a look and testing the NeoCSVReader portion for me.
>> You're right, of course, that there's something I'm doing that's slow.
>> But there is something I can't figure out yet.
>>
>> To provide a little more detail:
>>
>> When the 'csv reading' process completes successfully, profiling shows
>> that most of the time is spent in NeoCSVReader>>#peekChar and in the
>> conversion registered via NeoCSVReader>>#addField: that turns a string
>> into a DateAndTime. Dropping the DateAndTime conversion speeds things up
>> but doesn't stop it from running out of memory.
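>>
>> (For context, the conversion is registered roughly like this; #timestamp:
>> and the converter block are a sketch of the pattern, not my exact code:)
>>
>> reader
>>     recordClass: MyClass;
>>     skipHeader;
>>     addField: #timestamp: converter: [ :string | DateAndTime fromString: string ].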
>>
>> I start the image with
>>
>> ./pharo-ui --memory 1000m myimage.image
>>
>> Splitting the CSV file helps:
>> ~1.5MB 5,000 lines = 1.2 seconds.
>> ~15MB 50,000 lines = 8 seconds.
>> ~30MB 100,000 lines = 16 seconds.
>> ~60MB 200,000 lines = 45 seconds.
>>
>>
>> It seems that when the CSV file crosses ~70MB in size, performance goes
>> haywire and the out-of-memory condition is reached. The processing never
>> ends. Sending "kill -SIGUSR1" prints a stack primarily composed of:
>>
>> 0xbffc5d08 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
>> 0xbffc5d20 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
>> 0xbffc5d38 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
>> 0xbffc5d50 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
>> 0xbffc5d68 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
>> 0xbffc5d80 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
>> 0xbffc5d98 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
>>
>> So it seems like it's trying to signal that it's out of memory after it's
>> already out of memory, which triggers another OutOfMemory error. That's
>> why progress stops.
>>
>>
>> ** Aside - OutOfMemory should probably be refactored so it can signal
>> itself without taking up more memory and triggering itself infinitely.
>> Maybe it & its signalling morph infrastructure would be good as a
>> singleton **
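>>
>> (Rough sketch of the idea only; the class-side variable and the method
>> names here are invented:)
>>
>> OutOfMemory class >> preAllocatedInstance
>>     "Answer an instance allocated ahead of time, so signalling needs no new allocation."
>>     ^ PreAllocatedInstance ifNil: [ PreAllocatedInstance := self basicNew ]
>>
>> OutOfMemory class >> signal
>>     "Signal the pre-allocated instance instead of instantiating a fresh one."
>>     ^ self preAllocatedInstance signal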
>>
>>
>>
>> I'm confused about why it runs out of memory. According to htop the image
>> only takes up about 520-540 MB of RAM when it reaches the 'OutOfMemory'
>> condition. This MacBook Air laptop has 4GB of RAM, so there is plenty of
>> room for the image to grow. Also, I've specified a 1,000MB image size when
>> starting, so it should have plenty of room. Is there something I should
>> check, or a flag somewhere that prevents the image from growing on a Mac?
>> This is the latest Pharo30 VM.
>>
>>
>> Thanks for helping me get to the bottom of this
>>
>> Paul
>>
>> Sven Van Caekenberghe-2 wrote
>>> Hi Paul,
>>>
>>> I think you must be doing something wrong with your class; #do: is
>>> implemented as streaming over the records one by one, never holding more
>>> than one in memory.
>>>
>>> This is what I tried:
>>>
>>> 'paul.csv' asFileReference writeStreamDo: [ :file |
>>>   ZnBufferedWriteStream on: file do: [ :out |
>>>     (NeoCSVWriter on: out) in: [ :writer |
>>>       writer writeHeader: { #Number. #Color. #Integer. #Boolean }.
>>>       1 to: 1e7 do: [ :each |
>>>         writer nextPut: { each. #(Red Green Blue) atRandom. 1e6 atRandom. #(true false) atRandom } ] ] ] ].
>>>
>>> This results in a ~300MB file:
>>>
>>> $ ls -lah paul.csv
>>> -rw-r--r--@ 1 sven staff 327M Nov 14 20:45 paul.csv
>>> $ wc paul.csv
>>> 10000001 10000001 342781577 paul.csv
>>>
>>> This is a selective read and collect (loads about 10K records):
>>>
>>> Array streamContents: [ :out |
>>>   'paul.csv' asFileReference readStreamDo: [ :in |
>>>     (NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
>>>       reader skipHeader; addIntegerField; addSymbolField; addIntegerField;
>>>         addFieldConverter: [ :x | x = #true ].
>>>       reader do: [ :each |
>>>         each third < 1000 ifTrue: [ out nextPut: each ] ] ] ] ].
>>>
>>> This worked fine on my MacBook Air, no memory problems. It takes a while
>>> to parse that much data, of course.
>>>
>>> Sven
>>>
>>>> On 14 Nov 2014, at 19:08, Paul DeBruicker <pdebruic@> wrote:
>>>>
>>>> Hi -
>>>>
>>>> I'm processing 9 GB of CSV files (the biggest file is 220MB or so).
>>>> I'm not sure if it's because of the size of the files or the code I've
>>>> written to keep track of the domain objects I'm interested in, but I'm
>>>> getting out-of-memory errors & crashes in Pharo 3 on Mac with the
>>>> latest VM. I haven't checked other VMs.
>>>>
>>>> I'm going to profile my own code and attempt to split the files
>>>> manually
>>>> for now to see what else it could be.
>>>>
>>>>
>>>> Right now I'm doing something similar to
>>>>
>>>> | file reader |
>>>> file := '/path/to/file/myfile.csv' asFileReference readStream.
>>>> reader := NeoCSVReader on: file.
>>>>
>>>> reader
>>>>   recordClass: MyClass;
>>>>   skipHeader;
>>>>   addField: #myField:;
>>>>   ....
>>>>
>>>> reader do: [ :eachRecord |
>>>>   self seeIfRecordIsInterestingAndIfSoKeepIt: eachRecord ].
>>>> file close.
>>>>
>>>>
>>>>
>>>> Is there a facility in NeoCSVReader to read a file in batches (e.g.
>>>> 1000 lines at a time), or an easy way to do that?
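>>>>
>>>> Something like this sketch is what I have in mind (processBatch: here is
>>>> a made-up placeholder for whatever handles each batch of records):
>>>>
>>>> | file reader batch |
>>>> file := '/path/to/file/myfile.csv' asFileReference readStream.
>>>> reader := NeoCSVReader on: file.
>>>> reader recordClass: MyClass; skipHeader; addField: #myField:.
>>>> [ reader atEnd ] whileFalse: [
>>>>   batch := OrderedCollection new: 1000.
>>>>   [ batch size < 1000 and: [ reader atEnd not ] ]
>>>>     whileTrue: [ batch add: reader next ].
>>>>   self processBatch: batch ].
>>>> file close.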
>>>>
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>> Paul
>>
--
View this message in context:
http://forum.world.st/running-out-of-memory-while-processing-a-220MB-csv-file-with-NeoCSVReader-tips-tp4790264p4790328.html
Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.