OK then, you *can* read/process 300MB .csv files ;-)

What does your CSV file look like? Can you show a couple of lines?
You are using a custom record class of your own; what does that look like, and what does it do?
Maybe you can try using Array again?

What percentage of the records read do you keep? In my example it was very small.
Have you tried calculating your memory usage?
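
If it helps, here is a rough, untested sketch of how you could check both things at once. The file path, the column layout (an integer id, a timestamp, an integer amount) and the selection threshold are all made up, so adapt them to your data. It reads plain Arrays instead of your record class, keeps the timestamp as a raw String (so the DateAndTime conversion you mentioned is only paid later, for the records you actually keep), and compares the free space reported by #garbageCollect (which answers the number of free bytes after a full GC, at least in my image) before and after, to get a rough idea of how much memory the kept records retain:

  "rough sketch; path, columns and threshold are made up"
  | freeBefore kept freeAfter |
  freeBefore := Smalltalk garbageCollect.    "free bytes after a full GC"
  kept := Array streamContents: [ :out |
    '/path/to/file/myfile.csv' asFileReference readStreamDo: [ :in |
      (NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
        reader skipHeader;
          addIntegerField;                            "id"
          addFieldConverter: [ :string | string ];    "timestamp, kept as a String"
          addIntegerField.                            "amount"
        reader do: [ :each |
          each third < 1000 ifTrue: [ out nextPut: each ] ] ] ] ].
  freeAfter := Smalltalk garbageCollect.
  { kept size. freeBefore - freeAfter }    "records kept, rough bytes retained"

This is only an approximation (the heap can grow between the two measurements), but it should tell you whether the kept records themselves are what is eating the memory.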

> On 14 Nov 2014, at 22:34, Paul DeBruicker <pdebr...@gmail.com> wrote:
> 
> Yes. With the image & VM I'm having trouble with, I get an array with 9,942
> elements in it. So it works as you'd expect.
> 
> While processing the CSV file the image stays at about 60MB in RAM.  
> 
> Sven Van Caekenberghe-2 wrote
>> Can you successfully run my example code?
>> 
>>> On 14 Nov 2014, at 22:03, Paul DeBruicker <pdebruic@...> wrote:
>>> 
>>> Hi Sven,
>>> 
>>> Thanks for taking a look and testing the NeoCSVReader portion for me. 
>>> You're right, of course, that there's something I'm doing that's slow,
>>> but there is something I can't figure out yet.
>>> 
>>> To provide a little more detail:
>>> 
>>> When the 'csv reading' process completes successfully, profiling shows
>>> that most of the time is spent in NeoCSVReader>>#peekChar and in using
>>> NeoCSVReader>>#addField: to convert a String to a DateAndTime. Dropping
>>> the DateAndTime conversion speeds things up but doesn't stop it from
>>> running out of memory.
>>> 
>>> I start the image with 
>>> 
>>> ./pharo-ui --memory 1000m myimage.image   
>>> 
>>> Splitting the CSV file helps:
>>> ~1.5MB  5,000 lines = 1.2 seconds.
>>> ~15MB   50,000 lines = 8 seconds.
>>> ~30MB   100,000 lines = 16 seconds.
>>> ~60MB   200,000 lines  = 45 seconds.
>>> 
>>> 
>>> It seems that when the CSV file crosses ~70MB in size, performance goes
>>> haywire and leads to the out-of-memory condition. The processing never
>>> ends. Sending "kill -SIGUSR1" prints a stack primarily composed of:
>>> 
>>> 0xbffc5d08 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
>>> 0xbffc5d20 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
>>> 0xbffc5d38 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
>>> 0xbffc5d50 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
>>> 0xbffc5d68 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
>>> 0xbffc5d80 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
>>> 0xbffc5d98 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
>>> 
>>> So it seems like it's trying to signal that it's out of memory after it's
>>> already out of memory, which triggers another OutOfMemory error. That's
>>> why progress stops.
>>> 
>>> 
>>> ** Aside - OutOfMemory should probably be refactored so it can signal
>>> itself without allocating more memory and triggering itself infinitely.
>>> Maybe it & its signalling morph infrastructure would be good as a
>>> singleton. **
>>> 
>>> 
>>> 
>>> I'm confused about why it runs out of memory. According to htop the image
>>> only takes up about 520-540 MB of RAM when it reaches the 'OutOfMemory'
>>> condition. This MacBook Air laptop has 4GB, so there is plenty of room for
>>> the image to grow. Also, I've specified a 1,000MB memory limit when
>>> starting, so it should have plenty of room. Is there something I should
>>> check, or a flag somewhere that prevents it from growing on a Mac? This is
>>> the latest Pharo30 VM.
>>> 
>>> 
>>> Thanks for helping me get to the bottom of this
>>> 
>>> Paul
>>> 
>>> Sven Van Caekenberghe-2 wrote
>>>> Hi Paul,
>>>> 
>>>> I think you must be doing something wrong with your class; #do: is
>>>> implemented as streaming over the records one by one, never holding more
>>>> than one in memory.
>>>> 
>>>> This is what I tried:
>>>> 
>>>> 'paul.csv' asFileReference writeStreamDo: [ :file |
>>>>   ZnBufferedWriteStream on: file do: [ :out |
>>>>     (NeoCSVWriter on: out) in: [ :writer |
>>>>       writer writeHeader: { #Number. #Color. #Integer. #Boolean }.
>>>>       1 to: 1e7 do: [ :each |
>>>>         writer nextPut: { each. #(Red Green Blue) atRandom. 1e6 atRandom. #(true false) atRandom } ] ] ] ].
>>>> 
>>>> This results in a ~300MB file:
>>>> 
>>>> $ ls -lah paul.csv 
>>>> -rw-r--r--@ 1 sven  staff   327M Nov 14 20:45 paul.csv
>>>> $ wc paul.csv 
>>>> 10000001 10000001 342781577 paul.csv
>>>> 
>>>> This is a selective read and collect (loads about 10K records):
>>>> 
>>>> Array streamContents: [ :out |
>>>>   'paul.csv' asFileReference readStreamDo: [ :in |
>>>>     (NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
>>>>       reader skipHeader; addIntegerField; addSymbolField; addIntegerField;
>>>>         addFieldConverter: [ :x | x = #true ].
>>>>       reader do: [ :each | each third < 1000 ifTrue: [ out nextPut: each ] ] ] ] ].
>>>> 
>>>> This worked fine on my MacBook Air, no memory problems. It takes a while
>>>> to parse that much data, of course.
>>>> 
>>>> Sven
>>>> 
>>>>> On 14 Nov 2014, at 19:08, Paul DeBruicker <pdebruic@...> wrote:
>>>>> 
>>>>> Hi -
>>>>> 
>>>>> I'm processing 9 GB of CSV files (the biggest file is 220MB or so).
>>>>> I'm not sure if it's because of the size of the files or the code I've
>>>>> written to keep track of the domain objects I'm interested in, but I'm
>>>>> getting out-of-memory errors & crashes in Pharo 3 on Mac with the latest
>>>>> VM. I haven't checked other VMs.
>>>>> 
>>>>> I'm going to profile my own code and attempt to split the files manually
>>>>> for now to see what else it could be.
>>>>> 
>>>>> 
>>>>> Right now I'm doing something similar to
>>>>> 
>>>>>   | file reader |
>>>>>   file := '/path/to/file/myfile.csv' asFileReference readStream.
>>>>>   reader := NeoCSVReader on: file.
>>>>> 
>>>>>   reader
>>>>>     recordClass: MyClass;
>>>>>     skipHeader;
>>>>>     addField: #myField:;
>>>>>     ....
>>>>> 
>>>>>   reader do: [ :eachRecord |
>>>>>     self seeIfRecordIsInterestingAndIfSoKeepIt: eachRecord ].
>>>>>   file close.
>>>>> 
>>>>> 
>>>>> 
>>>>> Is there a facility in NeoCSVReader to read a file in batches (e.g. 1000
>>>>> lines at a time), or an easy way to do that?
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Paul
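
PS, coming back to your question about reading the file in batches: NeoCSVReader already streams record per record with #do:, but if you want explicit batches you can also pull records yourself with #next and #atEnd. A rough, untested sketch against the paul.csv file from my example above (the batch size of 1000 and the Transcript line are just placeholders for your own processing):

  'paul.csv' asFileReference readStreamDo: [ :in |
    (NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
      reader skipHeader; addIntegerField; addSymbolField; addIntegerField;
        addFieldConverter: [ :x | x = #true ].
      [ reader atEnd ] whileFalse: [ | batch |
        batch := OrderedCollection new: 1000.
        [ batch size < 1000 and: [ reader atEnd not ] ]
          whileTrue: [ batch add: reader next ].
        "one batch is complete here; replace the next line with your own processing"
        Transcript show: batch size printString; cr ] ] ]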

