OK then, you *can* read/process 300MB .csv files ;-) What does your CSV file look like? Can you show a couple of lines? You are using a custom record class of your own; what does it look like, and what does it do? Maybe you can try using an Array again?
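For example, a minimal sketch of keeping only the interesting rows as plain Arrays, with no custom record class (the file path and the first-field test are placeholders for your own file and your own "is this interesting?" check):

| kept |
kept := Array streamContents: [ :out |
	'/path/to/file/myfile.csv' asFileReference readStreamDo: [ :in |
		(NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
			reader skipHeader.
			"without #recordClass: each record comes back as a plain Array of field strings"
			reader do: [ :row |
				(row first = 'interesting') ifTrue: [ out nextPut: row ] ] ] ] ].
kept size.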
What percentage of the records read do you keep? In my example it was very small. Have you tried calculating your memory usage?

> On 14 Nov 2014, at 22:34, Paul DeBruicker <pdebr...@gmail.com> wrote:
>
> Yes. With the image & vm I'm having trouble with I get an array with
> 9,942 elements in it. So it works as you'd expect.
>
> While processing the CSV file the image stays at about 60MB in RAM.
>
> Sven Van Caekenberghe-2 wrote
>> Can you successfully run my example code?
>>
>>> On 14 Nov 2014, at 22:03, Paul DeBruicker <pdebruic@> wrote:
>>>
>>> Hi Sven,
>>>
>>> Thanks for taking a look and testing the NeoCSVReader portion for me.
>>> You're right of course that there's something I'm doing that's slow.
>>> But there is something I can't figure out yet.
>>>
>>> To provide a little more detail:
>>>
>>> When the 'csv reading' process completes successfully, profiling shows
>>> that most of the time is spent in NeoCSVReader>>#peekChar and in using
>>> NeoCSVReader>>#addField: to convert a string to a DateAndTime. Dropping
>>> the DateAndTime conversion speeds things up but doesn't stop it from
>>> running out of memory.
>>>
>>> I start the image with
>>>
>>> ./pharo-ui --memory 1000m myimage.image
>>>
>>> Splitting the CSV file helps:
>>> ~1.5MB  5,000 lines   = 1.2 seconds.
>>> ~15MB   50,000 lines  = 8 seconds.
>>> ~30MB   100,000 lines = 16 seconds.
>>> ~60MB   200,000 lines = 45 seconds.
>>>
>>> It seems that when the CSV file crosses ~70MB in size performance starts
>>> going haywire, leading to the out-of-memory condition. The processing
>>> never ends. Sending "kill -SIGUSR1" prints a stack primarily composed of:
>>>
>>> 0xbffc5d08 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
>>> 0xbffc5d20 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
>>> 0xbffc5d38 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
>>> 0xbffc5d50 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
>>> 0xbffc5d68 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
>>> 0xbffc5d80 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
>>> 0xbffc5d98 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
>>>
>>> So it seems like it's trying to signal that it's out of memory after it's
>>> already out of memory, which triggers another OutOfMemory error. That's
>>> why progress stops.
>>>
>>> ** Aside - OutOfMemory should probably be refactored so it can signal
>>> itself without taking up more memory and triggering itself infinitely.
>>> Maybe it & its signalling morph infrastructure would be good as a
>>> singleton **
>>>
>>> I'm confused about why it runs out of memory. According to htop the image
>>> only takes up about 520-540 MB of RAM when it reaches the 'OutOfMemory'
>>> condition. This MacBook Air laptop has 4GB, so there is plenty of room for
>>> the image to grow. Also, I've specified a 1,000MB memory limit when
>>> starting, so it should have plenty of room. Is there something I should
>>> check, or a flag somewhere that prevents it from growing on a Mac? This is
>>> the latest Pharo30 VM.
>>>
>>> Thanks for helping me get to the bottom of this.
>>>
>>> Paul
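Regarding the memory usage question above, a rough sketch of one way to measure it (assuming Smalltalk garbageCollect answers the free space after a full GC, as it does in recent Pharo images; note that #sizeInMemory only counts each object's shallow size, and #loadInterestingRecords is a placeholder for your own loading code):

| freeBefore freeAfter kept |
freeBefore := Smalltalk garbageCollect.	"bytes free after a full GC"
kept := self loadInterestingRecords.	"placeholder for your own loading code"
freeAfter := Smalltalk garbageCollect.
freeBefore - freeAfter.	"rough growth caused by the kept records"
kept inject: 0 into: [ :sum :each | sum + each sizeInMemory ].	"shallow sizes only"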
>>> Sven Van Caekenberghe-2 wrote
>>>> Hi Paul,
>>>>
>>>> I think you must be doing something wrong with your class; the #do: is
>>>> implemented as streaming over the records one by one, never holding more
>>>> than one in memory.
>>>>
>>>> This is what I tried:
>>>>
>>>> 'paul.csv' asFileReference writeStreamDo: [ :file |
>>>> 	ZnBufferedWriteStream on: file do: [ :out |
>>>> 		(NeoCSVWriter on: out) in: [ :writer |
>>>> 			writer writeHeader: { #Number. #Color. #Integer. #Boolean }.
>>>> 			1 to: 1e7 do: [ :each |
>>>> 				writer nextPut: { each. #(Red Green Blue) atRandom.
>>>> 					1e6 atRandom. #(true false) atRandom } ] ] ] ].
>>>>
>>>> This results in a 300MB file:
>>>>
>>>> $ ls -lah paul.csv
>>>> -rw-r--r--@ 1 sven staff 327M Nov 14 20:45 paul.csv
>>>> $ wc paul.csv
>>>> 10000001 10000001 342781577 paul.csv
>>>>
>>>> This is a selective read and collect (loads about 10K records):
>>>>
>>>> Array streamContents: [ :out |
>>>> 	'paul.csv' asFileReference readStreamDo: [ :in |
>>>> 		(NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
>>>> 			reader skipHeader; addIntegerField; addSymbolField;
>>>> 				addIntegerField; addFieldConverter: [ :x | x = #true ].
>>>> 			reader do: [ :each |
>>>> 				each third < 1000 ifTrue: [ out nextPut: each ] ] ] ] ].
>>>>
>>>> This worked fine on my MacBook Air, no memory problems. It takes a while
>>>> to parse that much data, of course.
>>>>
>>>> Sven
>>>>
>>>>> On 14 Nov 2014, at 19:08, Paul DeBruicker <pdebruic@> wrote:
>>>>>
>>>>> Hi -
>>>>>
>>>>> I'm processing 9 GB of CSV files (the biggest file is 220MB or so).
>>>>> I'm not sure if it's because of the size of the files or the code I've
>>>>> written to keep track of the domain objects I'm interested in, but I'm
>>>>> getting out of memory errors & crashes in Pharo 3 on Mac with the latest
>>>>> VM. I haven't checked other VMs.
>>>>>
>>>>> I'm going to profile my own code and attempt to split the files manually
>>>>> for now to see what else it could be.
>>>>>
>>>>> Right now I'm doing something similar to
>>>>>
>>>>> | file reader |
>>>>> file := '/path/to/file/myfile.csv' asFileReference readStream.
>>>>> reader := NeoCSVReader on: file.
>>>>>
>>>>> reader
>>>>> 	recordClass: MyClass;
>>>>> 	skipHeader;
>>>>> 	addField: #myField:;
>>>>> 	....
>>>>>
>>>>> reader do: [ :eachRecord |
>>>>> 	self seeIfRecordIsInterestingAndIfSoKeepIt: eachRecord ].
>>>>> file close.
>>>>>
>>>>> Is there a facility in NeoCSVReader to read a file in batches (e.g. 1000
>>>>> lines at a time) or an easy way to do that?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Paul
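Coming back to the batching question in the original message: NeoCSVReader also answers the stream protocol (#atEnd / #next), so batches can be assembled by hand. A sketch, where the file path is a placeholder and #processBatch: stands in for whatever you do with each group of 1000 records:

'/path/to/file/myfile.csv' asFileReference readStreamDo: [ :in |
	(NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
		| batch |
		reader skipHeader.
		[ reader atEnd ] whileFalse: [
			batch := OrderedCollection new: 1000.
			[ batch size < 1000 and: [ reader atEnd not ] ]
				whileTrue: [ batch add: reader next ].
			"processBatch: is a placeholder for your own per-batch work"
			self processBatch: batch ] ] ].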