> On 14 Nov 2014, at 23:14, Paul DeBruicker <pdebr...@gmail.com> wrote:
>
> Hi Sven
>
> Yes, like I said earlier, after your first email, I think it's not so much a problem with NeoCSV as with what I'm doing and an out of memory condition.
>
> Have you ever seen a stack after sending kill -SIGUSR1 that looks like this:
>
> output file stack is full.
> output file stack is full.
> output file stack is full.
> output file stack is full.
> output file stack is full.
> ....
>
> What does that mean?
I don't know, but I think that you are really out of memory.

BTW, I think that setting no flags is better, memory will expand maximally then. I think the useful maximum is closer to 1GB than 2GB.

> Answers to your questions below.

It is difficult to follow what you are doing exactly, but I think that you underestimate how much memory a parsed, structured/nested object uses. Taking the second line of your example, the 20+ fields, with 3 DateAndTimes, easily cost between 512 and 1024 bytes per record. That would limit you to between 1M and 2M records.

I tried this:

Array streamContents: [ :data |
  5e2 timesRepeat: [
    data nextPut: (Array streamContents: [ :out |
      20 timesRepeat: [ out nextPut: Character alphabet ].
      3 timesRepeat: [ out nextPut: DateAndTime now ] ]) ] ].

It worked for 5e5, but not for 5e6 - I didn't try numbers in between as it takes very long.
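To put some numbers on that estimate, here is a rough back-of-envelope calculation you could evaluate in a workspace. The per-object overheads are assumptions (an ~8 byte header and 4 byte slots, as on a 32-bit image), and DateAndTime internals are not counted, so treat this as a sketch rather than a measurement:

| perString perDateAndTime perRecord |
perString := 8 + 26.             "assumed header + 26 one-byte characters"
perDateAndTime := 8 + (4 * 4).   "assumed header + a few instance variables; its Duration offset costs extra"
perRecord := 8 + (23 * 4)        "the record object itself, ~23 slots"
    + (20 * perString)           "20 string fields"
    + (3 * perDateAndTime).      "3 DateAndTime fields"
perRecord * 500000 / (1024 * 1024.0)   "roughly 400 MB for 500k records, before DateAndTime internals"

Even with these optimistic numbers, 500k such records already eat most of a 1GB image, which is in the same ballpark as the ~500MB you report before it falls over.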
Good luck, if you can solve this, please tell us how you did it.

> Thanks again for helping me out
>
> Sven Van Caekenberghe-2 wrote
>> OK then, you *can* read/process 300MB .csv files ;-)
>>
>> What does your CSV file look like, can you show a couple of lines ?
>>
>> here are 2 lines + a header:
>>
>> "provnum","Provname","address","city","state","zip","survey_date_output","SurveyType","defpref","tag","tag_desc","scope","defstat","statdate","cycle","standard","complaint","filedate"
>> "015009","BURNS NURSING HOME, INC.","701 MONROE STREET NW","RUSSELLVILLE","AL","35653","2013-09-05","Health","F","0314","Give residents proper treatment to prevent new bed (pressure) sores or heal existing bed sores.","D","Deficient, Provider has date of correction","2013-10-10",1,"Y","N","2014-01-01"
>> "015009","BURNS NURSING HOME, INC.","701 MONROE STREET NW","RUSSELLVILLE","AL","35653","2013-09-05","Health","F","0315","Ensure that each resident who enters the nursing home without a catheter is not given a catheter, unless medically necessary, and that incontinent patients receive proper services to prevent urinary tract infections and restore normal bladder functions.","D","Deficient, Provider has date of correction","2013-10-10",1,"Y","N","2014-01-01"
>>
>> You are using a custom record class of your own, what does that look like or do ?
>>
>> A custom record class. This is all publicly available data but I'm keeping track of the performance of US based health care providers during their annual inspections. So the records are notes of a deficiency during the inspection and I'm keeping those notes in a collection in an instance of the health care provider's class. The custom record class just converts the CSV record to objects (Integers, Strings, DateAndTime) and then gets stuffed in the health care provider's deficiency history OrderedCollection (which has about 100 items). Again I don't think it's what I'm doing as much as the image isn't growing when it needs to.
>>
>> Maybe you can try using Array again ?
>>
>> I've attempted to do it where I parse and convert the entire CSV into domain objects then add them to the image, and the parsing works fine, but the system runs out of resources during the update phase.
>>
>> What percentage of records read do you keep ? In my example it was very small. Have you tried calculating your memory usage ?
>>
>> I'm keeping some data from every record, but it doesn't load more than 500MB of the data before falling over. I am not attempting to load the 9GB of CSV files into one image. For 95% of the records in the CSV file, 20 of the 22 columns of the data are the same from file to file; just a 'published date' and a 'time to expiration' date change. Each file covers a month, with about 500k deficiencies. Each month some deficiencies are added to the file and some are resolved. So the total number of deficiencies in the image is about 500k. For those records that don't expire in a given month I'm adding the published date to a collection of published dates for the record and also adding the "time to expiration" to a collection of those, to record what was made public, and letting the rest of the data get GC'd. I don't load only those two values because the other fields of the record in the CSV could change.
>>
>> I have not calculated the memory usage for the collection because I thought it would have no problem fitting in the 2GB of RAM I have on this machine.
>>
>>> On 14 Nov 2014, at 22:34, Paul DeBruicker <pdebruic@...> wrote:
>>>
>>> Yes. With the image & vm I'm having trouble with I get an array with 9,942 elements in it. So it works as you'd expect.
>>>
>>> While processing the CSV file the image stays at about 60MB in RAM.
>>>
>>> Sven Van Caekenberghe-2 wrote
>>>> Can you successfully run my example code ?
>>>>
>>>>> On 14 Nov 2014, at 22:03, Paul DeBruicker <pdebruic@...> wrote:
>>>>>
>>>>> Hi Sven,
>>>>>
>>>>> Thanks for taking a look and testing the NeoCSVReader portion for me. You're right of course that there's something I'm doing that's slow. But there is something I can't figure out yet.
>>>>>
>>>>> To provide a little more detail:
>>>>>
>>>>> When the 'csv reading' process completes successfully, profiling shows that most of the time is spent in NeoCSVReader>>#peekChar and in using NeoCSVReader>>#addField: to convert a string to a DateAndTime. Dropping the DateAndTime conversion speeds things up but doesn't stop it from running out of memory.
>>>>>
>>>>> I start the image with
>>>>>
>>>>> ./pharo-ui --memory 1000m myimage.image
>>>>>
>>>>> Splitting the CSV file helps:
>>>>>
>>>>> ~1.5MB 5,000 lines = 1.2 seconds.
>>>>> ~15MB 50,000 lines = 8 seconds.
>>>>> ~30MB 100,000 lines = 16 seconds.
>>>>> ~60MB 200,000 lines = 45 seconds.
>>>>>
>>>>> It seems that when the CSV file crosses ~70MB in size things start going haywire with performance, and that leads to the out of memory condition. The processing never ends. Sending "kill -SIGUSR1" prints a stack primarily composed of:
>>>>>
>>>>> 0xbffc5d08 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
>>>>> 0xbffc5d20 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
>>>>> 0xbffc5d38 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
>>>>> 0xbffc5d50 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
>>>>> 0xbffc5d68 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
>>>>> 0xbffc5d80 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
>>>>> 0xbffc5d98 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
>>>>>
>>>>> So it seems like it's trying to signal that it's out of memory after it's already out of memory, which triggers another OutOfMemory error. So that's why progress stops.
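Regarding the DateAndTime conversion cost mentioned above: if the dates are only needed for the records that survive filtering, one option is to keep those fields as raw strings during the scan and convert them afterwards. A minimal sketch, assuming a plain #addField: (no converter) stores the unconverted string via the setter; MyClass, the #statdate: setter and #seeIfRecordIsInterestingAndIfSoKeepIt: are placeholders taken from the code further down in this thread:

| file reader |
file := '/path/to/file/myfile.csv' asFileReference readStream.
reader := NeoCSVReader on: file.
reader
    recordClass: MyClass;
    skipHeader;
    addField: #statdate:.    "raw string kept as-is; other fields added the same way"
reader do: [ :each | self seeIfRecordIsInterestingAndIfSoKeepIt: each ].
file close.
"Only for kept records, convert later, e.g. with DateAndTime fromString: aRecord's raw statdate."

This does not fix the memory growth by itself, but it moves the conversion work out of the hot loop and defers DateAndTime allocation to the small subset of records actually retained.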
>>>>>
>>>>> ** Aside - OutOfMemory should probably be refactored to be able to signal itself without taking up more memory, triggering itself infinitely. Maybe it & its signalling morph infrastructure would be good as a singleton **
>>>>>
>>>>> I'm confused about why it runs out of memory. According to htop the image only takes up about 520-540 MB of RAM when it reaches the 'OutOfMemory' condition. This MacBook Air laptop has 4GB, and has plenty of room for the image to grow. Also, I've specified a 1,000MB image size when starting, so it should have plenty of room. Is there something I should check or a flag somewhere that prevents it from growing on a Mac? This is the latest Pharo30 VM.
>>>>>
>>>>> Thanks for helping me get to the bottom of this
>>>>>
>>>>> Paul
>>>>>
>>>>> Sven Van Caekenberghe-2 wrote
>>>>>> Hi Paul,
>>>>>>
>>>>>> I think you must be doing something wrong with your class, the #do: is implemented as streaming over the records one by one, never holding more than one in memory.
>>>>>>
>>>>>> This is what I tried:
>>>>>>
>>>>>> 'paul.csv' asFileReference writeStreamDo: [ :file |
>>>>>>   ZnBufferedWriteStream on: file do: [ :out |
>>>>>>     (NeoCSVWriter on: out) in: [ :writer |
>>>>>>       writer writeHeader: { #Number. #Color. #Integer. #Boolean }.
>>>>>>       1 to: 1e7 do: [ :each |
>>>>>>         writer nextPut: { each. #(Red Green Blue) atRandom. 1e6 atRandom. #(true false) atRandom } ] ] ] ].
>>>>>>
>>>>>> This results in a 300MB file:
>>>>>>
>>>>>> $ ls -lah paul.csv
>>>>>> -rw-r--r--@ 1 sven staff 327M Nov 14 20:45 paul.csv
>>>>>> $ wc paul.csv
>>>>>> 10000001 10000001 342781577 paul.csv
>>>>>>
>>>>>> This is a selective read and collect (loads about 10K records):
>>>>>>
>>>>>> Array streamContents: [ :out |
>>>>>>   'paul.csv' asFileReference readStreamDo: [ :in |
>>>>>>     (NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
>>>>>>       reader skipHeader; addIntegerField; addSymbolField; addIntegerField; addFieldConverter: [ :x | x = #true ].
>>>>>>       reader do: [ :each | each third < 1000 ifTrue: [ out nextPut: each ] ] ] ] ].
>>>>>>
>>>>>> This worked fine on my MacBook Air, no memory problems. It takes a while to parse that much data, of course.
>>>>>>
>>>>>> Sven
>>>>>>
>>>>>>> On 14 Nov 2014, at 19:08, Paul DeBruicker <pdebruic@...> wrote:
>>>>>>>
>>>>>>> Hi -
>>>>>>>
>>>>>>> I'm processing 9 GB of CSV files (the biggest file is 220MB or so). I'm not sure if it's because of the size of the files or the code I've written to keep track of the domain objects I'm interested in, but I'm getting out of memory errors & crashes in Pharo 3 on Mac with the latest VM. I haven't checked other VMs.
>>>>>>>
>>>>>>> I'm going to profile my own code and attempt to split the files manually for now to see what else it could be.
>>>>>>>
>>>>>>> Right now I'm doing something similar to
>>>>>>>
>>>>>>> | file reader |
>>>>>>> file := '/path/to/file/myfile.csv' asFileReference readStream.
>>>>>>> reader := NeoCSVReader on: file.
>>>>>>>
>>>>>>> reader
>>>>>>>   recordClass: MyClass;
>>>>>>>   skipHeader;
>>>>>>>   addField: #myField:;
>>>>>>>   ....
>>>>>>> reader do: [ :eachRecord | self seeIfRecordIsInterestingAndIfSoKeepIt: eachRecord ].
>>>>>>> file close.
>>>>>>>
>>>>>>> Is there a facility in NeoCSVReader to read a file in batches (e.g. 1000 lines at a time) or an easy way to do that ?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Paul
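On the batching question in the original message at the bottom: NeoCSVReader hands out records one at a time, so batches can be layered on top of the stream-style messages. A minimal sketch, assuming #next and #atEnd behave as stream protocol in the version in use; #processBatch: is a hypothetical hook for your own keep/discard logic:

| file reader batch |
file := '/path/to/file/myfile.csv' asFileReference readStream.
reader := NeoCSVReader on: file.
reader skipHeader.    "no fields configured: each record comes back as an Array of field strings"
[ reader atEnd ] whileFalse: [
    batch := OrderedCollection new: 1000.
    [ batch size < 1000 and: [ reader atEnd not ] ]
        whileTrue: [ batch add: reader next ].
    self processBatch: batch ].    "handle up to 1000 records, then let them be garbage collected"
file close.

Whether this helps depends on what is retained per batch; if the kept data itself outgrows the image, batching only delays the OutOfMemory rather than preventing it.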