hi Daniel,

On Sun, Nov 18, 2018 at 10:17 PM Daniel Harper <djharpe...@gmail.com> wrote:
> Sorry just realised SVG doesn't work.
>
> PNG of the pprof can be found here: https://i.imgur.com/BVXv1Jm.png
>
> Daniel Harper
> http://djhworld.github.io
>
> On Sun, 18 Nov 2018 at 21:07, Daniel Harper <djharpe...@gmail.com> wrote:
>
> > Wasn't sure where the best place to discuss this, but I've noticed that
> > when running the following piece of code
> >
> > https://play.golang.org/p/SKkqPWoHPPS
> >
> > on a CSV file that contains roughly 1 million records (about 100mb of
> > data), the memory usage of the process leaps to about 9.1GB
> >
> > The records look something like this
> >
> > "2018-08-27T20:00:00Z","cdnA","dash","audio","http","programme-1","3577","2018","08","27","2018-08-27","live"
> > "2018-08-27T20:00:01Z","cdnB","hls","video","https","programme-2","14","2018","08","27","2018-08-27","ondemand"
> >
> > I've attached a pprof output of the process.
> >
> > From the looks of it the heavy use of _strings_ might be where most of
> > the memory is going.
> >
> > Is this expected? I'm new to the code, happy to help where possible!

it's somewhat expected.

you use `ioutil.ReadFile` to get your data. this reads the whole file into
memory and keeps it there: so there's that. for much bigger files, I would
recommend using `os.Open` and streaming the data instead (see the first
sketch at the end of this mail.)

also, you don't release the individual records once they have been passed to
the table, so you have a memory leak (see the second sketch.) here is my
current attempt:

- https://play.golang.org/p/ns3GJW6Wx3T

finally, as I was alluding to on the #data-science slack channel, right now
Go arrow/csv will create a new Record for each row in the incoming CSV file,
so you get a bunch of overhead for every row/record. a much more efficient
way would be to chunk `n` rows into a single Record (see the third sketch.)
an even more efficient way would be to create a dedicated csv.table type
that implements array.Table (as it seems you're interested in using that
interface) but only reads the incoming CSV file piecewise (ie: implementing
the chunking I was alluding to above but w/o having to load the whole
[]Record slice.)

as a first step to improve this issue, implementing chunking would already
shave off a bunch of overhead.

-s
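
PS: a few rough sketches to make the points above concrete. first, streaming
with `os.Open` instead of slurping the whole file with `ioutil.ReadFile`.
the file name and the 2-column schema here are made up for the example; your
real schema would need all 12 columns:

```go
package main

import (
	"log"
	"os"

	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/csv"
)

func main() {
	// os.Open hands the csv reader an io.Reader, so bytes are pulled
	// from disk as needed instead of living in one big []byte.
	f, err := os.Open("data.csv") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// illustrative schema: two string columns only.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "timestamp", Type: arrow.BinaryTypes.String},
		{Name: "cdn", Type: arrow.BinaryTypes.String},
	}, nil)

	r := csv.NewReader(f, schema, csv.WithComma(','))

	for r.Next() {
		rec := r.Record()
		_ = rec // only valid until the next call to r.Next, unless retained
	}
	if err := r.Err(); err != nil {
		log.Fatal(err)
	}
}
```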
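
second, the leak: if you collect the records to build an array.Table, retain
them while collecting and release them once the table holds its own
references. a sketch, written as a drop-in for the loop above (it also needs
the `github.com/apache/arrow/go/arrow/array` import):

```go
// toTable drains the csv reader into an array.Table, releasing the
// per-row records once the table has retained its own references;
// dropping our references here is what plugs the leak.
func toTable(r *csv.Reader) array.Table {
	var recs []array.Record
	for r.Next() {
		rec := r.Record()
		rec.Retain() // keep it alive past the next call to r.Next
		recs = append(recs, rec)
	}
	tbl := array.NewTableFromRecords(r.Schema(), recs)
	for _, rec := range recs {
		rec.Release()
	}
	return tbl
}
```

the caller owns the returned table and should `Release` it when done.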
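
third, the chunking idea: instead of one Record per CSV row, accumulate `n`
rows into an `array.RecordBuilder` and emit one Record per chunk. a rough
sketch using the stdlib encoding/csv and the same made-up 2-column schema:

```go
package main

import (
	stdcsv "encoding/csv"
	"io"
	"log"
	"os"

	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/memory"
)

func main() {
	f, err := os.Open("data.csv") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	schema := arrow.NewSchema([]arrow.Field{
		{Name: "timestamp", Type: arrow.BinaryTypes.String},
		{Name: "cdn", Type: arrow.BinaryTypes.String},
	}, nil)

	bld := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
	defer bld.Release()

	const chunk = 1024 // illustrative chunk size
	cr := stdcsv.NewReader(f)
	n := 0
	for {
		row, err := cr.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		bld.Field(0).(*array.StringBuilder).Append(row[0])
		bld.Field(1).(*array.StringBuilder).Append(row[1])
		n++
		if n == chunk {
			rec := bld.NewRecord() // one Record per chunk; this resets bld
			// ... hand rec off to the table/consumer here ...
			rec.Release()
			n = 0
		}
	}
	if n > 0 {
		rec := bld.NewRecord() // trailing partial chunk
		rec.Release()
	}
}
```

with a chunk size of 1024, the 1 million row file would allocate on the
order of a thousand Records instead of a million.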