Uncompressed

$ ls -la concurrent_streams.csv
-rw-r--r--  1 danielharper  112M Nov 16 19:21 concurrent_streams.csv
$ wc -l concurrent_streams.csv
 1007481 concurrent_streams.csv

Daniel Harper
http://djhworld.github.io


On Mon, 19 Nov 2018 at 21:55, Wes McKinney <wesmck...@gmail.com> wrote:

> I'm curious how the file is only 100MB if it's producing ~6GB of
> strings in memory. Is it compressed?
>
> On Mon, Nov 19, 2018 at 4:48 PM Daniel Harper <djharpe...@gmail.com> wrote:
> >
> > Thanks,
> >
> > I've tried the new code and that seems to have shaved about 1GB of memory
> > off, so the heap is about 8.84GB now; here is the updated pprof output:
> > https://i.imgur.com/itOHqBf.png
> >
> > It looks like the majority of allocations are in the memory.GoAllocator
> >
> > (pprof) top
> > Showing nodes accounting for 8.84GB, 100% of 8.84GB total
> > Showing top 10 nodes out of 41
> >       flat  flat%   sum%        cum   cum%
> >     4.24GB 47.91% 47.91%     4.24GB 47.91%  github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
> >     2.12GB 23.97% 71.88%     2.12GB 23.97%  github.com/apache/arrow/go/arrow/memory.NewResizableBuffer (inline)
> >     1.07GB 12.07% 83.95%     1.07GB 12.07%  github.com/apache/arrow/go/arrow/array.NewData
> >     0.83GB  9.38% 93.33%     0.83GB  9.38%  github.com/apache/arrow/go/arrow/array.NewStringData
> >     0.33GB  3.69% 97.02%     1.31GB 14.79%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
> >     0.18GB  2.04% 99.06%     0.18GB  2.04%  github.com/apache/arrow/go/arrow/array.NewChunked
> >     0.07GB  0.78% 99.85%     0.07GB  0.78%  github.com/apache/arrow/go/arrow/array.NewInt64Data
> >     0.01GB  0.15%   100%     0.21GB  2.37%  github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
> >          0     0%   100%        6GB 67.91%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
> >          0     0%   100%     4.03GB 45.54%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve
> >
> > I'm a bit busy at the moment, but I'll probably repeat the same test on the
> > other Arrow implementations (e.g. Java) to see if they allocate a similar
> > amount.
> >
> > Daniel Harper
> > http://djhworld.github.io
> >
> > On Mon, 19 Nov 2018 at 10:17, Sebastien Binet <bi...@cern.ch> wrote:
> > >
> > > hi Daniel,
> > >
> > > On Sun, Nov 18, 2018 at 10:17 PM Daniel Harper <djharpe...@gmail.com> wrote:
> > > >
> > > > Sorry, just realised SVG doesn't work.
> > > >
> > > > PNG of the pprof can be found here: https://i.imgur.com/BVXv1Jm.png
> > > >
> > > > Daniel Harper
> > > > http://djhworld.github.io
> > > >
> > > > On Sun, 18 Nov 2018 at 21:07, Daniel Harper <djharpe...@gmail.com> wrote:
> > > > >
> > > > > Wasn't sure where the best place to discuss this was, but I've noticed
> > > > > that when running the following piece of code
> > > > >
> > > > > https://play.golang.org/p/SKkqPWoHPPS
> > > > >
> > > > > on a CSV file that contains roughly 1 million records (about 100mb of
> > > > > data), the memory usage of the process leaps to about 9.1GB.
> > > > >
> > > > > The records look something like this:
> > > > >
> > > > > "2018-08-27T20:00:00Z","cdnA","dash","audio","http","programme-1","3577","2018","08","27","2018-08-27","live"
> > > > > "2018-08-27T20:00:01Z","cdnB","hls","video","https","programme-2","14","2018","08","27","2018-08-27","ondemand"
> > > > >
> > > > > I've attached a pprof output of the process.
> > > > >
> > > > > From the looks of it, the heavy use of _strings_ might be where most of
> > > > > the memory is going.
> > > > >
> > > > > Is this expected? I'm new to the code, happy to help where possible!
> > > it's somewhat expected.
> > >
> > > you use `ioutil.ReadFile` to get your data.
> > > this will read the whole file into memory and stick it there: so there's that.
> > > for much bigger files, I would recommend using `os.Open`.
> > >
> > > also, you don't release the individual records once passed to the table, so
> > > you have a memory leak.
> > > here is my current attempt:
> > > - https://play.golang.org/p/ns3GJW6Wx3T
> > >
> > > finally, as I was alluding to on the #data-science slack channel, right now
> > > Go arrow/csv will create a new Record for each row in the incoming CSV
> > > file, so you get a bunch of overhead for every row/record.
> > >
> > > a much more efficient way would be to chunk `n` rows into a single Record.
> > > an even more efficient way would be to create a dedicated csv.table type
> > > that implements array.Table (as it seems you're interested in using that
> > > interface) but only reads the incoming CSV file piecewise (i.e. implementing
> > > the chunking I was alluding to above, but without having to load the whole
> > > []Record slice).
> > >
> > > as a first step to improve this issue, implementing chunking would already
> > > shave off a bunch of overhead.
> > >
> > > -s
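
For illustration, a rough sketch of the stream-and-release pattern Sebastien describes, written against the Go arrow/csv reader API (csv.NewReader / Next / Record / Retain / Release). It is not the code from either playground link above: the csv.WithChunk option stands in for the row-chunking he proposes (it may not have existed at the time of this thread), and the two-column schema is a placeholder for the file's real twelve columns.

package main

import (
	"log"
	"os"

	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/csv"
)

func main() {
	// Stream the file with os.Open instead of slurping it whole,
	// so the raw CSV bytes never sit in memory all at once.
	f, err := os.Open("concurrent_streams.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Placeholder schema: the real file has 12 columns.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "timestamp", Type: arrow.BinaryTypes.String},
		{Name: "cdn", Type: arrow.BinaryTypes.String},
	}, nil)

	// Batch many rows into a single Record instead of allocating
	// a fresh Record per CSV row.
	r := csv.NewReader(f, schema, csv.WithChunk(64*1024))
	defer r.Release()

	var recs []array.Record
	for r.Next() {
		rec := r.Record()
		rec.Retain() // the Record is only valid until the next call to Next()
		recs = append(recs, rec)
	}
	if err := r.Err(); err != nil {
		log.Fatal(err)
	}

	tbl := array.NewTableFromRecords(schema, recs)
	defer tbl.Release()

	// The table holds its own references now, so drop ours; without
	// these Release calls the record buffers are never reclaimed.
	for _, rec := range recs {
		rec.Release()
	}

	log.Printf("rows=%d cols=%d", tbl.NumRows(), tbl.NumCols())
}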