Sorry I've been away at re:Invent. I've just tried out what's currently on master (with the chunked change that looks like it has been merged). I'll do the breakdown of the different parts later, but as a high-level look at just running the same script as described above, these are the numbers:

https://docs.google.com/spreadsheets/d/1SE4S-wcKQ5cwlHoN7rQm7XOZLjI0HSyMje6q-zLvUHM/edit?usp=sharing

Looks to me like the change has definitely helped, with memory usage dropping to around 300MB, although the usage doesn't really change that much once the chunk size is > 1000.

Daniel Harper
http://djhworld.github.io
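For context, a run like the one behind those numbers might look roughly like the sketch below: it reads the file with the new csv.WithChunk option at a few chunk sizes and prints a rough heap figure. The two-column schema and the use of runtime.ReadMemStats are assumptions made for illustration; the real layout of concurrent_streams.csv and the exact measurement script aren't shown in this thread.

    package main

    import (
        "fmt"
        "log"
        "os"
        "runtime"

        "github.com/apache/arrow/go/arrow"
        "github.com/apache/arrow/go/arrow/csv"
    )

    func main() {
        // Hypothetical schema: the real column layout of concurrent_streams.csv
        // isn't shown in this thread.
        schema := arrow.NewSchema([]arrow.Field{
            {Name: "stream_id", Type: arrow.BinaryTypes.String},
            {Name: "count", Type: arrow.PrimitiveTypes.Int64},
        }, nil)

        for _, chunk := range []int{-1, 100, 1000, 10000, 100000} {
            f, err := os.Open("concurrent_streams.csv")
            if err != nil {
                log.Fatal(err)
            }

            r := csv.NewReader(f, schema, csv.WithChunk(chunk))
            rows := int64(0)
            for r.Next() {
                rows += r.Record().NumRows()
            }
            if err := r.Err(); err != nil {
                log.Fatal(err)
            }
            r.Release()
            f.Close()

            // Rough indication of the live heap after the read; not the same
            // thing as peak usage, so treat it as a ballpark only.
            var ms runtime.MemStats
            runtime.ReadMemStats(&ms)
            fmt.Printf("chunk=%d rows=%d heapAlloc=%dMB\n", chunk, rows, ms.HeapAlloc/(1<<20))
        }
    }

With csv.WithChunk(-1) the whole file is read into a single record, while positive values bound how many rows each record holds, which is what keeps the peak heap lower for the chunked runs.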
On Fri, 23 Nov 2018 at 10:58, Sebastien Binet <bi...@cern.ch> wrote:

> On Mon, Nov 19, 2018 at 11:29 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > That seems buggy then. There is only 4.125 bytes of overhead per
> > string value on average (a 32-bit offset, plus a valid bit)
> >
> > On Mon, Nov 19, 2018 at 5:02 PM Daniel Harper <djharpe...@gmail.com> wrote:
> > >
> > > Uncompressed
> > >
> > > $ ls -la concurrent_streams.csv
> > > -rw-r--r-- 1 danielharper 112M Nov 16 19:21 concurrent_streams.csv
> > >
> > > $ wc -l concurrent_streams.csv
> > > 1007481 concurrent_streams.csv
> > >
> > > Daniel Harper
> > > http://djhworld.github.io
> > >
> > > On Mon, 19 Nov 2018 at 21:55, Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > > I'm curious how the file is only 100MB if it's producing ~6GB of
> > > > strings in memory. Is it compressed?
> > > >
> > > > On Mon, Nov 19, 2018 at 4:48 PM Daniel Harper <djharpe...@gmail.com> wrote:
> > > > >
> > > > > Thanks,
> > > > >
> > > > > I've tried the new code and that seems to have shaved about 1GB of
> > > > > memory off, so the heap is about 8.84GB now, here is the updated
> > > > > pprof output https://i.imgur.com/itOHqBf.png
> > > > >
> > > > > It looks like the majority of allocations are in the memory.GoAllocator
> > > > >
> > > > > (pprof) top
> > > > > Showing nodes accounting for 8.84GB, 100% of 8.84GB total
> > > > > Showing top 10 nodes out of 41
> > > > >       flat  flat%   sum%        cum   cum%
> > > > >     4.24GB 47.91% 47.91%     4.24GB 47.91%  github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
> > > > >     2.12GB 23.97% 71.88%     2.12GB 23.97%  github.com/apache/arrow/go/arrow/memory.NewResizableBuffer (inline)
> > > > >     1.07GB 12.07% 83.95%     1.07GB 12.07%  github.com/apache/arrow/go/arrow/array.NewData
> > > > >     0.83GB  9.38% 93.33%     0.83GB  9.38%  github.com/apache/arrow/go/arrow/array.NewStringData
> > > > >     0.33GB  3.69% 97.02%     1.31GB 14.79%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
> > > > >     0.18GB  2.04% 99.06%     0.18GB  2.04%  github.com/apache/arrow/go/arrow/array.NewChunked
> > > > >     0.07GB  0.78% 99.85%     0.07GB  0.78%  github.com/apache/arrow/go/arrow/array.NewInt64Data
> > > > >     0.01GB  0.15%   100%     0.21GB  2.37%  github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
> > > > >          0     0%   100%        6GB 67.91%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
> > > > >          0     0%   100%     4.03GB 45.54%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve
> > > > >
> > > > > I'm a bit busy at the moment but I'll probably repeat the same test on the
> > > > > other Arrow implementations (e.g. Java) to see if they allocate a similar
> > > > > amount.
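As an aside on the listing quoted above: a heap profile like that can be produced with Go's standard runtime/pprof package and then inspected with go tool pprof -top. The sketch below is only a minimal, assumed setup, not necessarily how these particular numbers were gathered.

    package main

    import (
        "log"
        "os"
        "runtime"
        "runtime/pprof"
    )

    // dumpHeapProfile writes a heap profile to path; inspect it afterwards with
    //     go tool pprof -top heap.prof
    // to get a listing like the one quoted above.
    func dumpHeapProfile(path string) {
        f, err := os.Create(path)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        runtime.GC() // make sure recent allocations are reflected in the profile
        if err := pprof.WriteHeapProfile(f); err != nil {
            log.Fatal(err)
        }
    }

    func main() {
        // ... run the CSV -> Arrow conversion here ...
        dumpHeapProfile("heap.prof")
    }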
> I've implemented chunking over there:
> - https://github.com/apache/arrow/pull/3019
>
> could you try with a couple of chunking values?
> e.g.:
> - csv.WithChunk(-1): reads the whole file into memory, creates one big record
> - csv.WithChunk(nrows/10): creates 10 records
>
> also, it would be great to try to disentangle the memory usage of the
> "CSV reading part" from the "Table creation" one:
> - have some perf numbers w/o storing all these Records into a []Record slice,
> - have some perf numbers w/ only storing these Records into a []Record slice,
> - have some perf numbers w/ storing the records into the slice + creating the Table.
>
> hth,
> -s
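A minimal sketch of how those three measurements could be split apart is below. It assumes the same made-up two-column schema as in the earlier sketch, that each Record has to be Retain()-ed to outlive the reader, and that the table is built with array.NewTableFromRecords; treat it as a starting point rather than the exact benchmark being asked for.

    package main

    import (
        "log"
        "os"

        "github.com/apache/arrow/go/arrow"
        "github.com/apache/arrow/go/arrow/array"
        "github.com/apache/arrow/go/arrow/csv"
    )

    // run reads concurrent_streams.csv once:
    //   keep=false, buildTable=false : CSV reading only
    //   keep=true,  buildTable=false : reading + storing Records in a []array.Record slice
    //   keep=true,  buildTable=true  : reading + slice + Table creation
    func run(schema *arrow.Schema, chunk int, keep, buildTable bool) {
        f, err := os.Open("concurrent_streams.csv")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        r := csv.NewReader(f, schema, csv.WithChunk(chunk))
        defer r.Release()

        var recs []array.Record
        for r.Next() {
            if !keep {
                continue
            }
            rec := r.Record()
            rec.Retain() // keep the record alive past the next call to Next()
            recs = append(recs, rec)
        }
        if err := r.Err(); err != nil {
            log.Fatal(err)
        }

        if buildTable {
            tbl := array.NewTableFromRecords(schema, recs)
            defer tbl.Release()
        }
        for _, rec := range recs {
            rec.Release()
        }
    }

    func main() {
        // Same hypothetical schema as in the earlier sketch.
        schema := arrow.NewSchema([]arrow.Field{
            {Name: "stream_id", Type: arrow.BinaryTypes.String},
            {Name: "count", Type: arrow.PrimitiveTypes.Int64},
        }, nil)

        run(schema, 1000, false, false) // CSV reading only
        run(schema, 1000, true, false)  // + []Record slice
        run(schema, 1000, true, true)   // + Table creation
    }

Profiling each of the three calls separately (e.g. with the pprof sketch above) should show how much of the heap comes from the CSV reader itself versus holding on to the Records and building the Table.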