On Mon, Nov 19, 2018 at 11:29 PM Wes McKinney <wesmck...@gmail.com> wrote:
> That seems buggy then. There is only 4.125 bytes of overhead per
> string value on average (a 32-bit offset, plus a valid bit)
>
> On Mon, Nov 19, 2018 at 5:02 PM Daniel Harper <djharpe...@gmail.com> wrote:
> >
> > Uncompressed
> >
> > $ ls -la concurrent_streams.csv
> > -rw-r--r--  1 danielharper  112M Nov 16 19:21 concurrent_streams.csv
> >
> > $ wc -l concurrent_streams.csv
> >  1007481 concurrent_streams.csv
> >
> > Daniel Harper
> > http://djhworld.github.io
> >
> > On Mon, 19 Nov 2018 at 21:55, Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > I'm curious how the file is only 100MB if it's producing ~6GB of
> > > strings in memory. Is it compressed?
> > >
> > > On Mon, Nov 19, 2018 at 4:48 PM Daniel Harper <djharpe...@gmail.com> wrote:
> > > >
> > > > Thanks,
> > > >
> > > > I've tried the new code and that seems to have shaved about 1GB of memory
> > > > off, so the heap is about 8.84GB now, here is the updated pprof output
> > > > https://i.imgur.com/itOHqBf.png
> > > >
> > > > It looks like the majority of allocations are in the memory.GoAllocator
> > > >
> > > > (pprof) top
> > > > Showing nodes accounting for 8.84GB, 100% of 8.84GB total
> > > > Showing top 10 nodes out of 41
> > > >       flat  flat%   sum%        cum   cum%
> > > >     4.24GB 47.91% 47.91%     4.24GB 47.91%  github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
> > > >     2.12GB 23.97% 71.88%     2.12GB 23.97%  github.com/apache/arrow/go/arrow/memory.NewResizableBuffer (inline)
> > > >     1.07GB 12.07% 83.95%     1.07GB 12.07%  github.com/apache/arrow/go/arrow/array.NewData
> > > >     0.83GB  9.38% 93.33%     0.83GB  9.38%  github.com/apache/arrow/go/arrow/array.NewStringData
> > > >     0.33GB  3.69% 97.02%     1.31GB 14.79%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
> > > >     0.18GB  2.04% 99.06%     0.18GB  2.04%  github.com/apache/arrow/go/arrow/array.NewChunked
> > > >     0.07GB  0.78% 99.85%     0.07GB  0.78%  github.com/apache/arrow/go/arrow/array.NewInt64Data
> > > >     0.01GB  0.15%   100%     0.21GB  2.37%  github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
> > > >          0     0%   100%        6GB 67.91%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
> > > >          0     0%   100%     4.03GB 45.54%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve
> > > >
> > > > I'm a bit busy at the moment but I'll probably repeat the same test on the
> > > > other Arrow implementations (e.g. Java) to see if they allocate a similar
> > > > amount.

I've implemented chunking over there:
- https://github.com/apache/arrow/pull/3019

could you try with a couple of chunking values? e.g.:
- csv.WithChunk(-1): reads the whole file into memory, creates one big record
- csv.WithChunk(nrows/10): creates 10 records

also, it would be great to try to disentangle the memory usage of the
"CSV reading part" from the "Table creation" one:
- have some perf numbers w/o storing all these Records into a []Record slice,
- have some perf numbers w/ only storing these Records into a []Record slice,
- have some perf numbers w/ storing the records into the slice + creating
  the Table.

hth,
-s