On Tue, Dec 4, 2018 at 10:23 PM Daniel Harper <djharpe...@gmail.com> wrote:
> Sorry, I've been away at re:Invent.
>
> Just tried out what's currently on master (with the chunked change that
> looks like it has been merged). I'll do the breakdown of the different
> parts later, but as a high-level look at just running the same script as
> described above, these are the numbers:
>
> https://docs.google.com/spreadsheets/d/1SE4S-wcKQ5cwlHoN7rQm7XOZLjI0HSyMje6q-zLvUHM/edit?usp=sharing
>
> Looks to me like the change has definitely helped, with memory usage
> dropping to around 300 MB, although the usage doesn't really change that
> much once the chunk size is > 1000.

good. you might want to try with a chunk size of -1 (this loads the whole
CSV file into memory in one fell swoop.)

also, there's this PR which should probably also reduce the memory
pressure:
- https://github.com/apache/arrow/pull/3073
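trying the various chunk sizes should just be a matter of tweaking the
WithChunk option, along these lines (a minimal, untested sketch; the schema
and column names are placeholders, only the file name comes from this
thread):

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/csv"
)

func main() {
	f, err := os.Open("concurrent_streams.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// placeholder schema: use the actual columns of the CSV file.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "ts", Type: arrow.PrimitiveTypes.Int64},
		{Name: "stream", Type: arrow.BinaryTypes.String},
	}, nil)

	// WithChunk(-1) reads the whole file into one big record;
	// WithChunk(n), for n > 0, yields records of n rows each.
	r := csv.NewReader(f, schema, csv.WithChunk(-1))
	defer r.Release()

	for r.Next() {
		fmt.Println(r.Record().NumRows())
	}
	if err := r.Err(); err != nil {
		log.Fatal(err)
	}
}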
cheers,
-s

> Daniel Harper
> http://djhworld.github.io
>
> On Fri, 23 Nov 2018 at 10:58, Sebastien Binet <bi...@cern.ch> wrote:
>
> > On Mon, Nov 19, 2018 at 11:29 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > That seems buggy then. There is only 4.125 bytes of overhead per
> > > string value on average (a 32-bit offset, plus a valid bit).
> > >
> > > On Mon, Nov 19, 2018 at 5:02 PM Daniel Harper <djharpe...@gmail.com> wrote:
> > > >
> > > > Uncompressed:
> > > >
> > > > $ ls -la concurrent_streams.csv
> > > > -rw-r--r--  1 danielharper  112M Nov 16 19:21 concurrent_streams.csv
> > > >
> > > > $ wc -l concurrent_streams.csv
> > > > 1007481 concurrent_streams.csv
> > > >
> > > > Daniel Harper
> > > > http://djhworld.github.io
> > > >
> > > > On Mon, 19 Nov 2018 at 21:55, Wes McKinney <wesmck...@gmail.com> wrote:
> > > >
> > > > > I'm curious how the file is only 100MB if it's producing ~6GB of
> > > > > strings in memory. Is it compressed?
> > > > >
> > > > > On Mon, Nov 19, 2018 at 4:48 PM Daniel Harper <djharpe...@gmail.com> wrote:
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > I've tried the new code and that seems to have shaved about 1GB
> > > > > > of memory off, so the heap is about 8.84GB now; here is the
> > > > > > updated pprof output: https://i.imgur.com/itOHqBf.png
> > > > > >
> > > > > > It looks like the majority of allocations are in the
> > > > > > memory.GoAllocator:
> > > > > >
> > > > > > (pprof) top
> > > > > > Showing nodes accounting for 8.84GB, 100% of 8.84GB total
> > > > > > Showing top 10 nodes out of 41
> > > > > >       flat  flat%   sum%        cum   cum%
> > > > > >     4.24GB 47.91% 47.91%     4.24GB 47.91%  github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
> > > > > >     2.12GB 23.97% 71.88%     2.12GB 23.97%  github.com/apache/arrow/go/arrow/memory.NewResizableBuffer (inline)
> > > > > >     1.07GB 12.07% 83.95%     1.07GB 12.07%  github.com/apache/arrow/go/arrow/array.NewData
> > > > > >     0.83GB  9.38% 93.33%     0.83GB  9.38%  github.com/apache/arrow/go/arrow/array.NewStringData
> > > > > >     0.33GB  3.69% 97.02%     1.31GB 14.79%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
> > > > > >     0.18GB  2.04% 99.06%     0.18GB  2.04%  github.com/apache/arrow/go/arrow/array.NewChunked
> > > > > >     0.07GB  0.78% 99.85%     0.07GB  0.78%  github.com/apache/arrow/go/arrow/array.NewInt64Data
> > > > > >     0.01GB  0.15%   100%     0.21GB  2.37%  github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
> > > > > >          0     0%   100%        6GB 67.91%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
> > > > > >          0     0%   100%     4.03GB 45.54%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve
> > > > > >
> > > > > > I'm a bit busy at the moment but I'll probably repeat the same
> > > > > > test on the other Arrow implementations (e.g. Java) to see if
> > > > > > they allocate a similar amount.
> >
> > I've implemented chunking over there:
> > - https://github.com/apache/arrow/pull/3019
> >
> > could you try with a couple of chunking values?
> > e.g.:
> > - csv.WithChunk(-1): reads the whole file into memory, creates one big
> >   record
> > - csv.WithChunk(nrows/10): creates 10 records
> >
> > also, it would be great to try to disentangle the memory usage of the
> > "CSV reading part" from the "Table creation" one:
> > - have some perf numbers w/o storing all these Records into a []Record
> >   slice,
> > - have some perf numbers w/ only storing these Records into a []Record
> >   slice,
> > - have some perf numbers w/ storing the records into the slice + creating
> >   the Table.
> >
> > hth,
> > -s
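PS: to make the three measurements quoted above concrete, one possible
skeleton is below (again an untested sketch: it reuses the csv.Reader from
the snippet higher up in this thread, and assumes the Table is built with
array.NewTableFromRecords):

package bench

import (
	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/csv"
)

// readOnly drains the reader without keeping anything alive:
// this measures the "CSV reading part" in isolation.
func readOnly(r *csv.Reader) int64 {
	var nrows int64
	for r.Next() {
		nrows += r.Record().NumRows()
	}
	return nrows
}

// readRecords also stores the records into a []Record slice.
// each record must be retained, since the reader releases it on
// the next call to Next.
func readRecords(r *csv.Reader) []array.Record {
	var recs []array.Record
	for r.Next() {
		rec := r.Record()
		rec.Retain()
		recs = append(recs, rec)
	}
	return recs
}

// readTable additionally creates the Table from the stored records.
func readTable(r *csv.Reader, schema *arrow.Schema) array.Table {
	recs := readRecords(r)
	defer func() {
		for _, rec := range recs {
			rec.Release()
		}
	}()
	return array.NewTableFromRecords(schema, recs)
}

comparing the heap profiles of these three variants should show how much
memory comes from the CSV decoding itself, from keeping the Records alive,
and from the Table creation.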